5 Data Wrangling
Data you import will rarely be in a format necessary for analysis, and you will need to manipulate it before exploring the questions you are interested in. This may take more time than doing the analysis itself! In this section, we will learn some basic techniques for manipulating, managing, and wrangling with our data in R. Specifically, we will rely on the philosophy of “tidy data” as articulated by Hadley Wickham.
According to Wickham, data is “tidy” when it meets three key criteria:
Each observation is in a row.
Each variable is in a column.
Each value has its own cell.
Being observant of these criteria allows us to recognize when data is organized or unorganized. It also provides us a standardized schema and set of tools for cleaning up.
some of the most common ways that datasets are “messy:”
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
Find the cheat sheet HERE
5.1 R Data Wrangling Packages
While there are tons of R packages out there to help you work with data, we’re going to cover the packages and functions within those packages that you’ll absolutely want and need to work with when working with data.
5.1.1 dplyr
There is a package specifically designed for helping you wrangle your data. This package is called dplyr
and will allow you to easily accomplish many of the data wrangling tasks necessary. Like tidyr
, this package is a core package within the tidyverse, and thus it was loaded in for you when you ran library(tidyverse)
earlier. We will cover a number of functions that will help you wrangle data using dplyr
:
%>%
- pipe operator for chaining a sequence of operationsglimpse()
- get an overview of what’s included in datasetfilter()
- filter rowsselect()
- select, rename, and reorder columnsrename()
- rename columnsarrange()
- reorder rowsmutate()
- create a new columngroup_by()
- group variablessummarize()
- summarize information within a datasetleft_join()
- combine data across data frametally()
- get overall sum of values of specified column(s) or the number of rows of tibblecount()
- get counts of unique values of specified column(s) (shortcut ofgroup_by()
andtally()
)add_count()
- add values ofcount()
as a new columnadd_tally()
- add value(s) oftally()
as a new column
5.1.2 tidyr
We will also cover the tidyr
package. The main functions we’ll cover from tidyr
are:
unite()
- combine contents of two or more columns into a single columnseparate()
- separate contents of a column into two or more columns
5.1.3 janitor
The third package we’ll include here is the janitor
package. While not a core tidyverse package, this tidyverse-adjacent package provides tools for cleaning messy data. The main functions we’ll cover from janitor are:
clean_names()
- clean names of a data frametabyl()
- get a helpful summary of a variableget_dupes()
- identify duplicate observations
5.1.4 skimr
The final package we’ll discuss here is the skimr
package. This package provides a quick way to summarize a data.frame or tibble within the tidy data framework. We’ll discuss its most useful function here:
-
skim()
- summarize a data frame
If you have not already installed these packages, you’ll want to be sure these packages are are installed
install.packages("pacman") # Install pacman package
library("pacman") # Load pacman package
p_load(skimr, tidyr, janitor, dplyr)
NOTE: The package package provides tools to conveniently manage add-on packages in the R programming language.
After executing the previous code, we have loaded the four packages.
The
p_load
function also checks whether a package is installed already. If the package is not installed yet, it is installed automatically by thepacman
package.The
p_load
function basically replaces the Base R library, install.packages, and require functions and allows installing and loading numerous packages simultaneously using much less R code.
5.2 The Pipe Operator
Before we get into the important functions within dplyr
, it will be very useful to discuss what is known as the pipe operator.
The pipe operator looks like this in R: %>%
. Whenever you see the pipe %>%
, think of the word “then,”.
Generally, the pipe operator allows you to chain a number of different functions together in a particular order. If you wanted to take data frame A and carry out function B on it in R, you could depict this with an arrow pointing from A to B:
In base R syntax, it would be carried out by calling the function B on the data frame object A:
B(A)
Alternatively, you could use the pipe operator (%>%
):
A %>%
B
However, often you are not performing just one action on a data frame, but rather you are looking to carry out multiple functions.
Say that you want to take data frame A and carry out function B, then you want to take the output from that and then carry out function C. Subsequently you want to take the output of that and then carry out function D.
In R syntax, we would first apply function B to data frame A, then apply function C to this output, then apply function D to this output. This results in the following syntax that is hard to read because multiple calls to functions are nested within each other:
Alternatively, you could use the pipe operator. Each time you want take the output of one function and carry out something new on that output, you will use the pipe operator:
And, even more readable is when each of these steps is indented out onto its own individual line of code:
While both of the previous two code examples would provide the same output, the one below is more readable, which is a large part of why pipes are used. It makes your code more understandable to you and others.
Below we’ll use this pipe operator a lot. Remember, it takes output from the left hand side and feeds it into the function that comes after the pipe as the first argument.
5.3 Filtering Data
When working with a large dataset, you’re often interested in only working with a portion of the data at any one time.
For example, if you had data on people from ages 0 to 100 years old, but you wanted to ask a question that only pertained to children, you would likely want to only work with data from those individuals who were less than 18 years old.
To do this, you would want to filter your dataset to only include data from these select individuals.
For the examples below, we’ll be using a dataset from the ggplot2
package called msleep
. (You’ll learn more about this package in a later course on data visualization) For now, it’s a core tidyverse package so it’s loaded in along with the other tidyverse packages using library(tidyverse)
.)
This dataset includes sleep times and weights from a number of different mammals. It has 83 rows, with each row including information about a different type of animal, and 11 variables. As each row is a different animal and each column includes information about that animal, this is a wide dataset.
To get an idea of what variables are included in this data frame, you can use glimpse()
. This function summarizes how many rows there are (Observations
) and how many columns there are (Variables
). Additionally, it gives you a glimpse into the type of data contained in each column.
Specifically, in this dataset, we know that the first column is name
and that it contains a character vector (chr
) and that the first three entries are “Cheetah,” “Owl monkey,” and “Mountain beaver.” It works similarly to the base R summary()
function.
## Rows: 83
## Columns: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain be…
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "B…
## $ vore <chr> "carni", "omni", "herbi", "omni", "he…
## $ order <chr> "Carnivora", "Primates", "Rodentia", …
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated",…
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.…
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA,…
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7…
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3,…
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA…
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000,…
Glimpse of msleep dataset
5.3.1 Filtering Rows
If you were only interested in learning more about the sleep times of “Primates,” we could filter this dataset to include only data about those mammals that are also Primates. As we can see from glimpse()
, this information is contained within the order
variable. So to do this within R, we use the following syntax:
# filter to only include primates
filter(msleep, order == "Primates")
## # A tibble: 12 × 11
## name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA
## 2 Grivet Cerc… omni Prim… lc 10 0.7 NA
## 3 Patas … Eryt… omni Prim… lc 10.9 1.1 NA
## 4 Galago Gala… omni Prim… <NA> 9.8 1.1 0.55
## 5 Human Homo omni Prim… <NA> 8 1.9 1.5
## 6 Mongoo… Lemur herbi Prim… vu 9.5 0.9 NA
## 7 Macaque Maca… omni Prim… <NA> 10.1 1.2 0.75
## 8 Slow l… Nyct… carni Prim… <NA> 11 NA NA
## 9 Chimpa… Pan omni Prim… <NA> 9.7 1.4 1.42
## 10 Baboon Papio omni Prim… <NA> 9.4 1 0.667
## 11 Potto Pero… omni Prim… lc 11 NA NA
## 12 Squirr… Saim… omni Prim… <NA> 9.6 1.4 NA
## # … with 3 more variables: awake <dbl>, brainwt <dbl>,
## # bodywt <dbl>, and abbreviated variable names
## # ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle
## # A tibble: 12 × 11
## name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA
## 2 Grivet Cerc… omni Prim… lc 10 0.7 NA
## 3 Patas … Eryt… omni Prim… lc 10.9 1.1 NA
## 4 Galago Gala… omni Prim… <NA> 9.8 1.1 0.55
## 5 Human Homo omni Prim… <NA> 8 1.9 1.5
## 6 Mongoo… Lemur herbi Prim… vu 9.5 0.9 NA
## 7 Macaque Maca… omni Prim… <NA> 10.1 1.2 0.75
## 8 Slow l… Nyct… carni Prim… <NA> 11 NA NA
## 9 Chimpa… Pan omni Prim… <NA> 9.7 1.4 1.42
## 10 Baboon Papio omni Prim… <NA> 9.4 1 0.667
## 11 Potto Pero… omni Prim… lc 11 NA NA
## 12 Squirr… Saim… omni Prim… <NA> 9.6 1.4 NA
## # … with 3 more variables: awake <dbl>, brainwt <dbl>,
## # bodywt <dbl>, and abbreviated variable names
## # ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle
Note that; we are using the equality ==
comparison operator that you learned about in the previous topic about operators. Also note that we have used the pipe operator to feed the msleep
data frame into the filter()
function.
The reason i urge you to use the piping operators (%>%
)! is that the code is highly readable. This is why we use the pipe (%>%
)!
Now, we have a smaller dataset of only 12 mammals (as opposed to the original 83) and we can see that the order
variable column only includes “Primates.”
But, what if we were only interested in Primates who sleep more than 10 hours total per night? This information is in the sleep_total
column. Fortunately, filter()
also works on numeric variables.
To accomplish this, you would use the following syntax, separating the multiple filters you want to apply with a comma:
## # A tibble: 5 × 11
## name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Owl mon… Aotus omni Prim… <NA> 17 1.8 NA
## 2 Patas m… Eryt… omni Prim… lc 10.9 1.1 NA
## 3 Macaque Maca… omni Prim… <NA> 10.1 1.2 0.75
## 4 Slow lo… Nyct… carni Prim… <NA> 11 NA NA
## 5 Potto Pero… omni Prim… lc 11 NA NA
## # … with 3 more variables: awake <dbl>, brainwt <dbl>,
## # bodywt <dbl>, and abbreviated variable names
## # ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle
Note that we have used the “greater than” comparison operator with sleep_total
.
Now, we have a dataset focused in on only 5 mammals, all of which are primates who sleep for more than 10 hours a night total.
We can obtain the same result with the AND &
logical operator
## # A tibble: 5 × 11
## name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Owl mon… Aotus omni Prim… <NA> 17 1.8 NA
## 2 Patas m… Eryt… omni Prim… lc 10.9 1.1 NA
## 3 Macaque Maca… omni Prim… <NA> 10.1 1.2 0.75
## 4 Slow lo… Nyct… carni Prim… <NA> 11 NA NA
## 5 Potto Pero… omni Prim… lc 11 NA NA
## # … with 3 more variables: awake <dbl>, brainwt <dbl>,
## # bodywt <dbl>, and abbreviated variable names
## # ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle
Note that the number of columns hasn’t changed. All 11 variables are still shown in columns because the function filter()
filters on rows, not columns.
5.3.2 Selecting Columns
While filter()
operates on rows, it is possible to filter your dataset to only include the columns you’re interested in. To select columns so that your dataset only includes variables you’re interested in, you will use select()
.
Let’s start with the code we just wrote to only include primates who sleep a lot. What if we only want to include the first column (the name of the mammal) and the sleep information (included in the columns sleep_total
, sleep_rem
, and sleep_cycle
)?
We would do this by starting with the code we just used, adding another pipe, and using the function select()
. Within select()
, we specify which columns we want in our output.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_total, sleep_rem, sleep_cycle)
## # A tibble: 5 × 4
## name sleep_total sleep_rem sleep_cycle
## <chr> <dbl> <dbl> <dbl>
## 1 Owl monkey 17 1.8 NA
## 2 Patas monkey 10.9 1.1 NA
## 3 Macaque 10.1 1.2 0.75
## 4 Slow loris 11 NA NA
## 5 Potto 11 NA NA
Now, using select()
we see that we still have the five rows we filtered to before, but we only have the four columns specified using select()
.
Here you can hopefully see the power of the pipe operator to chain together several commands in a row. Without the pipe operator, the full command would look like this below:
select(filter(msleep, order == "Primates", sleep_total > 10), c("name", "sleep_total", "sleep_rem", "sleep_cycle"))
## # A tibble: 5 × 4 ## name sleep_total sleep_rem sleep_cycle ## <chr> <dbl> <dbl> <dbl> ## 1 Owl monkey 17 1.8 NA ## 2 Patas monkey 10.9 1.1 NA ## 3 Macaque 10.1 1.2 0.75 ## 4 Slow loris 11 NA NA ## 5 Potto 11 NA NA
> > YOU SEE!!!
Definitely harder to read. We’ll stick with the above approach!
5.4 Renaming Columns
select()
can also be used to rename columns. To do so, you use the syntax: new_column_name = old_column_name
within select
. For example, to select the same columns and rename them total
, rem
and cycle
, you would use the following syntax:
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, total = sleep_total, rem = sleep_rem, cycle = sleep_cycle)
## # A tibble: 5 × 4
## name total rem cycle
## <chr> <dbl> <dbl> <dbl>
## 1 Owl monkey 17 1.8 NA
## 2 Patas monkey 10.9 1.1 NA
## 3 Macaque 10.1 1.2 0.75
## 4 Slow loris 11 NA NA
## 5 Potto 11 NA NA
It’s important to keep in mind that when using select()
to rename columns, only the specified columns will be included and renamed in the output. If you, instead, want to change the names of a few columns but return all columns in your output, you’ll want to use rename()
.
For example, the following, returns a data frame with all 11 columns, where the column names for three columns specified within rename()
function have been renamed.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
rename(total = sleep_total, rem = sleep_rem, cycle = sleep_cycle)
## # A tibble: 5 × 11
## name genus vore order conse…¹ total rem cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Owl mon… Aotus omni Prim… <NA> 17 1.8 NA 7
## 2 Patas m… Eryt… omni Prim… lc 10.9 1.1 NA 13.1
## 3 Macaque Maca… omni Prim… <NA> 10.1 1.2 0.75 13.9
## 4 Slow lo… Nyct… carni Prim… <NA> 11 NA NA 13
## 5 Potto Pero… omni Prim… lc 11 NA NA 13
## # … with 2 more variables: brainwt <dbl>, bodywt <dbl>, and
## # abbreviated variable name ¹conservation
5.5 Reordering
In addition to filtering rows and columns, often, you’ll want the data arranged in a particular order. It may order the columns in a logical way, or it could be to sort the data so that the data are sorted by value, i.e Ascending OR Descending. All of this can be achieved with a few simple functions.
5.5.0.1 Reordering Columns
The select()
function is powerful. Not only will it filter and rename columns, but it can also be used to reorder your columns. Using our example from above, if you wanted sleep_rem
to be the first sleep column and sleep_total
to be the last column, all you have to do is reorder them within select()
. The output from select()
would then be reordered to match the order specified within select()
.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total)
## # A tibble: 5 × 4
## name sleep_rem sleep_cycle sleep_total
## <chr> <dbl> <dbl> <dbl>
## 1 Owl monkey 1.8 NA 17
## 2 Patas monkey 1.1 NA 10.9
## 3 Macaque 1.2 0.75 10.1
## 4 Slow loris NA NA 11
## 5 Potto NA NA 11
sleep_rem name
will be displayed first followed by sleep_rem
, sleep_cycle
, and sleep_total
, just as it was specified within select()
.
5.5.0.2 Reordering Rows
Rows can also be reordered. To reorder a variable in ascending order , you’ll want to use arrange()
. Continuing on from our example above, to now sort our rows by the amount of total sleep each mammal gets, we would use the following syntax:
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total) %>%
arrange(sleep_total)
## # A tibble: 5 × 4
## name sleep_rem sleep_cycle sleep_total
## <chr> <dbl> <dbl> <dbl>
## 1 Macaque 1.2 0.75 10.1
## 2 Patas monkey 1.1 NA 10.9
## 3 Slow loris NA NA 11
## 4 Potto NA NA 11
## 5 Owl monkey 1.8 NA 17
While arrange
sorts variables in ascending order, it’s also possible to sort in descending order. To do this you just use desc()
with the following syntax:
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total) %>%
arrange(desc(sleep_total))
## # A tibble: 5 × 4
## name sleep_rem sleep_cycle sleep_total
## <chr> <dbl> <dbl> <dbl>
## 1 Owl monkey 1.8 NA 17
## 2 Slow loris NA NA 11
## 3 Potto NA NA 11
## 4 Patas monkey 1.1 NA 10.9
## 5 Macaque 1.2 0.75 10.1
By putting sleep_total
within desc()
, arrange()
will now sort your data from the primates with the longest total sleep to the shortest.
arrange()
can also be used to order non-numeric variables. For example, arrange()
will sort character vectors alphabetically.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total) %>%
arrange(name)
## # A tibble: 5 × 4
## name sleep_rem sleep_cycle sleep_total
## <chr> <dbl> <dbl> <dbl>
## 1 Macaque 1.2 0.75 10.1
## 2 Owl monkey 1.8 NA 17
## 3 Patas monkey 1.1 NA 10.9
## 4 Potto NA NA 11
## 5 Slow loris NA NA 11
If you would like to reorder rows based on information in multiple columns, you can specify them separated by commas. This is useful if you have repeated labels in one column and want to sort within a category based on information in another column.
In the example here, if there were repeated primates, this would sort the repeats based on their total sleep.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total) %>%
arrange(name, sleep_total)
## # A tibble: 5 × 4
## name sleep_rem sleep_cycle sleep_total
## <chr> <dbl> <dbl> <dbl>
## 1 Macaque 1.2 0.75 10.1
## 2 Owl monkey 1.8 NA 17
## 3 Patas monkey 1.1 NA 10.9
## 4 Potto NA NA 11
## 5 Slow loris NA NA 11
5.6 Creating New Columns
You will often find when working with data that you need an additional column.
For example, if you had two datasets you wanted to combine, you may want to make a new column in each dataset called dataset
. In one dataset you may put datasetA
in each row. In the second dataset, you could put datasetB
. This way, once you combined the data, you would be able to keep track of which dataset each row came from originally.
More often, however, you’ll likely want to create a new column that calculates a new variable based on information in a column you already have. For example, in our mammal sleep dataset, sleep_total
is in hours. What if you wanted to have that information in minutes? You could create a new column with this very information!
The function mutate()
was made for all of these new-column-creating situations. This function has a lot of capabilities. We’ll cover the basics here.
Returning to the msleep
dataset, after filtering and re-ordering, we can create a new column with mutate()
. Within mutate()
, we will calculate the number of minutes each mammal sleeps by multiplying the number of hours each animal sleeps by 60 minutes.
msleep %>%
filter(order == "Primates", sleep_total > 10) %>%
select(name, sleep_rem, sleep_cycle, sleep_total) %>%
arrange(name) %>%
mutate(sleep_total_min = sleep_total * 60)
## # A tibble: 5 × 5
## name sleep_rem sleep_cycle sleep_total sleep_tot…¹
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Macaque 1.2 0.75 10.1 606
## 2 Owl monkey 1.8 NA 17 1020
## 3 Patas monkey 1.1 NA 10.9 654
## 4 Potto NA NA 11 660
## 5 Slow loris NA NA 11 660
## # … with abbreviated variable name ¹sleep_total_min
5.7 Tasks
5.7.1 Exercise 1:
Basic Data Manipulation
In this exercise, you’ll work with a toy data set showing product purchases from an electronics store. While the data set is small and simple, it still illustrates many of the challenges you have to address in real-world data wrangling!
Clean this data set to make it easier to visualize and analyze.
download the data from HERE
Problem 1
Download the data, setup you workspace, i.e. set up working directory, add the dataset to the directory, create a project and r script or r markdown, install required packages, and Load the data in RStudio
dt <- read_csv("refine_original_toy_data")
Problem 2
1: Clean up brand names Clean up the ‘company’ column so all of the misspellings of the brand names are standardized. For example, you can transform the values in the column to be: philips, akzo, van houten and unilever (all lowercase).
dt <- dt %>%
mutate(company = ifelse(grepl("^phil|^fil|^phl", company,
ignore.case = TRUE), "philips", company)) %>%
mutate(company = ifelse(grepl("^ak", company, ignore.case = TRUE),
"akzo", company)) %>%
mutate(company = ifelse(grepl("^van", company, ignore.case = TRUE),
"van_houten", company)) %>%
mutate(company = ifelse(grepl("^uni", company, ignore.case = TRUE),
"unilever", company))
Problem 3
2: Separate product code and number Separate the product code and product number into separate columns i.e. add two new columns called product_code and product_number, containing the product code and number respectively.
dt <- dt %>%
separate(`Product code / number`, into = c("product_code",
"Product_number"), sep = "-")
Problem 4
3: Add product categories You learn that the product codes actually represent the following product categories: p = Smartphone v = TV x = Laptop q = Tablet In order to make the data more readable, add a column with the product category for each record.
dt <- dt %>%
mutate(product_category = ifelse(product_code == "p", "Smartphone",
"")) %>%
mutate(product_category = ifelse(product_code == "x", "Laptop",
product_category)) %>%
mutate(product_category = ifelse(product_code == "v", "TV",
product_category)) %>%
mutate(product_category = ifelse(product_code == "q", "Tablet",
product_category))
Problem 5
4: Add full address for geocoding You’d like to view the customer information on a map. In order to do that, the addresses need to be in a form that can be easily geocoded. Create a new column full_address that concatenates the three address fields (address, city, country), separated by commas.
Problem 6
5: Create dummy variables for company and product category Both the company name and product category are categorical variables i.e. they take only a fixed set of values. In order to use them in further analysis you need to create dummy variables. Create dummy binary variables for each of them with the prefix company_ and product_ i.e., *Add four binary (1 or 0) columns for company: company_philips, company_akzo, company_van_houten and company_unilever.
dt <- dt %>%
mutate(company_philips = ifelse(company == "philips", 1,
0)) %>%
mutate(company_akzo = ifelse(company == "akzo", 1, 0)) %>%
mutate(company_van_houten = ifelse(company == "van_houten",
1, 0)) %>%
mutate(company_unilever = ifelse(company == "unilever", 1,
0))
Problem 7
6: Add four binary (1 or 0) columns for product category: product_smartphone, product_tv, product_laptop and product_tablet.
5.7.2 Exercise 2
The Challenge
5.7.2.1 Problem Statement
Policy Priority: The Department of Education is focused on ensuring that parents, students, and policymakers are able to use its publicly available data to take meaningful action to improve outcomes.Supporting Decision-Making for Student Sub-populations and their Families.
Problem: Certain mobile or disconnected student populations entering or reentering the community could greatly benefit from data and resources to support their wellbeing and success. Such students and families often lack information that is necessary to distinguish between their school options, access services, and identify affordable housing near high-quality school and in safe neighborhoods that have access to transit and employment.
5.7.2.2 Use Case
Maria is a 25-year-old US Army veteran, newly returned to the civilian workforce. She has recently completed a six-year commitment with the Army. During her time in the Army, she worked in supply management and logistics. She has decided to pursue a degree in Management Systems and Information Technology.
Maria has asked you to use your skill with data to help her search for the best school for her. She is willing to relocate anywhere in the continental United States, but she has a few criteria that her ideal schools must satisfy:
safety (low crime).
urban -- Maria wants to live the big city life.
start-ups -- the school should be in a metropolitan area that ranks highly in entrepreneurialism (she plans to find an internship at a startup while she studies).
Maria would like you to help her narrow down her search to a list of schools that she can investigate more closely before making her decision.
5.7.2.3 Your Task
Produce a dataset of schools which satisfy all of Maria’s criteria
Rank them from best to worst according to the same criteria.
Maria’s schools must:
be in an urban/metropolitan area.
be in a city that ranks 75th percentile or higher on Kauffman’s start-up rankings.
be below 50th percentile in overall crime.
offer a 2-year or 4-year degree in Information Technology/Science.
All the data you need is provided HERE (the respective data Dectionaries and descriptions in pdf files) with the exception of the entrepreneurial data, which can be accessed at http://www.kauffman.org/microsites/kauffman-index/about/archive/kauffman-index-of-entrepreneurial-activity-data-files (Metro Area Components Data 2015)
Faluire to download the data, contact the Author HERE
5.7.2.4 Tips:
Read the data dictionaries to figure out what the variables mean and which ones you will need to use.
Eliminate the columns that you dont need.
Look for suitable columns to join the tables on.
Perform any cleaning and standardization needed to facilitate the joins.
Engineer a summary variable for school crime so that we can compare schools by levels of crime overall.
Eliminate from the data all the data points that fail to satisfy Maria’s criteria.
Engineer a method for ranking the schools in consideration of all of Maria’s criteria taken together.