4 Working With Data
You can download demo data files of various types from the link below:
Click here to open the datasets folder
4.1 Importing Data into R
4.1.1 Import flat files (.csv, .txt)
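Base R provides three related functions for flat files:
- read.table(): the main function
- read.csv(): a wrapper around read.table() for CSV files
- read.delim(): a wrapper around read.table() for tab-delimited files
The code below uses the readr package equivalents: read_csv(), read_delim() and read_tsv().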
# install the readr package (only needed once)
install.packages("readr")
# load the readr package
library(readr)
# read a CSV file using read_csv()
df1 <- read_csv("data/demo csv data.csv", show_col_types = FALSE)
# read a delimited text file; add col_names = FALSE if the file has no header row
df2 <- read_delim("data/demo data text.txt")
# or, if the file is tab-delimited, use
df3 <- read_tsv("data/demo data text.txt")
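The base R functions listed at the start of this section can be tried without any extra package. The following is a minimal sketch, assuming two small made-up files written to a temporary folder; the file contents and the object names (csv_path, tsv_path, base_csv, base_tbl, base_tsv) are hypothetical and not part of the demo datasets.

```r
# Write a small comma-separated demo file (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,90", "2,Ben,85"), csv_path)

# read.csv() presets header = TRUE and sep = "," for you
base_csv <- read.csv(csv_path)

# The same file read with an explicit read.table() call
base_tbl <- read.table(csv_path, header = TRUE, sep = ",")

# Write and read a tab-delimited demo file with read.delim()
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore", "1\tAnn\t90", "2\tBen\t85"), tsv_path)
base_tsv <- read.delim(tsv_path)

# Inspect the imported data frame
str(base_csv)
```

For a file without a header row, header = FALSE can be passed to any of the three functions.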
4.1.2 Import Excel data file
# install package
install.packages("readxl")
# load package
library(readxl)
# by default, it loads the first sheet
excel_data <- read_excel("data/demo excel data.xlsx")
# if the Excel workbook has several sheets and you want to
# load a specific one, e.g. sheet two
excel_data <- read_excel("data/demo excel data.xlsx", sheet = 2)
4.1.3 Import STATA data file
install.packages("readstata13")
library(readstata13)
stata_data <- read.dta13("data/demo stata data.dta", generate.factors = TRUE)
4.1.4 Importing R data file
load("data/demo R data.rda")
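To see what load() does, it helps to save an object first and then restore it. The following is a minimal sketch, assuming a made-up data frame called scores and temporary file paths; it also shows saveRDS() and readRDS(), a base R alternative for storing a single object that is not covered in the text above.

```r
# A small made-up data frame (hypothetical example object)
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# Save it to an .rda file, remove it, then restore it with load()
rda_path <- tempfile(fileext = ".rda")
save(scores, file = rda_path)
rm(scores)
loaded_names <- load(rda_path)  # restores `scores` under its original name
loaded_names                    # names of the objects that were restored

# Alternative: saveRDS()/readRDS() store one object and let you
# choose the name when reading it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
scores_copy <- readRDS(rds_path)
```

Note that load() returns the names of the restored objects, not the objects themselves, so its result should not be assigned in place of the data.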
4.1.5 Importing Web data file
Data on the web comes in several modes, for example:
files that you can download
APIs
content such as HTML tables
custom data browsers
and more.
However, this section keeps to the basics. If you are interested in the other options, contact the author for materials or code that may not be in this book.
4.1.6 Inbuilt Datasets
To see the list of pre-loaded datasets, call the function data().
To list the datasets in a specific R package, pass the package name to data():
# list available datasets in ggplot2 package
data(package = "ggplot2")
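The result of data() can also be stored and inspected like any other object. The following is a minimal sketch, assuming the datasets package that ships with R (used here instead of ggplot2 so that nothing needs to be installed); the object names available and dataset_names are hypothetical.

```r
# Load a copy of a built-in dataset into the workspace
data("mtcars")

# Store the listing of datasets in the 'datasets' package
available <- data(package = "datasets")

# The listing keeps the dataset names in the "Item" column of $results
dataset_names <- available$results[, "Item"]
head(dataset_names)
```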
4.1.7 Other file formats
File type | File extension | Function | Package |
---|---|---|---|
Database file | .dbf | read.dbf() | foreign |
SPSS data file | .sav | read_spss() | haven |
SQL database | | | DBI, RMySQL, RPostgreSQL, ROracle (depending on the specific database) |
SAS data file | .sas7bdat | read_sas() | haven |
SAS transport file | .xpt | read_xpt() | haven |
Matlab file | .mat | readMat() | R.matlab |
4.2 Explore the dataset
R has various functions for exploring a data frame. These functions help you understand the structure and contents of your data.
Here are some common ones:
- head(): shows the first few rows of the data frame
- tail(): shows the last few rows of the data frame
- str(): displays the structure of the data frame
- summary(): provides summary statistics for each variable in the data frame
- nrow(): returns the number of rows in the data frame
- ncol(): returns the number of columns in the data frame
- colnames(): returns the names of the columns in the data frame
- row.names(): returns the names of the rows in the data frame
- unique(): returns the unique values in a column of the data frame
- View(): shows the dataset in spreadsheet form (note the capital V)
- dim(): returns the dimensions of the data frame as a vector of two integers, the number of rows and the number of columns
- is.na(): returns a logical matrix indicating whether each value in the data frame is missing or not
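The functions listed above can be tried on the built-in iris data frame. The following is a minimal sketch; the object names are hypothetical, and colSums() is a base R helper added here only to count the missing values per column.

```r
# First and last three rows
head(iris, 3)
tail(iris, 3)

# Structure and per-variable summary
str(iris)
summary(iris)

# Size of the data frame
dims <- dim(iris)      # rows first, then columns
n_rows <- nrow(iris)
n_cols <- ncol(iris)

# Column names and the distinct values of one column
col_names <- colnames(iris)
species_values <- unique(iris$Species)

# is.na() marks every cell; colSums() turns that into a count per column
missing_per_column <- colSums(is.na(iris))
missing_per_column
```

View(iris) is left out of the sketch because it opens an interactive viewer and is best run from the RStudio console.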
For a better visual display of the results, additional packages such as skimr, gtsummary and knitr (for its kable() function) can be used, as shown in later sections of this guide. Below is a sneak peek of a basic one.
install.packages("skimr")
library(skimr)
# get a summary of all the variables; an alternative to str()
skim(iris)
4.3 Pull basic statistics from your data
4.3.1 For continuous variables:
- mean(): calculates the mean of a numeric vector, e.g. mean(numeric_vector)
- median(): calculates the median of a numeric vector, e.g. median(numeric_vector)
- sd(): calculates the standard deviation of a numeric vector, e.g. sd(numeric_vector)
- var(): calculates the variance of a numeric vector
- min(): returns the minimum value of a numeric vector
- max(): returns the maximum value of a numeric vector
- range(): returns the range of a numeric vector as a vector of length 2 containing the minimum and maximum values
- quantile(): calculates the quantiles of a numeric vector
- summary(): provides summary statistics for a numeric vector, including the minimum, 1st quartile, median, mean, 3rd quartile, and maximum

Simple plots using the base package:
# Histogram:
hist(data$variable_name)
# Boxplot:
boxplot(data$variable_name)
# Density Plot:
plot(density(data$variable_name))
# Scatterplot:
plot(data$variable_name1, data$variable_name2)
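The summary functions and base plots in this section can be tried on a real column. The following is a minimal sketch, assuming the mpg and wt columns of the built-in mtcars data frame; the object names (x, stats, x_missing, h) are hypothetical.

```r
# A continuous variable from a built-in dataset
x <- mtcars$mpg

# Collect the basic statistics in one named vector
stats <- c(
  mean     = mean(x),
  median   = median(x),
  sd       = sd(x),
  variance = var(x),
  min      = min(x),
  max      = max(x)
)
round(stats, 2)

# Range, quartiles and the six-number summary
range(x)
quantile(x, probs = c(0.25, 0.5, 0.75))
summary(x)

# Missing values make these functions return NA unless na.rm = TRUE is set
x_missing <- c(x, NA)
mean(x_missing)
mean(x_missing, na.rm = TRUE)

# The same variable with the base plotting functions
h <- hist(x, main = "Miles per gallon")  # hist() also returns the bin counts
boxplot(x, main = "Miles per gallon")
plot(density(x), main = "Density of miles per gallon")
plot(mtcars$wt, x, xlab = "Weight", ylab = "Miles per gallon")
```

When run as a script rather than in RStudio, the plots are written to a file instead of being shown on screen.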
For categorical/character variables:
- table(): creates a contingency table of the frequencies of the levels of a categorical variable
- prop.table(): converts a contingency table of frequencies to a contingency table of proportions
- summary(): for a factor variable, returns the number of observations in each level

A simple bar plot of the table() output can be drawn with barplot() from the base package.
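The following is a minimal sketch of these functions, assuming the cyl column (number of cylinders) of the built-in mtcars data frame converted to a factor; the object names (cylinders, counts, props) are hypothetical.

```r
# Treat the number of cylinders as a categorical variable
cylinders <- factor(mtcars$cyl)

# Frequency of each level
counts <- table(cylinders)
counts

# Share of each level; the proportions sum to 1
props <- prop.table(counts)
round(props, 2)

# For a factor, summary() returns the count per level
summary(cylinders)

# Bar plot of the frequencies using the base package
barplot(counts, xlab = "Cylinders", ylab = "Number of cars")
```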
4.4 Exporting Data from R
save(data_object, file = "my_data.rda") # exporting an R data object
write.csv(data_object, file = "my_data.csv") # exporting a CSV data file
write.table(data_object, file = "my_data.txt") # exporting a text data file
writexl::write_xlsx(data_object, path = "my_data.xlsx") # exporting an Excel data file
jsonlite::write_json(data_object, path = "my_data.json") # exporting a JSON data file
xml2::write_xml(xml_object, file = "my_data.xml") # exporting an XML file (xml_object must be an xml2 document, not a data frame)
haven::write_xpt(data_object, path = "my_data.xpt") # exporting a SAS transport file
haven::write_dta(data_object, path = "my_data.dta") # exporting a STATA data file
haven::write_sav(data_object, path = "my_data.sav") # exporting an SPSS data file
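A quick way to check an export is to read the file back in and compare it with the original. The following is a minimal sketch, assuming a made-up data_object and files written to the temporary folder; the object names reimported and with_rownames are hypothetical.

```r
# A small made-up data frame to export
data_object <- data.frame(
  id    = 1:3,
  group = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5)
)

# Export without row names so the file holds only the data columns
csv_file <- file.path(tempdir(), "my_data.csv")
write.csv(data_object, file = csv_file, row.names = FALSE)

# Read the file back and compare it with the original
reimported <- read.csv(csv_file)
all.equal(data_object, reimported)

# With the default row.names = TRUE, the row names are written as an
# extra first column, which read.csv() names "X" on re-import
csv_file2 <- file.path(tempdir(), "my_data_rownames.csv")
write.csv(data_object, file = csv_file2)
with_rownames <- read.csv(csv_file2)
names(with_rownames)
```

Setting row.names = FALSE is usually the safer choice when the file will be opened in other software.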
4.5 Tasks
4.5.1 Exercise 1
Import a CSV file named “sales_data.csv” into RStudio.
NOTE: You can download the sales data to your computer from this link: Click here to download the Sales Data
Export the data in the “sales_data” object as a CSV file named “sales_data_export.csv”.
Using the “summary” function, extract basic statistics (mean, median, minimum, maximum) for the “sales_data” object.
Load a new data file named “customer_data.xlsx” into RStudio.
Explore the data by using the “head” and “tail” functions to view the first and last few rows of the data.
Use the “str” function to display the structure of the “customer_data” object.
Pull basic statistics (mean, median, minimum, maximum) for the “customer_data” object using the “summary” function.
Export the data in the “customer_data” object as a CSV file named “customer_data_export.csv”.
4.5.2 Exercise 2
Import the inbuilt iris data into RStudio.
Using the “plot” function, create a scatter plot of the “Sepal.Length” and “Sepal.Width” columns in the “iris” object.
Using the “boxplot” function, create a box plot of the “Petal.Length” column in the “iris” object, grouped by the “Species” column.
Using the “hist” function, create a histogram of the “Petal.Width” column in the “iris” object, grouped by the “Species” column.
Using the “plot” function, create a line plot of the “Sepal.Length” column in the “iris” object, grouped by the “Species” column.