4 Working With Data

You can download demo data files of various types from the link below:

CLICK here to open the datasets folder

4.1 Importing Data into R

4.1.1 Import flat files (.csv, .txt)

# install the readr package
install.packages("readr")
# load the readr package
library(readr)
# read a csv file using read_csv()
df1 <- read_csv("data/demo csv data.csv", show_col_types = FALSE)
# read a delimited txt file; add col_names = FALSE if the file
# does not have a header row
df2 <- read_delim("data/demo data text.txt")
# Or, for a tab-separated file, use
df3 <- read_tsv("data/demo data text.txt")
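
The header-less case mentioned above can be sketched with base R alone, which avoids any dependence on the demo files. The file contents and the column names (id, name, age) below are made up for illustration.

```r
# Hypothetical two-row, tab-separated file without a header line,
# written to a temporary path so the example is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("1\tAlice\t23", "2\tBob\t31"), path)

# header = FALSE stops R from treating the first row as column names;
# col.names supplies names of our own choosing
people <- read.table(path, sep = "\t", header = FALSE,
                     col.names = c("id", "name", "age"),
                     stringsAsFactors = FALSE)
print(people)

# readr equivalent: read_tsv(path, col_names = c("id", "name", "age"))
```

Without header = FALSE (or col_names = FALSE in readr), the first data row would be consumed as column names and silently dropped from the data.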

4.1.2 Import Excel data file

# install package
install.packages("readxl")
# load package
library(readxl)
# by default, it loads the first sheet
excel_data <- read_excel("data/demo excel data.xlsx")
# if the Excel workbook has many sheets and you want to
# load a specific one, e.g. sheet two
excel_data <- read_excel("data/demo excel data.xlsx", sheet = 2)
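
When the sheet layout of a workbook is unknown, it can help to list the sheet names first. The sketch below uses the demo workbook that ships with readxl rather than the file from this guide, so the sheet name "mtcars" belongs to that demo workbook.

```r
library(readxl)

# readxl_example() returns the path to a demo workbook shipped with readxl
path <- readxl_example("datasets.xlsx")

# list the sheet names before choosing one
sheets <- excel_sheets(path)
print(sheets)

# sheets can be selected by name as well as by position
mtcars_sheet <- read_excel(path, sheet = "mtcars")
head(mtcars_sheet)
```

Selecting a sheet by name tends to be more robust than selecting by position, because the code keeps working if someone reorders the sheets.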

4.1.3 Import STATA data file

install.packages("readstata13")
library(readstata13)
stata_data <- read.dta13("data/demo stata data.dta", generate.factors = TRUE)

4.1.4 Importing R data file

load("data/demo R data.rda")
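
Note that load() does not return the data; it restores the saved objects into the workspace under the names they had when saved. A minimal round trip, using a made-up data frame called scores, illustrates this:

```r
scores <- data.frame(id = 1:3, score = c(90, 85, 77))  # hypothetical data
path <- tempfile(fileext = ".rda")
save(scores, file = path)

rm(scores)                  # remove the object from the workspace
loaded_names <- load(path)  # load() restores objects under their saved names
print(loaded_names)         # character vector of the restored object names
print(scores)               # the data frame is available again
```

Capturing the return value of load() is a convenient way to find out what an unfamiliar .rda file contains.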

4.1.5 Importing Web data file

Data on the web comes in several modes, for example:

  1. files that you can download

  2. APIs

  3. content such as HTML tables

  4. custom data browsers

  5. and more.

However, for this section, let us keep it basic. If you are interested in the other options, contact the author for materials or code that might not be in this book.

4.1.6 Inbuilt Datasets

To see the list of pre-loaded datasets, type the function data()

# Loading
data(mtcars)
# learn more about the data
?mtcars
# OR
help(mtcars)

To get a list of the datasets in a specific R package, pass the package argument to data():

# list available datasets in ggplot2 package
data(package = "ggplot2")
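
The listing can also be captured as an object and searched, which is handy when a package ships many datasets. The sketch below uses the datasets package, which comes with every R installation; the variable names are illustrative.

```r
# data() returns an object whose "results" element is a character matrix
available <- data(package = "datasets")$results

# keep only the dataset name and its short description
catalogue <- as.data.frame(available[, c("Item", "Title")],
                           stringsAsFactors = FALSE)
head(catalogue)

# check whether a particular dataset is listed
has_mtcars <- "mtcars" %in% catalogue$Item
print(has_mtcars)
```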

4.1.7 Other file formats

File type            Extension    Function                     Package
Database file        .dbf         read.dbf()                   foreign
SPSS data file       .sav         read_spss()                  haven
SQL database         n/a          dbConnect(), dbGetQuery()    DBI, plus RMySQL, RPostgreSQL, or ROracle depending on the specific database
SAS data file        .sas7bdat    read_sas()                   haven
SAS transport file   .xpt         read_xpt()                   haven
MATLAB file          .mat         readMat()                    R.matlab
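
For the SQL database row, the usual pattern is connect, query, disconnect. The sketch below assumes the RSQLite package is installed (it is not one of the drivers listed above) because it allows an in-memory database with no server setup; with another driver only the dbConnect() call should need to change. The table name "cars" is made up.

```r
library(DBI)

# Assumption: RSQLite is installed; ":memory:" creates a throwaway database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a built-in data frame into the database as a table named "cars"
dbWriteTable(con, "cars", mtcars)

# run a SQL query and get the result back as a data frame
heavy_cars <- dbGetQuery(con, "SELECT mpg, cyl, wt FROM cars WHERE wt > 5")
print(heavy_cars)

# close the connection when finished
dbDisconnect(con)
```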

4.2 Explore the dataset

R provides various functions for exploring a data frame. They help you understand the structure and contents of your data.

Here are some common functions.

  1. head() - shows the first few rows of the data frame
  2. tail() - shows the last few rows of the data frame
  3. str() - displays the structure of the data frame
  4. summary() - provides summary statistics for each variable in the data frame
  5. nrow() - returns the number of rows in the data frame
  6. ncol() - returns the number of columns in the data frame
  7. colnames() - returns the names of the columns in the data frame
  8. row.names() - returns the names of the rows in the data frame
  9. unique() - returns unique values in a column of the data frame
  10. View() - shows the dataset in spreadsheet form (note the capital V)
  11. dim() - returns the dimensions of the data frame as a vector of two integers, the number of rows and the number of columns
  12. is.na() - returns a logical matrix (or a logical vector, for a single column) indicating whether each value is missing or not
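
As a quick illustration, most of the functions listed above can be tried on the built-in iris dataset. The variable names on the left-hand side are only for illustration.

```r
data(iris)

head(iris, 3)          # first three rows
tail(iris, 3)          # last three rows
str(iris)              # column types and a preview of the values
dims <- dim(iris)      # c(number of rows, number of columns)
print(dims)
colnames(iris)         # column names
species <- unique(iris$Species)             # distinct values in one column
print(species)
missing_per_column <- colSums(is.na(iris))  # count of NA values per column
print(missing_per_column)
```

Wrapping is.na() in colSums() is a common way to turn the cell-by-cell logical result into one missing-value count per column.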

For a more readable display of the results, additional packages such as skimr, gtsummary, and knitr (for kable()) can be used, as shown in later sections of this guide. Below is a sneak peek of a basic one.

install.packages("skimr")
library(skimr)
# get a summary of all the variables, an alternative to str()
skim(iris)

4.3 Pull basic statistics from your data

4.3.1 For continuous variables:

  1. mean() - calculates the mean of a numeric vector

mean(numeric_vector)

  2. median() - calculates the median of a numeric vector

median(numeric_vector)

  3. sd() - calculates the standard deviation of a numeric vector

sd(numeric_vector)

  4. var() - calculates the variance of a numeric vector
  5. min() - returns the minimum value of a numeric vector
  6. max() - returns the maximum value of a numeric vector
  7. range() - returns the range of a numeric vector as a vector of length 2 containing the minimum and maximum values
  8. quantile() - calculates the quantiles of a numeric vector
  9. summary() - provides summary statistics for a numeric vector, including the minimum, 1st quartile, median, mean, 3rd quartile, and maximum

  10. Simple plots using the base package

# Histogram:
hist(data$variable_name)
# Boxplot:
boxplot(data$variable_name)
# Density Plot:
plot(density(data$variable_name))
# Scatterplot:
plot(data$variable_name1, data$variable_name2)
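
To see the summary functions above on real values, here is a small sketch using the mpg column of the built-in mtcars dataset in place of the numeric_vector placeholder. The variable names are illustrative.

```r
mpg <- mtcars$mpg   # miles per gallon, a numeric vector

mpg_mean   <- mean(mpg)
mpg_median <- median(mpg)
mpg_sd     <- sd(mpg)
mpg_range  <- range(mpg)                          # minimum and maximum
mpg_quart  <- quantile(mpg, c(0.25, 0.5, 0.75))   # quartiles

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd))
print(mpg_range)
print(mpg_quart)

# na.rm = TRUE drops missing values, which otherwise make the result NA
mean_without_na <- mean(c(mpg, NA), na.rm = TRUE)
print(mean_without_na)
```

Most of these functions return NA as soon as the vector contains a missing value, so na.rm = TRUE is worth remembering when working with imported data.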

4.3.2 For categorical/character variables:

  1. table() - creates a contingency table of the frequencies of the levels of a categorical variable
  2. prop.table() - converts a contingency table of frequencies to a contingency table of proportions
  3. summary() - for a factor variable, returns the frequency of each level; for a character variable, it reports only the length, class, and mode

  4. Simple plots using the base package

# Bar Plot:
barplot(table(data$variable_name))
# Pie Chart:
pie(table(data$variable_name))
# Stacked Bar Plot:
barplot(table(data$variable_name, data$grouping_variable), col = c("red",
    "blue"))

4.4 Exporting Data from R

save(data_object, file = "my_data.rda")  # exporting R data object
write.csv(data_object, file = "my_data.csv")  # exporting csv data file
write.table(data_object, file = "my_data.txt")  # exporting text data file
writexl::write_xlsx(data_object, path = "my_data.xlsx")  # exporting excel data file
jsonlite::write_json(data_object, path = "my_data.json")  # exporting JSON data file
# write_xml() expects an xml2 document (e.g. one created with xml2::read_xml()),
# not a data frame; xml_object is a placeholder for such a document
xml2::write_xml(xml_object, file = "my_data.xml")  # exporting to xml file
haven::write_xpt(data_object, path = "my_data.xpt")  # exporting SAS transport file
haven::write_dta(data_object, path = "my_data.dta")  # exporting STATA data file
haven::write_sav(data_object, path = "my_data.sav")  # exporting SPSS data file
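
A quick way to check an export is to read the file back and compare it with the original. The sketch below does this for a CSV file, using the built-in iris dataset and a temporary file path in place of data_object and "my_data.csv".

```r
path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row names as an extra first column
write.csv(iris, file = path, row.names = FALSE)

# read the file back; stringsAsFactors = TRUE restores Species as a factor
iris_back <- read.csv(path, stringsAsFactors = TRUE)

print(dim(iris_back))
print(all.equal(iris, iris_back))
```

By default write.csv() also writes the row names, which then come back as an unnamed extra column on import; setting row.names = FALSE is a common way to avoid that.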

4.5 Tasks

4.5.1 Exercise 1

  1. Import a CSV file named “sales_data.csv” into RStudio.

    NOTE: You can download the sales data to your computer here: Download the Sales Data

  2. Export the data in the “sales_data” object as a CSV file named “sales_data_export.csv”.

  3. Using the “summary” function, extract basic statistics (mean, median, minimum, maximum) for the “sales_data” object.

  4. Load a new data file named “customer_data.xlsx” into RStudio.

  5. Explore the data by using the “head” and “tail” functions to view the first and last few rows of the data.

  6. Use the “str” function to display the structure of the “customer_data” object.

  7. Pull basic statistics (mean, median, minimum, maximum) for the “customer_data” object using the “summary” function.

  8. Export the data in the “customer_data” object as a CSV file named “customer_data_export.csv”.

4.5.2 Exercise 2

Load the inbuilt iris dataset in RStudio.

  1. Using the “plot” function, create a scatter plot of the “Sepal.Length” and “Sepal.Width” columns in the “iris” object.

  2. Using the “boxplot” function, create a box plot of the “Petal.Length” column in the “iris” object, grouped by the “Species” column.

  3. Using the “hist” function, create a histogram of the “Petal.Width” column in the “iris” object, grouped by the “Species” column.

  4. Using the “plot” function, create a line plot of the “Sepal.Length” column in the “iris” object, grouped by the “Species” column.