With support from the Center for Research Data and Digital Scholarship and the University of Colorado Boulder
This document contains an introduction to installing R, installing packages, and object-based coding concepts. If you are having trouble downloading R and installing your first packages, please view the optional check-in assessment at https://jayholster.shinyapps.io/RLevel0Assessment/
This is an R Markdown document (.Rmd for file extensions). R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents that can include blocks of code, as well as space for narrative to describe the code. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can quickly insert chunks into your R Markdown file with the keyboard shortcut Cmd + Option + I (Windows Ctrl + Alt + I).
R is one of the most popular programming languages for data science. This introductory course aims to give participants a start toward learning R for a variety of data science tasks, including statistical analysis, data visualization, natural language processing, and others.
By the end of this course and the accompanying set of short courses over the semester, you will have, among several other skills, the resources to use R to visualize distributions of data across categorical groups:
library(ggstatsplot)
data <- read.csv('musicclassintentions.csv')
ggbetweenstats(
data = data,
x = Level,
y = intentions,
title = "Distribution of Intentions to Sign Up for Music Across Grade Level"
)
Calculate correlation coefficients, and visualize relationships between your data:
library(psych)
library(tidyverse)
variablelist <- data %>% select(intentions, values, needs, parentsupport, SESComp)
psych::pairs.panels(variablelist,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE, # show correlation ellipses
lm = TRUE # plot linear fits instead of loess smooths
)
Fit General Linear Models (GLM) and produce publishable tables:
library(tidyverse)
library(knitr)
library(broom)
tidy(lm(intentions ~ SESComp + parentsupport + needs + values, data=data)) %>%
kable(caption = "Estimates for a Model Fitted to Estimate Variation in Music Elective Intentions.",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3))
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | -5.11 | 6.523 | -0.78 | 0.439 |
SESComp | 3.80 | 1.710 | 2.22 | 0.034 |
parentsupport | 0.48 | 0.301 | 1.60 | 0.120 |
needs | -0.28 | 0.120 | -2.35 | 0.025 |
values | 0.29 | 0.074 | 3.87 | 0.001 |
Understand and utilize logistic and linear regression analysis. Additionally, you will be able to fit, interpret, and visualize Structural Equation Models (SEM).
In addition to each of these tasks, you will also be introduced to both text and network analysis in R. There will be an informal assessment tied to each chapter so you can test and apply your skills as you move through the book.
Before we get ahead of ourselves, take a few minutes to download and install both R and RStudio.
Whereas R is a programming language, RStudio is an integrated development environment (IDE) that lets users efficiently access and view most facets of the program in a four-pane environment. The panes include the source; the console; the environment and history; and the files, plots, packages, and help. The console is in the lower-left corner; this is where commands are entered and output is printed. The source pane is in the upper-left corner and is a built-in text editor. While the console is where commands are entered, the source pane holds executable code that is sent to the console when run. The environment tab, in the upper-right corner, displays a list of loaded R objects. The history tab tracks the commands the user has entered into the console. Tabs to view files, plots, packages, help, and the output viewer are in the lower-right corner.
Whereas SPSS and other menu-based analytic software are limited by user input and installed software features, R operates as a mediator between user inputs and open source developments from our colleagues all over the world. This affords R users a certain flexibility. However, it takes a few extra steps to appropriately launch projects. Regardless of your needs with R, you will likely interact with the following elements of document setup.
Download R from <https://cran.r-project.org/>. Choose your operating system (Windows, macOS, or Linux) and download as you would any other program.
Download the free version of RStudio for your OS from <https://www.rstudio.com/products/rstudio/download/>. Follow the prompts to install.
From the CRAN project (link for full source: <https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf>): “When R is running, variables, data, functions, results, etc, are stored in the active memory of the computer in the form of objects which have a name. The user can do actions on these objects with operators (arithmetic, logical, comparison, . . .) and functions (which are themselves objects).”
“Numquam ponenda est pluralitas sine necessitate. Plurality is never to be posited without necessity.”
William of Occam, circa 1495
The most formidable challenge many new R users face is learning to code. While coding can seem daunting at first, it is important to remember that all coding tasks simply involve solutions to problems the user identifies. No matter how difficult the problem, there are usually many solutions, and someone else has likely encountered the same problem and posted a solution to a forum. Occam's razor (i.e., the solution with the fewest assumptions is the best) helps you identify the problems to solve as you interact with code. For new users, the breakthrough moment when you start to feel like a programmer might come from a well-worded Google search and a focused effort to solve an issue with the R programming language. It is, indeed, a language, so half of the battle is learning to read code in a manner that is meaningful to you. Throughout this book, you will be provided with tutorials and suggestions for reading the code you interact with.
It is important, whether you are working alone or with others, to adopt a collaboration mindset. This value is clearly important when working with other statistical collaborators or with domain experts who do not have experience in R. Even experienced users might become confused when examining a peer’s code. The same effect may occur if you return to a project after many months and find yourself lost in your own code. As such, I recommend utilizing R Markdown files (File -> New File -> R Markdown) and comments (#) to provide notes to yourself and others who might interact with your code. As described at the start of this document, R Markdown files combine narrative with code chunks, and clicking the **Knit** button generates a document that includes both the content and the output of the embedded chunks.
Comments can be used within code chunks. R ignores everything after the # symbol on a line, so comments are not a functioning part of the code.
x <- 10 # This is an example of a comment.
No matter what coding format you choose, insert narrative into your document in a way that makes sense to you. It may be helpful to split your code up into small, easy-to-digest chunks so as not to become overwhelmed when examining your work. Lastly, it is also helpful to use a project-specific directory and to save your work frequently.
When you first use R, follow this procedure on Windows and macOS:
Create a sub-directory (folder), “R” for example, in your “Documents” folder. This sub-folder, also known as the working directory, will be used by R to read and save files. Think of it as a downloads folder for R only.
You can specify your working directory to R in a few ways. In RStudio, click the Session menu at the top of your screen, choose Set Working Directory, and then choose your directory. It might also be useful to change the directory using code. To do this, use the function ‘setwd’, and then enter the location of your directory.
#setwd('Users/jacob/R/R Series')
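A minimal sketch of checking and changing the working directory from code. The folder used here is a temporary directory created only for illustration (an assumption, not the author's path); substitute the path to your own project folder.

```r
# Remember where R is currently reading and saving files
original_dir <- getwd()

# A throwaway folder standing in for your project directory (hypothetical)
project_dir <- file.path(tempdir(), "R-Series-demo")
dir.create(project_dir, showWarnings = FALSE)

# Point R at the project folder, then confirm the change
setwd(project_dir)
current_dir <- getwd()
basename(current_dir)

# Return to the original directory when finished
setwd(original_dir)
```

Calling getwd() before and after setwd() is a quick way to confirm that R is looking where you expect.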
If this is your first foray into coding, you might think of it as a conversation you are having with R about a problem you are trying to solve. You can talk to R using numerical digits and text. Operators are the symbols that connect your numbers or words with mathematical (e.g. addition), relational (e.g. >=), and logical (e.g. conditional coding) manipulations.
To start coding with mathematical operators, enter a number in the code box below, then click the run button.
Now, pick a set of two numerals to sum, placing an addition sign between them. Then click the run button.
7+2
## [1] 9
Base R comes with working mathematical operators for addition (+), subtraction (-), multiplication (*), division (/), and exponents (^). I’ve left an example for you below. Try making your own.
7+2-10*40
## [1] -391
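The result above follows from operator precedence: multiplication is evaluated before addition and subtraction, so 10*40 is computed first. Here is a short sketch of precedence, along with the relational and logical operators mentioned earlier (the object names are illustrative only):

```r
# Multiplication binds tighter than addition and subtraction
default_order <- 7 + 2 - 10 * 40      # 9 - 400
grouped_order <- (7 + 2 - 10) * 40    # -1 * 40

# Exponents bind tighter still
power_first <- 2 * 3^2                # 2 * 9

# Relational operators return logical values
is_larger <- 7 >= 2

# Logical operators combine logical values
both_true <- (7 >= 2) & (2 > 10)

c(default_order, grouped_order, power_first)
c(is_larger, both_true)
```

When in doubt about the order of evaluation, parentheses make your intent explicit.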
Functions allow pieces of your input to be connected. For example, the sum function adds a set of numbers which are specified within parentheses, as demonstrated below.
sum(2,4)
## [1] 6
Try to run this set.
sum(2 4)
## Error: <text>:1:7: unexpected numeric constant
## 1: sum(2 4
## ^
When the comma is omitted, R returns an error. Fix the example above. If any part of your code is not correct, your document will not knit without additional encouragement.
You can sum a sequence of integers by separating its first and last values with the colon operator (:), as seen below.
sum(4:20)
## [1] 204
Base R comes installed with a plethora of functions. R also helps you find the right function for you. Place your cursor after the function ‘seq’ but before the first parenthesis, and press tab. Hover over the function ‘seq’ in the dropdown list to see a full description. Read the description and examine the code. What do you think the output will be?
seq(from = 0, to = 20, by = 4)
## [1] 0 4 8 12 16 20
help(seq)
Was the output what you expected? The seq function generates a sequence of numbers. In this example, 0 and 20 are the lower and upper limits of the sequence, and 4 is the step between values. Now, look to the bottom-right side of RStudio. Since you ran the entire cell, the command ‘help(seq)’ launched a search in the R documentation for the seq function in addition to running the seq function. Here, we can ascertain that this function takes a set of arguments (e.g. from = 0, to = 20, by = 4). When you paste that exact code into the seq function, it generates the same result. Try it!
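Arguments can be supplied by name or by position. The sketch below shows a few equivalent and alternative calls to seq(); length.out is a documented argument of seq() that fixes the number of values instead of the step size. The object names are illustrative.

```r
# Named arguments: order does not matter
by_name <- seq(from = 0, to = 20, by = 4)
reordered <- seq(by = 4, to = 20, from = 0)

# Positional arguments: matched in the order from, to, by
by_position <- seq(0, 20, 4)

# Ask for a number of evenly spaced values instead of a step size
six_values <- seq(from = 0, to = 20, length.out = 6)

identical(by_name, reordered)
identical(by_name, by_position)
six_values
```

Naming arguments takes more typing, but it makes code easier to read later, which fits the collaboration mindset described earlier.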
Hold that thought about arguments. To truly appreciate arguments, it’s important to have a working understanding of objects. First, an example of an object built into base R. The input below is not numeric, but still represents a number. Run the code, and you will see that the word ‘pi’ has been assigned the numeric value of pi. This is one of the few predefined objects in R. For your purposes in using R, you will likely be making your own.
pi
## [1] 3.141593
To assign values to objects, as the numerical version of pi was to the word pi, use the ‘<-’ operator. For example, the code segment below assigns the value 50 to the object ‘a’, and 14 to the object ‘b’ using the ‘<-’ operator. From a code reading perspective, it may be helpful to read the code out loud, saying a is 50, and b is 14. While this seems overly simplistic at this stage, objects can be complex.
a <- 50
b <- 14
Further, many objects are typically involved in coding. See the code chunk below for a simple example of the interaction of two objects.
a + b
## [1] 64
Notice that R held the object assignments from the previous cell. You can also assign the result of an expression to an object, and call that object to see the result. For instance:
addvalues <- a + b
addvalues
## [1] 64
R does not return this sum because it understands the word ‘addvalues’ on its own, but because the object addvalues stores the result of the expression ‘a + b’, computed when the assignment was run. Try switching the values of a and b three chunks ago, and rerunning the subsequent chunks. Remember that object names are case sensitive and cannot contain spaces. If you ran the code ‘A+B’, what would happen?
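To see what an object stores, it helps to contrast a stored value with an actual function. In this sketch (the names total and add_values are illustrative, not from the original), total keeps the value computed at assignment time, while add_values() recomputes its result each time it is called.

```r
a <- 50
b <- 14

# The right-hand side is evaluated once and the result is stored
total <- a + b

# A function stores the instructions, not the result
add_values <- function(x, y) {
  x + y
}

# Changing a afterwards does not change the stored total
a <- 100
total              # still 64
add_values(a, b)   # recomputed with the new a: 114
```

This is why rerunning the assignment chunk matters after you change a or b.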
Why do we not use the equal sign to assign objects? According to the R documentation, the “operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions”.
Packages are the fundamental unit of shareable code in R. Packages help standardize tools and conventions and save time. There is a package for almost anything you can think of. You can also make packages yourself; if you find yourself using R for a unique workflow, you might benefit from creating your own package.
You will not be able to see this screenshot of a tweet in the RMD file you download, while it will be viewable in the pdf or html knitted output of this RMD. To upload your own image, move or save a file to your working directory and use this format.
Today we will be focusing on two packages.
tidyverse: The tidyverse installation contains multiple R packages that “share an underlying design philosophy, grammar, and data structure” (help(tidyverse)) for the purpose of data wrangling, analysis, and visualization. These include ggplot2 for data visualization, dplyr for data manipulation, tidyr, readr, purrr, tibble, stringr, and forcats.
fivethirtyeight: The fivethirtyeight package includes 128 callable datasets from Nate Silver’s statistical analysis website. For a full list of datasets, follow this link: https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html
To install packages, use the code below. Remember to wrap the package in quotes when you are installing it.
# Naming a CRAN mirror with repos avoids the "trying to use CRAN without setting a mirror" error when knitting
install.packages("tidyverse", repos = "https://cloud.r-project.org")
install.packages("fivethirtyeight", repos = "https://cloud.r-project.org")
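Installing a package only needs to happen once per computer, so it can help to check what is already installed first. The sketch below is one possible approach, not the author's method: it uses base R's requireNamespace() to list which of the two packages are missing, and leaves the actual install commented out.

```r
# Packages this chapter relies on
needed <- c("tidyverse", "fivethirtyeight")

# requireNamespace() returns TRUE when a package is already installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]
missing_pkgs

# Uncomment to install only what is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result means both packages are ready to be loaded with library().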
Now call the packages using the library() function. You do not need to use quotes when calling packages you have already installed.
library(tidyverse)
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
We will be using a built-in dataset from the fivethirtyeight package that contains information on US births from 1994 to 2003. Call the dataset by its name, ‘US_births_1994_2003’, and assign it to the object ‘data’ with the operator ‘<-’. See the code below.
data <- US_births_1994_2003
View(data)
The head() and tail() functions show the first and last six rows of a data frame. Notice that when you run the entire cell, RStudio allows you to scroll through the various outputs.
head(data)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <int> <int> <date> <ord> <int>
## 1 1994 1 1 1994-01-01 Sat 8096
## 2 1994 1 2 1994-01-02 Sun 7772
## 3 1994 1 3 1994-01-03 Mon 10142
## 4 1994 1 4 1994-01-04 Tues 11248
## 5 1994 1 5 1994-01-05 Wed 11053
## 6 1994 1 6 1994-01-06 Thurs 11406
tail(data)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <int> <int> <date> <ord> <int>
## 1 2003 12 26 2003-12-26 Fri 10218
## 2 2003 12 27 2003-12-27 Sat 8646
## 3 2003 12 28 2003-12-28 Sun 7645
## 4 2003 12 29 2003-12-29 Mon 12823
## 5 2003 12 30 2003-12-30 Tues 14438
## 6 2003 12 31 2003-12-31 Wed 12374
To investigate the names of the columns, run the function colnames(). For a summary of each column of data, call the summary() function. You can also see the data using view(), a lowercase counterpart of View() that the tidyverse makes available.
colnames(data)
## [1] "year" "month" "date_of_month" "date"
## [5] "day_of_week" "births"
summary(data)
## year month date_of_month date
## Min. :1994 Min. : 1.000 Min. : 1.00 Min. :1994-01-01
## 1st Qu.:1996 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:1996-07-01
## Median :1998 Median : 7.000 Median :16.00 Median :1998-12-31
## Mean :1998 Mean : 6.524 Mean :15.73 Mean :1998-12-31
## 3rd Qu.:2001 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:2001-07-01
## Max. :2003 Max. :12.000 Max. :31.00 Max. :2003-12-31
##
## day_of_week births
## Sun :522 Min. : 6443
## Mon :522 1st Qu.: 8844
## Tues :522 Median :11615
## Wed :522 Mean :10877
## Thurs:521 3rd Qu.:12274
## Fri :521 Max. :14540
## Sat :522
view(data)
Most R workflows involve some of these five basic data types: integers (e.g. 2), numeric values (e.g. 2.5), factors or variables with levels (e.g. gender), logical values (i.e. TRUE/FALSE), and characters (e.g. “text”).
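A quick way to check a value's type is class(). The sketch below creates one example of each type listed above; the values and object names are made up for illustration. Note that a whole number is stored as an integer only when written with the L suffix.

```r
whole   <- 2L                              # integer
decimal <- 2.5                             # numeric (double)
group   <- factor(c("low", "high", "low")) # factor with two levels
flag    <- TRUE                            # logical
word    <- "text"                          # character

types <- c(
  class(whole), class(decimal), class(group),
  class(flag), class(word)
)
types

# Without the L suffix, 2 is stored as numeric
class(2)
```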
Try running the functions below to examine different facets of the dataset. To run a single line within a code chunk, highlight the code you want and press ‘command + enter’ on Mac or ‘ctrl + enter’ on Windows.
length(data) # number of elements or components
## [1] 6
str(data) # structure of an object
## tibble [3,652 × 6] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:3652] 1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...
## $ month : int [1:3652] 1 1 1 1 1 1 1 1 1 1 ...
## $ date_of_month: int [1:3652] 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Date[1:3652], format: "1994-01-01" "1994-01-02" ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 7 1 2 3 4 5 6 7 1 2 ...
## $ births : int [1:3652] 8096 7772 10142 11248 11053 11406 11251 8653 7910 10498 ...
class(data) # class or type of an object
## [1] "tbl_df" "tbl" "data.frame"
names(data) # column names
## [1] "year" "month" "date_of_month" "date"
## [5] "day_of_week" "births"
Sometimes when your data is loaded, R will assign a column an incorrect type. For example, you can change the data type of the column ‘month’ from integer to numeric using the function as.numeric(). The $ sign gives access to columns by their names in the dataset.
data$month <- as.numeric(data$month)
str(data)
## tibble [3,652 × 6] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:3652] 1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...
## $ month : num [1:3652] 1 1 1 1 1 1 1 1 1 1 ...
## $ date_of_month: int [1:3652] 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Date[1:3652], format: "1994-01-01" "1994-01-02" ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 7 1 2 3 4 5 6 7 1 2 ...
## $ births : int [1:3652] 8096 7772 10142 11248 11053 11406 11251 8653 7910 10498 ...
When operationalizing factors for quantitative data analysis, you might want to convert a string-based factor to numeric values. The code below converts the string data in the day_of_week column into a numeric vector.
head(data$day_of_week)
## [1] Sat Sun Mon Tues Wed Thurs
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
num_day_of_Week <- as.numeric(data$day_of_week)
head(num_day_of_Week)
## [1] 7 1 2 3 4 5
head(data)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 1 1 1994-01-01 Sat 8096
## 2 1994 1 2 1994-01-02 Sun 7772
## 3 1994 1 3 1994-01-03 Mon 10142
## 4 1994 1 4 1994-01-04 Tues 11248
## 5 1994 1 5 1994-01-05 Wed 11053
## 6 1994 1 6 1994-01-06 Thurs 11406
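Note that as.numeric() applied to a factor returns the position of each value among the factor's levels, which is why Saturday became 7 above. When the labels themselves are numbers, convert through as.character() first. A small sketch with made-up values:

```r
# An ordered factor like day_of_week: codes follow the level order
days <- factor(c("Sat", "Sun", "Mon"),
               levels = c("Sun", "Mon", "Sat"),
               ordered = TRUE)
day_codes <- as.numeric(days)
day_codes

# A factor whose labels are numbers
scores <- factor(c("10", "30", "20"))
level_codes <- as.numeric(scores)                 # positions among levels
real_values <- as.numeric(as.character(scores))   # the labels as numbers
level_codes
real_values
```

Checking levels() before converting is a simple way to confirm which number each label will receive.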
Use the unique() function to show the unique values in a column.
# Get the unique years of the data
unique(data$year)
## [1] 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Compute the mean of a vector.
# Compute the average number of births on any day in any year.
mean(data$births)
## [1] 10876.82
Compute the standard deviation of a vector.
# Compute the standard deviation of the number of births in a given day
sd(data$births)
## [1] 1858.567
Use max() and min() to identify the largest and smallest values in a column, and quantile() to summarize the distribution of the data.
max(data$births)
## [1] 14540
min(data$births)
## [1] 6443
quantile(data$births, probs=c(0.0, 0.33, 0.50, 0.66, 1.0))
## 0% 33% 50% 66% 100%
## 6443.00 10683.77 11615.00 12061.00 14540.00
To get quantiles for births from 1995 alone, create a new object with filtered data.
births_1995 <- filter(data, year==1995)$births
quantile(births_1995)
## 0% 25% 50% 75% 100%
## 6999 8949 11344 11897 13023
Base R allows you to find the row where one column reaches its maximum and then look up the value of another column in that row. For example, the following code uses the max() and which.max() functions to identify the date when the maximum number of births took place.
# Find the date of the maximum births
max(data$births)
## [1] 14540
maxindex <- which.max(data$births)
maxdate <- data$date[maxindex]
maxdate
## [1] "1999-09-09"
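The same lookup pattern works on any data frame. A minimal sketch on a made-up data frame (the object name, dates, and values are hypothetical):

```r
toy <- data.frame(
  date   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  births = c(8096, 11248, 7772)
)

# which.max() returns the position of the first maximum
max_row <- which.max(toy$births)

# Use that position to pull the matching value from another column
max_date <- toy$date[max_row]
max_row
max_date
```

If the maximum occurs more than once, which.max() reports only the first position.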
You can use the filter() function from the tidyverse to achieve the same result in one line of code.
filter(data, births==max(births))
## # A tibble: 1 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1999 9 9 1999-09-09 Thurs 14540
We want to find the max number of births for each day of the week for any year. The answers for Sundays and Mondays are below. Write your own code in the following code chunk that tells you the same information for Tuesday and Wednesday.
Sunday: 8926, August 14th, 1994
Monday: 12967, December 22nd, 2003
Tuesday:
Wednesday:
# this is what filter is doing on groups
sundays <- filter(data, day_of_week == 'Sun')
filter(sundays, births==max(births))
## # A tibble: 1 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 8 14 1994-08-14 Sun 8926
mondays <- filter(data, day_of_week == 'Mon')
filter(mondays, births==max(births))
## # A tibble: 1 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 2003 12 22 2003-12-22 Mon 12967
tuesdays <- filter(data, day_of_week == 'Tues')
filter(tuesdays, births==max(births))
## # A tibble: 1 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 2003 12 30 2003-12-30 Tues 14438
Use the group_by() function and filter the dataset by max births to display the max number of births on each day of the week for any year.
data_grouped <- group_by(data, day_of_week)
filter(data_grouped, births==max(births))
## # A tibble: 7 x 6
## # Groups: day_of_week [7]
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 8 14 1994-08-14 Sun 8926
## 2 1994 9 17 1994-09-17 Sat 9779
## 3 1999 9 9 1999-09-09 Thurs 14540
## 4 2001 12 28 2001-12-28 Fri 13918
## 5 2003 9 3 2003-09-03 Wed 14119
## 6 2003 12 22 2003-12-22 Mon 12967
## 7 2003 12 30 2003-12-30 Tues 14438
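To make the grouping step concrete, here is a small sketch on a made-up tibble, assuming the dplyr package (part of the tidyverse) is installed. Within each group, filter() keeps the rows whose births equal that group's maximum.

```r
library(dplyr)

toy <- tibble(
  day_of_week = c("Sun", "Sun", "Mon", "Mon"),
  births      = c(7000, 8000, 11000, 10500)
)

# max(births) is computed separately for each day_of_week
toy_max <- toy %>%
  group_by(day_of_week) %>%
  filter(births == max(births)) %>%
  ungroup()

toy_max
```

The ungroup() call at the end removes the grouping so that later steps operate on the whole table again.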
You can also group by more than one column.
# Add year as another grouping variable
data_grouped <- group_by(data, day_of_week, year)
data_max_day_year <- filter(data_grouped, births==max(births))
data_max_day_year
## # A tibble: 70 x 6
## # Groups: day_of_week, year [70]
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 7 6 1994-07-06 Wed 13086
## 2 1994 7 7 1994-07-07 Thurs 13049
## 3 1994 8 14 1994-08-14 Sun 8926
## 4 1994 9 16 1994-09-16 Fri 12884
## 5 1994 9 17 1994-09-17 Sat 9779
## 6 1994 11 21 1994-11-21 Mon 11807
## 7 1994 12 20 1994-12-20 Tues 12880
## 8 1995 9 6 1995-09-06 Wed 12951
## 9 1995 9 7 1995-09-07 Thurs 12924
## 10 1995 9 9 1995-09-09 Sat 9714
## # … with 60 more rows
Now we can use ggplot to visualize the max births by weekday over time. To write the edited dataset use the write_csv() function.
ggplot(data_max_day_year, aes(x = year, y = births, colour = day_of_week)) +
geom_line() +
ylab("Max births") +
ylim(c(0,15000))
write_csv(data_max_day_year, "max_births_per_dayofweek_per_year.csv")
There are many useful functions for altering your data frame. In this section you will start to see the %>% operator. You can read this operator as the word ‘then’ when you are reading code to yourself (e.g. take the original dataset, then arrange it by births in descending order).
You can use this operator to apply functions to your dataset. For instance, order the data frame by a certain column (the default is ascending order; wrap the column in desc() for descending order).
# Order by number of births
data <- data %>% arrange(desc(births))
head(data)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1999 9 9 1999-09-09 Thurs 14540
## 2 2003 12 30 2003-12-30 Tues 14438
## 3 2003 9 16 2003-09-16 Tues 14145
## 4 2003 9 3 2003-09-03 Wed 14119
## 5 2003 9 23 2003-09-23 Tues 14036
## 6 2002 9 12 2002-09-12 Thurs 13982
data <- data %>% arrange(year)
head(data)
## # A tibble: 6 x 6
## year month date_of_month date day_of_week births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 7 6 1994-07-06 Wed 13086
## 2 1994 7 7 1994-07-07 Thurs 13049
## 3 1994 9 16 1994-09-16 Fri 12884
## 4 1994 12 20 1994-12-20 Tues 12880
## 5 1994 9 9 1994-09-09 Fri 12811
## 6 1994 11 22 1994-11-22 Tues 12764
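The pipe passes the result on its left as the first argument of the function on its right. The sketch below shows that a piped call and a nested call are equivalent, assuming dplyr is installed and using a made-up data frame:

```r
library(dplyr)

toy <- data.frame(year = c(1995, 1994, 1996), births = c(10, 30, 20))

# Nested: read from the inside out
nested <- head(arrange(toy, desc(births)), 2)

# Piped: read left to right as "take toy, then arrange, then take the head"
piped <- toy %>%
  arrange(desc(births)) %>%
  head(2)

identical(nested, piped)
piped
```

Both versions do the same work; the piped form simply lists the steps in the order they happen.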
Rename specific columns of your data without opening and editing the file.
# Rename date_of_month and day_of_week columns
data <- data %>% rename(day = date_of_month, weekday = day_of_week)
head(data)
## # A tibble: 6 x 6
## year month day date weekday births
## <int> <dbl> <int> <date> <ord> <int>
## 1 1994 7 6 1994-07-06 Wed 13086
## 2 1994 7 7 1994-07-07 Thurs 13049
## 3 1994 9 16 1994-09-16 Fri 12884
## 4 1994 12 20 1994-12-20 Tues 12880
## 5 1994 9 9 1994-09-09 Fri 12811
## 6 1994 11 22 1994-11-22 Tues 12764
Select rows of a data frame based on a certain condition. Use the dim() function to ascertain the dimensions of the resultant dataset.
# Select days only in the month of July and do the same for Jan
july_data <- data %>% filter(month == 7)
jan_data <- data %>% filter(month == 1)
# Select days only in the month of January in the year 2000
jan_2000_data <- data %>% filter(month == 1 & year == 2000)
dim(jan_2000_data)
## [1] 31 6
Keep only the data for the first half of the year (January through June). Use the %in% operator to test whether each month is one of several values; comparing with == against the vector 1:6 would recycle that vector along the column and keep only a fraction of the matching rows.
janthrujune <- data %>% filter(month %in% 1:6)
min(janthrujune$month)
## [1] 1
max(janthrujune$month)
## [1] 6
janthrujune
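Comparing a column to a vector of several values with == lines the two up element by element, recycling the shorter one, so it does not test membership. The %in% operator does. A small base R sketch with made-up month values:

```r
months <- c(1, 6, 7, 3, 12, 2)
first_half <- 1:6

# Element-by-element comparison: 1==1, 6==2, 7==3, 3==4, 12==5, 2==6
elementwise <- months == first_half

# Membership test: is each month one of 1 through 6?
membership <- months %in% first_half

elementwise
membership
months[membership]
```

Only the first value passes the element-by-element comparison, while the membership test correctly keeps 1, 6, 3, and 2.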
Select specific columns of a data frame using the select() function.
# Select only the date and births columns
selected_data <- data %>% select(date, births)
selected_data
## # A tibble: 3,652 x 2
## date births
## <date> <int>
## 1 1994-07-06 13086
## 2 1994-07-07 13049
## 3 1994-09-16 12884
## 4 1994-12-20 12880
## 5 1994-09-09 12811
## 6 1994-11-22 12764
## 7 1994-09-08 12693
## 8 1994-07-15 12691
## 9 1994-09-07 12660
## 10 1994-09-15 12655
## # … with 3,642 more rows
Separate one column into several.
# Separate date of selected_data into 3 columns
parsed_data <- selected_data %>% separate(date, c('y','m','d'))
parsed_data
## # A tibble: 3,652 x 4
## y m d births
## <chr> <chr> <chr> <int>
## 1 1994 07 06 13086
## 2 1994 07 07 13049
## 3 1994 09 16 12884
## 4 1994 12 20 12880
## 5 1994 09 09 12811
## 6 1994 11 22 12764
## 7 1994 09 08 12693
## 8 1994 07 15 12691
## 9 1994 09 07 12660
## 10 1994 09 15 12655
## # … with 3,642 more rows
You can also add an additional column to the data frame.
# Create a column that indicates if it is a summer month (June, July, August).
data <- data %>% mutate(summer = between(month,6,8))
str(data)
## tibble [3,652 × 7] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:3652] 1994 1994 1994 1994 1994 1994 1994 1994 1994 1994 ...
## $ month : num [1:3652] 7 7 9 12 9 11 9 7 9 9 ...
## $ day : int [1:3652] 6 7 16 20 9 22 8 15 7 15 ...
## $ date : Date[1:3652], format: "1994-07-06" "1994-07-07" ...
## $ weekday: Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 4 5 6 3 6 3 5 6 4 5 ...
## $ births : int [1:3652] 13086 13049 12884 12880 12811 12764 12693 12691 12660 12655 ...
## $ summer : logi [1:3652] TRUE TRUE FALSE FALSE FALSE FALSE ...
And compute summary statistics of your data frame.
# Compute the mean number of births for a given day
overall_mean <- data %>% summarise(average = mean(births))
Group by columns to calculate summary statistics.
# Compute the mean and median number of births by weekday
weekday_mean <- data %>%
group_by(weekday) %>%
summarise(average = mean(births))
weekday_median <- data %>%
group_by(weekday) %>%
summarise(median = median(births))
# Join the two summaries on the shared weekday column
meansandmedians <- right_join(weekday_mean, weekday_median, by = "weekday")
meansandmedians
## # A tibble: 7 x 3
## weekday average median
## <ord> <dbl> <dbl>
## 1 Sun 7816. 7780
## 2 Mon 11090. 11198
## 3 Tues 12349. 12392.
## 4 Wed 12113. 12128.
## 5 Thurs 12070. 12168
## 6 Fri 11965. 12047
## 7 Sat 8740. 8696.
Bind two dataframes together by row.
# Bind data frames by row
janjul_data <- jan_data %>%
bind_rows(july_data)
tail(janjul_data)
## # A tibble: 6 x 6
## year month day date weekday births
## <int> <dbl> <int> <date> <ord> <int>
## 1 2003 7 12 2003-07-12 Sat 8776
## 2 2003 7 5 2003-07-05 Sat 8209
## 3 2003 7 20 2003-07-20 Sun 7954
## 4 2003 7 13 2003-07-13 Sun 7867
## 5 2003 7 6 2003-07-06 Sun 7789
## 6 2003 7 27 2003-07-27 Sun 7740
Join two data frames on shared columns.
# Join two data frames together
joined_data <- jan_data %>%
left_join(july_data, by = c("year","day"))
tail(joined_data)
## # A tibble: 6 x 10
## year month.x day date.x weekday.x births.x month.y date.y weekday.y
## <int> <dbl> <int> <date> <ord> <int> <dbl> <date> <ord>
## 1 2003 1 25 2003-01-25 Sat 8241 7 2003-07-25 Fri
## 2 2003 1 1 2003-01-01 Wed 7783 7 2003-07-01 Tues
## 3 2003 1 19 2003-01-19 Sun 7366 7 2003-07-19 Sat
## 4 2003 1 5 2003-01-05 Sun 7365 7 2003-07-05 Sat
## 5 2003 1 26 2003-01-26 Sun 7295 7 2003-07-26 Sat
## 6 2003 1 12 2003-01-12 Sun 7214 7 2003-07-12 Sat
## # … with 1 more variable: births.y <int>
Gather multiple columns into one.
# Gather the births.x and births.y columns into key-value pairs
gathered_data <- joined_data %>%
gather(key = month, value = births, c("births.x","births.y"))
# key: column name representing new variable
# value: column name representing variable values
# remaining: columns to gather
Save the data frame to a .csv file.
write_csv(gathered_data, "gathered.csv")
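In current versions of tidyr, gather() is superseded by pivot_longer(), which does the same reshaping with more explicit argument names. A sketch on a made-up data frame, assuming the tidyr package is installed (the object names and the "source" column name are illustrative):

```r
library(tidyr)

wide <- data.frame(
  day      = c(1, 2),
  births.x = c(8000, 7500),   # stands in for January births
  births.y = c(9000, 9500)    # stands in for July births
)

long <- pivot_longer(
  wide,
  cols      = c("births.x", "births.y"),  # columns to stack
  names_to  = "source",                   # new column holding old column names
  values_to = "births"                    # new column holding the values
)

long
```

Each original row becomes two rows, one per gathered column, so the two-row table becomes a four-row table.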
We will be working with a dataset from the fivethirtyeight package again. If you have not yet installed this package, run install.packages("fivethirtyeight") in the R console. You will also need the “ggpubr” package, which will allow us to place multiple plots on the same figure. Now let’s load our libraries. If you do not have the packages already, install the tidyverse, fivethirtyeight, and ggpubr packages.
The tidyverse installation contains multiple R packages for the purpose of data wrangling, analysis, and visualization. These include ggplot2 for data visualization, dplyr for data manipulation, tidyr, readr, purrr, tibble, stringr, and forcats. The fivethirtyeight package includes 128 callable datasets from Nate Silver’s statistical analysis website. For a full list of datasets, follow this link: https://cran.r-project.org/web/packages/fivethirtyeight/vignettes/fivethirtyeight.html
ggpubr includes tools for creating publishable plots, making it easy to annotate and arrange multiple plots. Install any packages that you need, and then load the libraries using the next two code chunks.
#install.packages("ggpubr")
#install.packages("tidyverse")
#install.packages("fivethirtyeight")
library(tidyverse)
library(fivethirtyeight)
library(ggpubr)
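If you are not sure which of these packages are already on your machine, a short check can list the missing ones before you install anything. This is a minimal sketch using base R only; the object names needed, is_installed, and missing_pkgs are made up for illustration.

```r
# Packages this lesson relies on
needed <- c("tidyverse", "fivethirtyeight", "ggpubr")

# requireNamespace() returns FALSE, rather than stopping with an error,
# when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are installed.")
}
```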
We are going to start with data regarding the Bechdel test, which measures the representation of women in fictional films. Films that pass the Bechdel test feature at least two women who talk to each other about something other than a man. Roughly half of the films pass the test; however, films that passed tended to earn more money than films that failed. Let’s explore and visualize this dataset to see the extent to which financial and other implications can be identified.
First, load the data, assigning it to the object ‘bech_data’. Then look at the dataset using the View() and str(), short for structure, functions.
bech_data <- bechdel
View(bech_data)
str(bech_data)
## tibble [1,794 × 15] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:1794] 2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ imdb : chr [1:1794] "tt1711425" "tt1343727" "tt2024544" "tt1272878" ...
## $ title : chr [1:1794] "21 & Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
## $ test : chr [1:1794] "notalk" "ok-disagree" "notalk-disagree" "notalk" ...
## $ clean_test : Ord.factor w/ 5 levels "nowomen"<"notalk"<..: 2 5 2 2 3 3 2 5 5 2 ...
## $ binary : chr [1:1794] "FAIL" "PASS" "FAIL" "FAIL" ...
## $ budget : int [1:1794] 13000000 45000000 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
## $ domgross : num [1:1794] 25682380 13414714 53107035 75612460 95020213 ...
## $ intgross : num [1:1794] 4.22e+07 4.09e+07 1.59e+08 1.32e+08 9.50e+07 ...
## $ code : chr [1:1794] "2013FAIL" "2012PASS" "2013FAIL" "2013FAIL" ...
## $ budget_2013 : int [1:1794] 13000000 45658735 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
## $ domgross_2013: num [1:1794] 25682380 13611086 53107035 75612460 95020213 ...
## $ intgross_2013: num [1:1794] 4.22e+07 4.15e+07 1.59e+08 1.32e+08 9.50e+07 ...
## $ period_code : int [1:1794] 1 1 1 1 1 1 1 1 1 1 ...
## $ decade_code : int [1:1794] 1 1 1 1 1 1 1 1 1 1 ...
bech_data
## # A tibble: 1,794 x 15
## year imdb title test clean_test binary budget domgross intgross code
## <int> <chr> <chr> <chr> <ord> <chr> <int> <dbl> <dbl> <chr>
## 1 2013 tt17… 21 &… nota… notalk FAIL 1.30e7 25682380 4.22e7 2013…
## 2 2012 tt13… Dred… ok-d… ok PASS 4.50e7 13414714 4.09e7 2012…
## 3 2013 tt20… 12 Y… nota… notalk FAIL 2.00e7 53107035 1.59e8 2013…
## 4 2013 tt12… 2 Gu… nota… notalk FAIL 6.10e7 75612460 1.32e8 2013…
## 5 2013 tt04… 42 men men FAIL 4.00e7 95020213 9.50e7 2013…
## 6 2013 tt13… 47 R… men men FAIL 2.25e8 38362475 1.46e8 2013…
## 7 2013 tt16… A Go… nota… notalk FAIL 9.20e7 67349198 3.04e8 2013…
## 8 2013 tt21… Abou… ok-d… ok PASS 1.20e7 15323921 8.73e7 2013…
## 9 2013 tt18… Admi… ok ok PASS 1.30e7 18007317 1.80e7 2013…
## 10 2013 tt18… Afte… nota… notalk FAIL 1.30e8 60522097 2.44e8 2013…
## # … with 1,784 more rows, and 5 more variables: budget_2013 <int>,
## # domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## # decade_code <int>
The columns in the data set include the imdb code, title, and financial information, as well as the year, decade code, and period code. To create practical visualizations of financial data, it is useful to rescale the columns so that the values are displayed in millions of dollars rather than dollars. Simply use mathematical operators (e.g., data$column/20) on the columns of interest, and assign the results back to the original column names. Make sure you do not run this chunk more than once. What would happen?
bech_data$budget <- bech_data$budget/1000000
bech_data$domgross <- bech_data$domgross/1000000
bech_data$intgross <- bech_data$intgross/1000000
bech_data
## # A tibble: 1,794 x 15
## year imdb title test clean_test binary budget domgross intgross code
## <int> <chr> <chr> <chr> <ord> <chr> <dbl> <dbl> <dbl> <chr>
## 1 2013 tt17… 21 &… nota… notalk FAIL 13 25.7 42.2 2013…
## 2 2012 tt13… Dred… ok-d… ok PASS 45 13.4 40.9 2012…
## 3 2013 tt20… 12 Y… nota… notalk FAIL 20 53.1 159. 2013…
## 4 2013 tt12… 2 Gu… nota… notalk FAIL 61 75.6 132. 2013…
## 5 2013 tt04… 42 men men FAIL 40 95.0 95.0 2013…
## 6 2013 tt13… 47 R… men men FAIL 225 38.4 146. 2013…
## 7 2013 tt16… A Go… nota… notalk FAIL 92 67.3 304. 2013…
## 8 2013 tt21… Abou… ok-d… ok PASS 12 15.3 87.3 2013…
## 9 2013 tt18… Admi… ok ok PASS 13 18.0 18.0 2013…
## 10 2013 tt18… Afte… nota… notalk FAIL 130 60.5 244. 2013…
## # … with 1,784 more rows, and 5 more variables: budget_2013 <int>,
## # domgross_2013 <dbl>, intgross_2013 <dbl>, period_code <int>,
## # decade_code <int>
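One way to make a rescaling step safe to re-run is to write the result into a new column instead of overwriting the original. The sketch below contrasts the two approaches on a hypothetical three-film data frame; toy_films and budget_mil are names invented for this example.

```r
library(dplyr)

# Hypothetical data frame with budgets in dollars
toy_films <- data.frame(
  title  = c("Film A", "Film B", "Film C"),
  budget = c(13000000, 45000000, 20000000)
)

# Overwriting in place: every extra run divides by one million again
overwrite <- toy_films
overwrite$budget <- overwrite$budget / 1e6  # first run: values in millions
overwrite$budget <- overwrite$budget / 1e6  # accidental second run: far too small

# Writing to a new column: the source column is untouched,
# so repeating the step gives the same answer
safe <- toy_films
safe <- safe %>% mutate(budget_mil = budget / 1e6)
safe <- safe %>% mutate(budget_mil = budget / 1e6)

overwrite$budget
safe$budget_mil
```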
Now that we are familiar with our data, let’s start off with some qplots. The ‘q’ in qplot stands for quick. This is just a way to produce quick plots for on-the-fly data visualization. We highly recommend you use the ggplot method of plotting, which we will begin covering in the next section and spend most of today working with. However, quick plots can be useful at times.
There are four key arguments to consider when creating a qplot. These are data, where you define your dataset; x and y, where you define the x and y variables respectively; and geom, which dictates the geometry of your plot. Geom options include point, line, smooth, dotplot, boxplot, violin, histogram, and density.
# Template only: replace each placeholder with your own data frame, column names, and geom
# qplot(data = data_frame, x = x_variable, y = y_variable, geom = "plot_type")
Let’s try a few examples.
The following code will produce a set of boxplots describing the distributions of domestic gross revenue disaggregated by year. Because the year column is stored as integers, the first plot treats it as one continuous variable and does not split the data by year. Set the year column as a factor within the x argument using the as.factor() function. What does the second plot tell you that the first one fails to communicate?
qplot(data = bech_data, x = year, y= domgross, geom = 'boxplot')
qplot(data = bech_data, x = as.factor(year), y= domgross, geom = 'boxplot')
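To see why the conversion matters, it can help to inspect the column type and count the boxes that ggplot2 prepares. The sketch below uses a hypothetical data frame named toy (not the Bechdel data) and the ggplot() interface rather than qplot().

```r
library(ggplot2)

# Hypothetical data: three release years, four films each
toy <- data.frame(
  year     = rep(c(2011L, 2012L, 2013L), each = 4),
  domgross = c(10, 20, 30, 40, 15, 25, 35, 45, 5, 50, 60, 70)
)

class(toy$year)              # stored as integer, so treated as continuous
levels(as.factor(toy$year))  # as a factor, each year becomes its own category

p_factor <- ggplot(toy, aes(x = as.factor(year), y = domgross)) +
  geom_boxplot()

# layer_data() returns one row per box that will be drawn
n_boxes_factor <- nrow(layer_data(p_factor))
n_boxes_factor
```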
Period codes were included in the dataset to compare movies that were released in the same time period. The data include movies from 1970 to 2013, which were categorized into five period codes. Let’s group the domestic gross values by period code to compare the distributions of films’ earnings over time. Let’s also compare boxplots to violin plots, which show the full distribution of the data instead of plotting only summary statistics. What do violin plots tell you that boxplots fail to capture?
qplot(data = bech_data, x = period_code, y= domgross, group=period_code, geom = 'boxplot')
qplot(data = bech_data, x = period_code, y= domgross, group=period_code, geom = 'violin')
Now let’s start using ggplot. We have much more flexibility with the ggplot framework. Let’s make the same violin plot for period_code as we did in the qplot using ggplot.
ggplot(data = bech_data, aes(x = as.factor(period_code), y = domgross)) +
geom_violin()
Using ggplot, we can dive deep into histograms and density plots. These particular visualizations provide insight at a glance for the distributions of data across several groups. To start, let’s create a histogram of the domestic gross revenue.
ggplot(data = bech_data, aes(x = domgross)) +
geom_histogram()
Now let’s add some bells and whistles using ggplot’s layers-based coding to make this visualization more readable and appealing.
ggplot(data = bech_data, aes(x = domgross, fill=binary)) +
geom_histogram(alpha=0.8, colour = 'grey') +
ggtitle("Distribution of Movies in FiveThirtyEight Bechdel Dataset") +
xlab('Domestic Gross in Millions') +
ylab('Count') +
labs(fill = 'Bechdel Test')
You can specify the color (outline) and fill color of plots. You can also assign ggplots to objects. The ggpubr function ggarrange() allows you to easily present multiple plots in the same output.
color <- ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(color = 'blue') +
ggtitle("Outlining the histogram") +
xlab("Domestic Gross Revenue")
fill <- ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(fill = 'green') +
ggtitle("Filling the histogram") +
xlab("Domestic Gross Revenue")
ggarrange(color, fill)
You can vary the number of bins in each histogram. Run the following chunk to see the implications of the bins argument.
# Vary the number of bins per histogram
bin60 <- ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(bins = 60) +
ggtitle("60 bins") +
xlab("Domestic Gross Revenue")
bin30 <- ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(bins = 30) +
ggtitle("30 bins") +
xlab("Domestic Gross Revenue")
bin15 <- ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(bins = 15) +
ggtitle("15 bins") +
xlab("Domestic Gross Revenue")
ggarrange(bin60, bin30, bin15, ncol = 3)
The density plot is a smoothed variation of the histogram: rather than counting values in bins, it draws a continuous curve (a kernel density estimate) over the data. You can use geom_density() to add a density plot on top of a histogram. The alpha argument sets the opacity of the fill, and the fill argument defines the color of the density plot.
# Add a density plot
ggplot(data = bech_data, aes(x = domgross, y = ..density..)) +
geom_histogram(bins = 60, color = 'grey', fill = 'blue') +
ggtitle("Distribution of Domestic Gross Revenue for Movies") +
xlab("Domestic Gross Revenue") +
geom_density(alpha = .4, fill = 'grey')
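Putting the histogram on the density scale is what lets the two layers share a y-axis: on that scale the bar areas sum to one, just like the area under the density curve. The sketch below checks this on simulated data. It assumes ggplot2 3.3.0 or later, where after_stat(density) is the newer spelling of ..density..; the data frame toy is hypothetical.

```r
library(ggplot2)

set.seed(1)
# Hypothetical right-skewed revenue values, in millions
toy <- data.frame(domgross = rexp(500, rate = 1 / 50))

p <- ggplot(toy, aes(x = domgross, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "blue", color = "grey") +
  geom_density(alpha = .4, fill = "grey")

# Layer 1 is the histogram; bar height (density) times bar width is each bar's area
bars <- layer_data(p, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```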
You can also create histograms to compare multiple groups within a single plot. Try swapping the fill argument for the color argument, keeping the variable binary in place. See if you can apply what you have learned so far to improve the appeal of the plot.
ggplot(data = bech_data, aes(x = domgross, y = ..density.., fill = binary)) +
geom_histogram(position = "identity", bins = 60, alpha = .5) +
ggtitle("Distribution of Domestic Gross Revenue for Movies") +
xlab("Domestic Gross Revenue") +
geom_density(alpha = .4)
Change the legend title using the layer scale_color_discrete() and the name argument.
# Change legend title
ggplot(data = bech_data, aes(x = domgross, y = ..density.., color = binary)) +
geom_histogram(position = "identity", bins = 60, alpha = .5) +
ggtitle("Distribution of Domestic Gross Revenue for Movies") +
xlab("Domestic Gross Revenue") +
geom_density(alpha = .4) +
scale_color_discrete(name = "Test")
Use ggsave() and set the desired filename, size, dpi, and other parameters for saving your plot.
# Save your plot
hist_dens <- ggplot(data = bech_data, aes(x = domgross, y = ..density.., color = binary)) +
geom_histogram(position = "identity", bins = 60, alpha = .5) +
ggtitle("Distribution of Domestic Gross Revenue for Movies") +
xlab("Domestic Gross Revenue") +
geom_density(alpha = .4) +
scale_color_discrete(name = "Test")
# ggsave() is a function call, not a plot layer, so it is not chained on with +
ggsave("Hist_Dens.png", plot = hist_dens, width = 5, height = 5)
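The other ggsave() parameters work the same way. The sketch below saves a small plot to a temporary folder and sets dpi explicitly; the data frame toy, the object hist_plot, and the temporary path are assumptions made so the example runs on its own.

```r
library(ggplot2)

# Hypothetical small data set so the example is self-contained
toy <- data.frame(domgross = c(5, 12, 18, 25, 40, 41, 60, 75, 90, 120))

hist_plot <- ggplot(toy, aes(x = domgross)) +
  geom_histogram(bins = 5)

# Write to a temporary folder; swap in a real path such as "Hist_Dens.png"
out_file <- file.path(tempdir(), "Hist_Dens.png")

# width and height are in inches by default; dpi controls the resolution
ggsave(out_file, plot = hist_plot, width = 5, height = 5, dpi = 300)

file.exists(out_file)
```

Passing the plot object through the plot argument avoids relying on whichever plot happened to be displayed last.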
Let’s create some more box plots and violin plots using the period code column. As these plots are used to compare distributions of data, they are especially useful for comparing continuous data across different groups. You can use the mutate() function to set period_code as a factor, which makes your ggplot code simpler and cleaner. In this iteration, let’s try a notched box plot, which emphasizes the median with notches.
bech_data <- bech_data %>% mutate(period_code = as.factor(period_code))
ggplot(data = bech_data, aes(x = period_code, y= domgross)) +
geom_boxplot(notch = TRUE)
You can add a dot to represent the mean domestic revenue for each group of movies using stat_summary(fun = mean, geom = "point", color = "anycolor"). In ggplot2 3.3.0 and later the argument is named fun; the older name fun.y still runs but produces a deprecation warning.
ggplot(data = bech_data, aes(x = period_code, y = domgross)) +
geom_boxplot(notch = TRUE) +
stat_summary(fun = mean, geom = "point", color = 'red')
Assign period_code to the argument color to assign each period code a unique color.
ggplot(data = bech_data, aes(x = period_code, y = domgross, color = period_code)) +
geom_boxplot(notch = TRUE) +
stat_summary(fun = mean, geom = "point", color = 'red')
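The points that stat_summary() draws are ordinary per-group means, so you can compute the same numbers yourself to check the plot or report them in a table. A minimal sketch with dplyr on a hypothetical data frame (toy and group_means are made-up names):

```r
library(dplyr)

# Hypothetical revenues (in millions) for three period codes
toy <- data.frame(
  period_code = factor(c(1, 1, 2, 2, 3, 3)),
  domgross    = c(10, 30, 20, 60, 5, 15)
)

# One row per period code, holding the mean that the plot would mark
group_means <- toy %>%
  group_by(period_code) %>%
  summarise(mean_domgross = mean(domgross))

group_means
```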
If you need another task while the class is progressing, add a title, change the axis labels, and change the legend title of the last plot that we made.
You can make group comparisons using box plots. Use the aes() function to specify the group using the color or fill argument, and the variables of interest using the x and y arguments. Instead of outlining this boxplot in color, fill the boxplot in color. Add a legend title. Use the layer scale_fill_discrete() when fill is used in the aes. Use the layer scale_color_discrete() when color is used in the aes.
ggplot(data = bech_data, aes(x = period_code, y = domgross, color = binary)) +
geom_boxplot(notch = TRUE)
Scatter plots are useful for visualizing how two sets of continuous data are related. Let’s plot domestic gross revenue vs. international gross revenue in the Bechdel dataset.
ggplot(data = bech_data, aes(x = domgross, y = intgross)) +
geom_point()
You can change the size, shape, and color of points in your scatter plot.
ggplot(data = bech_data, aes(x = domgross, y = intgross)) +
geom_point(size = 2, shape = 6, color = 'blue')
To label the points that represent the top five movies based on international gross revenue, first arrange the international gross revenue column in descending order, then use the slice() function to limit the data frame to the top five movies. Then add an additional geom_point() layer and a geom_text() layer that use the topfive data, as shown below.
topfive <- bech_data %>%
arrange(desc(intgross)) %>%
slice(1:5)
ggplot(data = bech_data, aes(x = domgross, y = intgross)) +
geom_point() +
geom_point(data=topfive, aes(x=domgross, y = intgross), color = 'red') +
geom_text(data = topfive, aes(label = title), nudge_y = 100)
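With dplyr 1.0.0 or later, slice_max() combines the sort-then-slice steps into one call. The sketch below applies it to a hypothetical data frame; toy, its titles, and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical films with international gross revenue in millions
toy <- data.frame(
  title    = paste("Film", LETTERS[1:7]),
  intgross = c(120, 950, 300, 2780, 45, 1500, 610)
)

# Keep the three rows with the largest intgross values
top_three <- toy %>% slice_max(intgross, n = 3)

top_three
```

By default slice_max() keeps ties, so it can return more than n rows when several films share the cutoff value; set with_ties = FALSE to force exactly n.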
Add a regression line to any scatter plot with geom_smooth(method = ‘lm’).
ggplot(data = bech_data, aes(x = domgross, y = intgross)) +
geom_point() +
geom_smooth(method = 'lm')
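The line drawn with method = 'lm' is an ordinary least-squares fit, so the same intercept and slope can be obtained directly from lm(). The sketch below uses a hypothetical data frame built with an exact linear relationship, which makes the fitted coefficients easy to recognize.

```r
# Hypothetical revenues in millions: intgross is exactly 2 * domgross + 5
toy <- data.frame(domgross = c(10, 20, 30, 40, 50))
toy$intgross <- 2 * toy$domgross + 5

# Same model form that the smoothing layer fits: y ~ x
fit <- lm(intgross ~ domgross, data = toy)

coef(fit)  # intercept first, then the slope for domgross
```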
If you need another task, change the regression line’s color to red and its linetype to dashed in the previous scatter plot.
You can also display scatter plots by groups, distinguished by color or shape, as seen below. See if you can add a regression line to one of these plots.
# Scatter plot by group
plot1 <- ggplot(data = bech_data, aes(x = domgross, y = intgross, color = binary)) +
geom_point()
plot2 <- ggplot(data = bech_data, aes(x = domgross, y = intgross, shape = binary)) +
geom_point()
ggarrange(plot1,plot2,ncol=2)
You can also add rugs, or short tick marks along the axes that each mark a single data point, to the plot using geom_rug().
# Add rugs to scatter plot
ggplot(data = bech_data, aes(x = domgross, y = intgross, color = binary)) +
geom_point() +
geom_rug()
You can also create bar plots or jitter plots. The bar plot below shows how often each reason for failing the test occurred, as well as how many movies passed. The jitter plot, however, shows the individual points from the dataset as they relate to the binary column mapped to the y-axis.
ggplot(data = bech_data, aes(x = clean_test)) +
geom_bar()
ggplot(data = bech_data, aes(x = clean_test, y = binary)) +
geom_jitter()
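geom_bar() counts the rows in each category for you. To see the numbers behind the bars, the same tally can be produced with dplyr’s count(). A minimal sketch on a hypothetical column of test outcomes (toy and test_counts are made-up names):

```r
library(dplyr)

# Hypothetical outcomes for seven films
toy <- data.frame(
  clean_test = c("ok", "notalk", "ok", "men", "ok", "notalk", "nowomen")
)

# One row per category, with n holding the number of films in it
test_counts <- toy %>% count(clean_test)

test_counts
```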
CRDDS:
Consult hours: Tuesdays 12-1 and Thursdays 1-2
Events: http://www.colorado.edu/crdds/events
Listserv: https://lists.colorado.edu/sympa/subscribe/crdds-news
OSF: https://osf.io/36mj4/
Laboratory for Interdisciplinary Statistical Analysis (LISA): http://www.colorado.edu/lab/lisa/resources
Online:
dplyr cheat sheet - data wrangling https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Data Visualization Resources
ggplot cheat sheet https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
qplots http://www.sthda.com/english/wiki/qplot-quick-plot-with-ggplot2-r-software-and-data-visualization
Histograms/Density plots http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization
R Markdown Cheatsheet https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Data Carpentry http://www.datacarpentry.org/R-genomics/01-intro-to-R.html
R manuals by CRAN https://cran.r-project.org/manuals.html
Basic Reference Card https://cran.r-project.org/doc/contrib/Short-refcard.pdf
R for Beginners (Emmanuel Paradis) https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
The R Guide (W. J. Owen) https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
An Introduction to R (Longhow Lam) https://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf
Cookbook for R http://www.cookbook-r.com/
Advanced R (Hadley Wickham) http://adv-r.had.co.nz/
rseek: search most online R documentation and discussion forums http://rseek.org/
The R Inferno: useful for troubleshooting errors http://www.burns-stat.com/documents/books/the-r-inferno/
Google: endless blogs, posted Q & A, tutorials, and reference guides, where you’re often directed to sites such as Stackoverflow, Crossvalidated, and the R-help mailing list.
YouTube R channel https://www.youtube.com/user/TheLearnR
R Programming in Coursera https://www.coursera.org/learn/r-programming
Various R videos http://jeromyanglim.blogspot.co.uk/2010/05/videos-on-data-analysis-with-r.html
R for Data Science - Book http://r4ds.had.co.nz
Base R cheat sheet https://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf