R For Data Science Cheat Sheet: Tidyverse for Beginners

Learn More R for Data Science Interactively at www.datacamp.com



The tidyverse is a collection of R packages for transforming and visualizing data. All packages of the tidyverse share an underlying philosophy and common APIs.

The core packages are:

• ggplot2 implements the grammar of graphics. You can use it to visualize your data.

• dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges.

• tidyr helps you create tidy data, that is, data where each variable is a column, each observation is a row, and each value is a cell.

• readr is a fast and friendly way to read rectangular data.

• purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.

• tibble is a modern re-imagining of the data frame.

• stringr provides a cohesive set of functions designed to make working with strings as easy as possible.

• forcats provides a suite of useful tools that solve common problems with factors.

You can install the complete tidyverse with:

> install.packages("tidyverse")

Then, load the core tidyverse and make it available in your current R session by running:

> library(tidyverse)

Note: there are many other tidyverse packages with more specialised usage. They are not loaded automatically with library(tidyverse), so you’ll need to load each one with its own call to library().

Useful Functions

> tidyverse_conflicts()   # Conflicts between tidyverse and other packages
> tidyverse_deps()        # List all tidyverse dependencies
> tidyverse_logo()        # Get tidyverse logo, using ASCII or unicode characters
> tidyverse_packages()    # List all tidyverse packages
> tidyverse_update()      # Update tidyverse packages

Loading in the data

> library(datasets)    # Load the datasets package
> library(gapminder)   # Load the gapminder package
> attach(iris)         # Attach iris data to the R search path

filter() allows you to select a subset of rows in a data frame.

> # Select iris data of species "virginica"
> iris %>%
    filter(Species == "virginica")
> # Select iris data of species "virginica" and sepal length greater than 6
> iris %>%
    filter(Species == "virginica", Sepal.Length > 6)
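The filter() calls on the iris data separate their conditions with a comma, which dplyr treats as a logical AND. The sketch below illustrates this and two related patterns, OR with | and set membership with %in%. The object names (both_comma, both_and, either, two_species) are made up for illustration.

```r
library(dplyr)

# Comma-separated conditions must all hold (logical AND)
both_comma <- iris %>%
  filter(Species == "virginica", Sepal.Length > 6)

# The same filter written with an explicit &
both_and <- iris %>%
  filter(Species == "virginica" & Sepal.Length > 6)

# Use | to keep rows where at least one condition holds (logical OR)
either <- iris %>%
  filter(Species == "virginica" | Sepal.Length > 6)

# Use %in% to match any of several values
two_species <- iris %>%
  filter(Species %in% c("setosa", "virginica"))
```

Note that string comparisons are case-sensitive: Species == "Virginica" matches no rows, because the factor level is spelled "virginica".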

arrange() sorts the observations in a dataset in ascending or descending order based on one of its variables.

> # Sort in ascending order of sepal length
> iris %>%
    arrange(Sepal.Length)
> # Sort in descending order of sepal length
> iris %>%
    arrange(desc(Sepal.Length))

Combine multiple dplyr verbs in a row with the pipe operator %>%:

> # Filter for species "virginica", then arrange in descending order of sepal length
> iris %>%
    filter(Species == "virginica") %>%
    arrange(desc(Sepal.Length))
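To make the role of the pipe concrete, here is a small sketch comparing nested function calls with the equivalent piped form. The pipe passes the result on its left as the first argument of the function on its right. The names nested and piped are illustrative only.

```r
library(dplyr)

# Nested calls are read from the inside out
nested <- arrange(filter(iris, Species == "virginica"), desc(Sepal.Length))

# The same steps with the pipe are read top to bottom
piped <- iris %>%
  filter(Species == "virginica") %>%
  arrange(desc(Sepal.Length))

# Both forms give the same data frame
identical(nested, piped)
```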

mutate() allows you to update or create new columns of a data frame.

> # Change Sepal.Length to be in millimeters
> iris %>%
    mutate(Sepal.Length = Sepal.Length * 10)
> # Create a new column called SLMm
> iris %>%
    mutate(SLMm = Sepal.Length * 10)

Combine the verbs filter(), arrange(), and mutate():

> iris %>%
    filter(Species == "virginica") %>%
    mutate(SLMm = Sepal.Length * 10) %>%
    arrange(desc(SLMm))

summarize() allows you to turn many observations into a single data point.

> # Summarize to find the median sepal length
> iris %>%
    summarize(medianSL = median(Sepal.Length))
> # Filter for virginica, then summarize the median sepal length
> iris %>%
    filter(Species == "virginica") %>%
    summarize(medianSL = median(Sepal.Length))

You can also summarize multiple variables at once:

> iris %>%
    filter(Species == "virginica") %>%
    summarize(medianSL = median(Sepal.Length),
              maxSL = max(Sepal.Length))

group_by() allows you to summarize within groups instead of summarizing the entire dataset:

> # Find median and max sepal length of each species
> iris %>%
    group_by(Species) %>%
    summarize(medianSL = median(Sepal.Length),
              maxSL = max(Sepal.Length))
> # Find median and max petal length of each species with sepal length > 6
> iris %>%
    filter(Sepal.Length > 6) %>%
    group_by(Species) %>%
    summarize(medianPL = median(Petal.Length),
              maxPL = max(Petal.Length))
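When summarizing within groups it often helps to see how many rows each group contains. The sketch below adds a row count with n() alongside the median and max. The names species_summary and species_counts are illustrative, not part of the original cheat sheet.

```r
library(dplyr)

# One output row per species, with the group size next to the summaries
species_summary <- iris %>%
  group_by(Species) %>%
  summarize(n = n(),
            medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))

# count() is a shortcut for group_by() followed by summarize(n = n())
species_counts <- iris %>%
  count(Species)
```

The result has one row for each value of the grouping variable, so species_summary has three rows for the three iris species.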

Scatter plot

Scatter plots allow you to compare two variables within your data. To do this with ggplot2, you use geom_point().

> iris_small <- iris %>%
    filter(Sepal.Length > 5)
> # Compare petal width and length
> ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width)) +
    geom_point()

Additional Aesthetics

• Color

> ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    geom_point()

• Size

> ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width, color = Species, size = Sepal.Length)) +
    geom_point()

Faceting

> ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width)) +
    geom_point() +
    facet_wrap(~Species)

Line Plots

> by_year <- gapminder %>%
    group_by(year) %>%
    summarize(medianGdpPerCap = median(gdpPercap))
> ggplot(by_year, aes(x = year, y = medianGdpPerCap)) +
    geom_line() +
    expand_limits(y = 0)

Bar Plots

> by_species <- iris %>%
    filter(Sepal.Length > 6) %>%
    group_by(Species) %>%
    summarize(medianPL = median(Petal.Length))
> ggplot(by_species, aes(x = Species, y = medianPL)) +
    geom_col()

Histograms

> ggplot(iris_small, aes(x = Petal.Length)) +
    geom_histogram()

Box Plots

> ggplot(iris_small, aes(x = Species, y = Sepal.Width)) +
    geom_boxplot()
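A note on the geom_histogram() example: when no bin settings are given, ggplot2 falls back to 30 bins and prints a message suggesting you pick a better value. The sketch below sets binwidth explicitly. The value 0.5 is an arbitrary choice for illustration, not a recommendation from the original cheat sheet.

```r
library(dplyr)
library(ggplot2)

# Same subset as used in the plotting examples
iris_small <- iris %>%
  filter(Sepal.Length > 5)

# Each bar now covers 0.5 units of petal length
p <- ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.5)

# Typing p (or print(p)) at the console draws the plot
```

You could instead pass bins = to fix the number of bars rather than their width.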
