Instructors: Meike Niederhausen, PhD & Jessica Minnier, PhD
OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop
.Rproj file.
library(tidyverse)
library(janitor)
penguins <- read_csv("penguins")
%>%
The pipe operator %>% is part of the tidyverse and strings together commands to be performed sequentially.
penguins %>% head(n=3) # pronounce %>% as "then"
## # A tibble: 3 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  1689 Adelie  Torge…           39.1          18.7              181        3750
## 2  4274 Adelie  Torge…           NA            17.4              186        3800
## 3  4539 Adelie  Torge…           40.3          18                195        3250
## # … with 2 more variables: sex <chr>, year <dbl>
penguins %>% head(n=2) %>% summary()
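To make the "then" reading concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The `toy` tibble is a hypothetical stand-in, not the workshop's penguins file.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration only)
toy <- tibble(id = 1:5, mass = c(3750, 3800, 3250, 3450, 3650))

# Nested call: read from the inside out
nested <- nrow(head(toy, n = 2))

# Piped call: read left to right ("take toy, then head, then nrow")
piped <- toy %>% head(n = 2) %>% nrow()

identical(nested, piped)
```

Both forms run the same functions in the same order; the pipe only changes how the code reads.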
$ vs summarize()
We saw how to summarize a vector pulled with $, but there are easier ways to summarize multiple columns at once.
mean(penguins$body_mass_g)
## [1] 4201.754
median(penguins$body_mass_g)
## [1] 4050
penguins %>% summarize(mean(body_mass_g), median(body_mass_g))
## # A tibble: 1 x 2
##   `mean(body_mass_g)` `median(body_mass_g)`
##                 <dbl>                 <dbl>
## 1               4202.                  4050
summarize() with NA
Use na.rm = TRUE if you need it.
penguins %>% summarize(mean_mass = mean(body_mass_g), mean_len = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 1 x 2
##   mean_mass mean_len
##       <dbl>    <dbl>
## 1     4202.     44.0
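As a small illustration of why na.rm matters, the sketch below summarizes a column with one missing value both ways. The `toy` tibble and its column names are assumptions made for this example.

```r
library(dplyr)

# Hypothetical toy data with one missing bill length
toy <- tibble(mass = c(3750, 3800, 3250),
              bill_len = c(39.1, NA, 40.3))

res <- toy %>%
  summarize(len_with_na = mean(bill_len),                # NA: a missing value poisons the mean
            len_no_na   = mean(bill_len, na.rm = TRUE),  # mean of the two observed values
            n_missing   = sum(is.na(bill_len)))          # how many values were dropped
res
```

Counting the missing values alongside the summary makes it clear how much data the na.rm = TRUE result is based on.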
summarize() (1/2)
group_by()
group_by() is very powerful, see data wrangling cheatsheet
# summary of all data as a whole
penguins %>% summarize(mass_mean = mean(body_mass_g), mass_sd = sd(body_mass_g), mass_cv = sd(body_mass_g)/mean(body_mass_g))
## # A tibble: 1 x 3
##   mass_mean mass_sd mass_cv
##       <dbl>   <dbl>   <dbl>
## 1     4202.    802.   0.191
summarize() (2/2)
group_by()
group_by() is very powerful, see data wrangling cheatsheet
# summary by group variable
penguins %>% group_by(species) %>% summarize(n_per_group = n(), mass_mean = mean(body_mass_g), mass_sd = sd(body_mass_g), mass_cv = sd(body_mass_g)/mean(body_mass_g))
## # A tibble: 3 x 5
##   species   n_per_group mass_mean mass_sd mass_cv
##   <chr>           <int>     <dbl>   <dbl>   <dbl>
## 1 Adelie            151     3701.    459.  0.124
## 2 Chinstrap          68     3733.    384.  0.103
## 3 Gentoo            123     5076.    504.  0.0993
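The difference between an overall and a grouped summary can be checked by hand on a tiny dataset. This is a sketch on hypothetical data (`toy`, with made-up group labels "A" and "B").

```r
library(dplyr)

# Hypothetical toy data: two groups of unequal size
toy <- tibble(species = c("A", "A", "B", "B", "B"),
              mass    = c(10, 20, 30, 40, 50))

# One row for the whole data set
overall <- toy %>% summarize(mass_mean = mean(mass))

# One row per group: group_by() changes what summarize() operates on
by_species <- toy %>%
  group_by(species) %>%
  summarize(n_per_group = n(), mass_mean = mean(mass))

overall
by_species
```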
summarize(across()) (1/3)
across() to summarize multiple variables (more examples)
penguins %>% summarize(across(c(body_mass_g, bill_depth_mm), mean))
## # A tibble: 1 x 2
##   body_mass_g bill_depth_mm
##         <dbl>         <dbl>
## 1       4202.          17.2
penguins %>% summarize(across(c(bill_length_mm, bill_depth_mm), mean, na.rm=TRUE))
## # A tibble: 1 x 2
##   bill_length_mm bill_depth_mm
##            <dbl>         <dbl>
## 1           44.0          17.2
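Newer dplyr releases (1.1.0 and later) deprecate passing extra arguments such as na.rm through across(). A sketch of the equivalent formula-style call follows; the `toy` tibble and its column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(bill_len = c(39.1, NA, 40.3),
              bill_dep = c(18.7, 17.4, 18.0))

# Supply na.rm inside a small anonymous function instead of through across()'s "..."
res <- toy %>%
  summarize(across(c(bill_len, bill_dep), ~ mean(.x, na.rm = TRUE)))
res
```

The `~ mean(.x, na.rm = TRUE)` form applies the same function to each selected column, with `.x` standing for the current column.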
summarize(across()) (2/3)
across() to summarize multiple variables and functions (more examples)
penguins %>% summarize(across(c(body_mass_g, bill_depth_mm), c(m = mean, sd = sd)))
## # A tibble: 1 x 4
##   body_mass_g_m body_mass_g_sd bill_depth_mm_m bill_depth_mm_sd
##           <dbl>          <dbl>           <dbl>            <dbl>
## 1         4202.           802.            17.2             1.97
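The output column names above follow the default column_function pattern. As a hedged sketch, the `.names` argument of across() can change that pattern; the data and names below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(mass = c(10, 20, 30), depth = c(1, 2, 3))

# A named list of functions; .names controls how output columns are labelled
res <- toy %>%
  summarize(across(c(mass, depth),
                   list(m = mean, sd = sd),
                   .names = "{.fn}_{.col}"))
names(res)
```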
summarize(across()) (3/3)
across() to summarize based on true/false conditions (more examples)
penguins %>% summarize(across(where(is.character), n_distinct))
## # A tibble: 1 x 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     3
penguins %>% summarize(across(where(is.numeric), min, na.rm=TRUE))
## # A tibble: 1 x 6
##      id bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##   <dbl>          <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1  1001           32.1          13.1               172        2700  2007
count()
penguins %>% count(island)
## # A tibble: 3 x 2
##   island        n
##   <chr>     <int>
## 1 Biscoe      167
## 2 Dream       124
## 3 Torgersen    51
penguins %>% count(species, island)
## # A tibble: 5 x 3
##   species   island        n
##   <chr>     <chr>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    51
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      123
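count() can be read as shorthand for a grouped summary with n(). The sketch below checks that on hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(island = c("Biscoe", "Dream", "Biscoe", "Dream", "Biscoe"))

# Shortcut
counted <- toy %>% count(island)

# Long form: group, then count rows per group
manual <- toy %>% group_by(island) %>% summarize(n = n())

counted
manual
```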
janitor package's tabyl function
# default table
penguins %>% tabyl(species)
##    species   n   percent
##     Adelie 151 0.4415205
##  Chinstrap  68 0.1988304
##     Gentoo 123 0.3596491
# output can be treated as tibble
penguins %>% tabyl(species) %>% select(-n)
##    species   percent
##     Adelie 0.4415205
##  Chinstrap 0.1988304
##     Gentoo 0.3596491
adorn_ your table!
penguins %>% tabyl(species) %>% adorn_totals("row") %>% adorn_pct_formatting(digits=2)
##    species   n percent
##     Adelie 151  44.15%
##  Chinstrap  68  19.88%
##     Gentoo 123  35.96%
##      Total 342 100.00%
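To see what each step contributes, this sketch builds a tabyl on a four-row hypothetical dataset and then adds a totals row. It assumes the janitor package is installed.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: three Adelie, one Gentoo
toy <- tibble(species = c("Adelie", "Adelie", "Gentoo", "Adelie"))

# tabyl() returns counts and proportions as a data frame
tab <- toy %>% tabyl(species)

# adorn_totals("row") appends a "Total" row
tab_total <- tab %>% adorn_totals("row")

tab
tab_total
```

Because the result is still a data frame, it can be piped into select(), filter(), or further adorn_ functions.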
tabyls
# default 2x2 table
penguins %>% tabyl(species, sex)
##    species female male NA_
##     Adelie     73   73   5
##  Chinstrap     34   34   0
##     Gentoo     58   61   4
What adornments does the tabyl to the right have?
penguins %>% tabyl(species, sex) %>% adorn_percentages(denominator = "col") %>% adorn_totals("row") %>% adorn_pct_formatting(digits = 1) %>% adorn_ns()
##    species       female         male        NA_
##     Adelie   44.2% (73)   43.5% (73)  55.6% (5)
##  Chinstrap   20.6% (34)   20.2% (34)   0.0% (0)
##     Gentoo   35.2% (58)   36.3% (61)  44.4% (4)
##      Total 100.0% (165) 100.0% (168) 100.0% (9)
Base R has a table function, but it is clunkier and the output is not a data frame (or tibble).
tabyls
penguins %>% tabyl(species, island, sex)
## $female
##    species Biscoe Dream Torgersen
##     Adelie     22    27        24
##  Chinstrap      0    34         0
##     Gentoo     58     0         0
## 
## $male
##    species Biscoe Dream Torgersen
##     Adelie     22    28        23
##  Chinstrap      0    34         0
##     Gentoo     61     0         0
## 
## $NA_
##    species Biscoe Dream Torgersen
##     Adelie      0     1         4
##  Chinstrap      0     0         0
##     Gentoo      4     0         0
Continue adding code chunks to your Rmd (or, start a new one! But remember to load the libraries and data at the top.)
How many different years are in the data? (Hint: use tabyl() or n_distinct())
Count the number of penguins measured each year.
Calculate the median body mass by each species and sex subgroup. Use summarize() and group_by() to do this.
Create a 2x2 table of number of penguins measured in each year by each island.
filter() options
Subset rows of data by specifying conditions within filter()
>, <, >=, <=
==
& (and)
| (or)
is.na() to filter based on missing values
%in% to filter based on group membership
! in front negates the statement, as in !is.na(sex) or !(species %in% c("Adelie", "Gentoo"))
penguins %>% filter(bill_length_mm > 55)
## # A tibble: 5 x 9
##      id species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##   <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
## 1  4026 Gentoo  Biscoe           59.6          17                230        6050
## 2  2415 Gentoo  Biscoe           55.9          17                228        5600
## 3  4629 Gentoo  Biscoe           55.1          16                230        5850
## 4  2009 Chinst… Dream            58            17.8              181        3700
## 5  4452 Chinst… Dream            55.8          19.8              207        4000
## # … with 2 more variables: sex <chr>, year <dbl>
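The operators listed above can be combined freely. The sketch below applies a few of them to a hypothetical four-row tibble (an assumption for illustration) so the results are easy to verify by eye.

```r
library(dplyr)

# Hypothetical toy data with one missing sex
toy <- tibble(species = c("Adelie", "Gentoo", "Chinstrap", "Gentoo"),
              sex     = c("male", NA, "female", "female"),
              mass    = c(3700, 5100, 3600, 4900))

# | keeps rows where either condition is TRUE
heavy_or_adelie <- toy %>% filter(mass > 5000 | species == "Adelie")

# ! negates: keep rows where sex is not missing
known_sex <- toy %>% filter(!is.na(sex))

# Negated group membership
not_adelie_gentoo <- toy %>% filter(!(species %in% c("Adelie", "Gentoo")))

# A comparison with NA evaluates to NA, and filter() drops those rows
females <- toy %>% filter(sex == "female")
```

Note the last case: the row with a missing sex is silently excluded, which is easy to overlook when counting rows afterwards.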
filter() practice
What do these commands do? Try them out:
penguins %>% filter(island == "Torgersen")
penguins %>% filter(bill_length_mm/bill_depth_mm > 3) # can do math
penguins %>% filter((body_mass_g < 3000) | (body_mass_g > 6000))
# filter on multiple variables:
penguins %>% filter(body_mass_g < 3000, bill_depth_mm < 20, sex == "female")
penguins %>% filter(body_mass_g < 3000 & bill_depth_mm < 20 & sex == "female")
penguins %>% filter(body_mass_g < 3000 | bill_depth_mm < 20 | sex == "female")
penguins %>% filter(year == 2008) # note the use of == instead of just =
penguins %>% filter(sex == "female")
penguins %>% filter(!(species == "Adelie"))
penguins %>% filter(species %in% c("Chinstrap", "Gentoo"))
penguins %>% filter(is.na(sex))
penguins %>% filter(!is.na(sex))
select() columns
penguins %>% select(id, island, species, body_mass_g)
## # A tibble: 342 x 4
##       id island    species body_mass_g
##    <dbl> <chr>     <chr>         <dbl>
##  1  1689 Torgersen Adelie         3750
##  2  4274 Torgersen Adelie         3800
##  3  4539 Torgersen Adelie         3250
##  4  2435 Torgersen Adelie         3450
##  5  2326 Torgersen Adelie         3650
##  6  2637 Torgersen Adelie         3625
##  7  4443 Torgersen Adelie         4675
##  8  2102 Torgersen Adelie         3475
##  9  2975 Torgersen Adelie         4250
## 10  3966 Torgersen Adelie         3300
## # … with 332 more rows
There are many ways to select a set of variable names (columns):
var1:var20: all columns from var1 to var20
-var1: remove the column var1
-(var1:var20): remove all columns from var1 to var20
contains("mm"), contains("_"): all variable names that contain the specified string
starts_with("a") or ends_with("last"): all variable names that start or end with the specified string
See other examples in the data wrangling cheatsheet.
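The selection helpers above can be tried on a small example. This sketch uses a hypothetical two-row tibble whose column names mimic the penguins data.

```r
library(dplyr)

# Hypothetical toy data with penguin-like column names
toy <- tibble(id = 1:2,
              species = c("Adelie", "Gentoo"),
              bill_length_mm = c(39.1, 47.5),
              bill_depth_mm = c(18.7, 15.0),
              body_mass_g = c(3750, 5000))

range_cols  <- toy %>% select(id:bill_length_mm)   # a contiguous range of columns
drop_id     <- toy %>% select(-id)                 # everything except id
mm_cols     <- toy %>% select(contains("mm"))      # names containing "mm"
mixed_order <- toy %>% select(ends_with("_g"), starts_with("bill"))  # output follows argument order

names(mixed_order)
```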
select() practice
Which columns are selected & in what order using these commands?
First guess and then try them out.
penguins %>% select(id:bill_length_mm)
penguins %>% select(where(is.character))
penguins %>% select(where(is.numeric))
penguins %>% select(-id,-species)
penguins %>% select(-(id:island))
penguins %>% select(contains("bill"))
penguins %>% select(starts_with("s"))
penguins %>% select(-contains("mm"))
relocate() columns
Accepts the same options as select(), plus special ones such as .before and .after
penguins %>% relocate(year, body_mass_g)
## # A tibble: 342 x 9
##     year body_mass_g    id species island bill_length_mm bill_depth_mm
##    <dbl>       <dbl> <dbl> <chr>   <chr>           <dbl>         <dbl>
##  1  2007        3750  1689 Adelie  Torge…           39.1          18.7
##  2  2007        3800  4274 Adelie  Torge…           NA            17.4
##  3  2007        3250  4539 Adelie  Torge…           40.3          18
##  4  2007        3450  2435 Adelie  Torge…           36.7          19.3
##  5  2007        3650  2326 Adelie  Torge…           39.3          20.6
##  6  2007        3625  2637 Adelie  Torge…           38.9          17.8
##  7  2007        4675  4443 Adelie  Torge…           NA            19.6
##  8  2007        3475  2102 Adelie  Torge…           34.1          18.1
##  9  2007        4250  2975 Adelie  Torge…           42            20.2
## 10  2007        3300  3966 Adelie  Torge…           37.8          17.1
## # … with 332 more rows, and 2 more variables: flipper_length_mm <dbl>,
## #   sex <chr>
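A minimal sketch of how relocate() reorders columns without dropping any, using a hypothetical one-row tibble:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1, species = "Adelie", year = 2007, mass = 3750)

to_front  <- toy %>% relocate(year)                      # default: move to the front
to_end    <- toy %>% relocate(id, .after = last_col())   # move to the end
before_sp <- toy %>% relocate(mass, .before = species)   # place before a named column

names(to_front)
names(to_end)
names(before_sp)
```

Unlike select(), every column is kept; only the order changes.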
relocate() practice
What order are the columns in using these commands?
First guess and then try them out.
penguins %>% relocate(species:bill_length_mm)
penguins %>% relocate(where(is.character))
penguins %>% relocate(where(is.numeric))
penguins %>% relocate(flipper_length_mm, .before = bill_length_mm)
penguins %>% relocate(species, .after = island)
penguins %>% relocate(species, .after = last_col())
<-
assignment operator to save a modified data frame
penguins_sub <- penguins %>% select(id:island, sex)
penguins_sub
## # A tibble: 342 x 4
##       id species island    sex
##    <dbl> <chr>   <chr>     <chr>
##  1  1689 Adelie  Torgersen male
##  2  4274 Adelie  Torgersen female
##  3  4539 Adelie  Torgersen female
##  4  2435 Adelie  Torgersen female
##  5  2326 Adelie  Torgersen male
##  6  2637 Adelie  Torgersen female
##  7  4443 Adelie  Torgersen male
##  8  2102 Adelie  Torgersen <NA>
##  9  2975 Adelie  Torgersen <NA>
## 10  3966 Adelie  Torgersen <NA>
## # … with 332 more rows
mutate() the data
Use mutate() to add new columns to a tibble
penguins <- penguins %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm)
# use = (not <- or ==) to define new variable
penguins %>% select(bill_ratio, bill_length_mm, bill_depth_mm)
## # A tibble: 342 x 3
##    bill_ratio bill_length_mm bill_depth_mm
##         <dbl>          <dbl>         <dbl>
##  1       2.09           39.1          18.7
##  2      NA              NA            17.4
##  3       2.24           40.3          18
##  4       1.90           36.7          19.3
##  5       1.91           39.3          20.6
##  6       2.19           38.9          17.8
##  7      NA              NA            19.6
##  8       1.88           34.1          18.1
##  9       2.08           42            20.2
## 10       2.21           37.8          17.1
## # … with 332 more rows
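Two details of mutate() are worth seeing on a small case: a column created earlier in the same call can be used later in it, and missing inputs produce missing outputs. The tibble and the 2.1 cutoff below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing bill length
toy <- tibble(bill_len = c(39.1, NA, 40.3),
              bill_dep = c(18.7, 17.4, 18.0))

toy2 <- toy %>%
  mutate(bill_ratio = bill_len / bill_dep,   # NA wherever bill_len is NA
         ratio_big  = bill_ratio > 2.1)      # can reference bill_ratio right away

toy2
ncol(toy)   # the original is unchanged until the result is assigned back
```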
mutate() practice
What do the following commands do?
First guess and then try them out.
penguins <- penguins %>% mutate(bill_long = (bill_length_mm > 45))
penguins <- penguins %>% mutate(male = (sex == "male"))
penguins <- penguins %>% mutate(male2 = 1 * (sex == "male"))
rename() columns
rename(new_name = old_name)
Code renames the column, but just prints output without saving the rename:
# This does not save the new name
penguins %>% rename(record = id)
## # A tibble: 342 x 10
##    record species island bill_length_mm bill_depth_mm flipper_length_…
##     <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>
##  1   1689 Adelie  Torge…           39.1          18.7              181
##  2   4274 Adelie  Torge…           NA            17.4              186
##  3   4539 Adelie  Torge…           40.3          18                195
##  4   2435 Adelie  Torge…           36.7          19.3              193
##  5   2326 Adelie  Torge…           39.3          20.6              190
##  6   2637 Adelie  Torge…           38.9          17.8              181
##  7   4443 Adelie  Torge…           NA            19.6              195
##  8   2102 Adelie  Torge…           34.1          18.1              193
##  9   2975 Adelie  Torge…           42            20.2              190
## 10   3966 Adelie  Torge…           37.8          17.1              186
## # … with 332 more rows, and 4 more variables: body_mass_g <dbl>, sex <chr>,
## #   year <dbl>, bill_ratio <dbl>
Code renames the column and saves the result as a new tibble, penguins2 (assign back to penguins instead to overwrite the original):
penguins2 <- penguins %>% rename(record = id)
penguins2
## # A tibble: 342 x 10
##    record species island bill_length_mm bill_depth_mm flipper_length_…
##     <dbl> <chr>   <chr>           <dbl>         <dbl>            <dbl>
##  1   1689 Adelie  Torge…           39.1          18.7              181
##  2   4274 Adelie  Torge…           NA            17.4              186
##  3   4539 Adelie  Torge…           40.3          18                195
##  4   2435 Adelie  Torge…           36.7          19.3              193
##  5   2326 Adelie  Torge…           39.3          20.6              190
##  6   2637 Adelie  Torge…           38.9          17.8              181
##  7   4443 Adelie  Torge…           NA            19.6              195
##  8   2102 Adelie  Torge…           34.1          18.1              193
##  9   2975 Adelie  Torge…           42            20.2              190
## 10   3966 Adelie  Torge…           37.8          17.1              186
## # … with 332 more rows, and 4 more variables: body_mass_g <dbl>, sex <chr>,
## #   year <dbl>, bill_ratio <dbl>
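The printing-versus-saving distinction can be verified directly by inspecting column names before and after assignment. A sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:2, mass = c(3750, 3800))

# Prints a renamed copy; toy itself is untouched
toy %>% rename(record = id)
names_after_print <- names(toy)

# Assigning the result keeps the rename
toy_renamed <- toy %>% rename(record = id)
names_after_assign <- names(toy_renamed)

names_after_print
names_after_assign
```

The same rule applies to every dplyr verb in this workshop: nothing is modified in place, so results must be assigned to be kept.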
Create a new Rmd or continue in your current Rmd.
Create a dataset for just the Torgersen island penguins that are female.
Restrict the data to just Torgersen female penguins that weigh more than 3500 g.
Restrict the dataset from the previous step to include just the columns with the original body measurements.
Add a column for the difference in the flipper and bill lengths, and call it flipper_bill_diff.
How many rows and columns does your final dataset have?
What are tidy data?
G. Grolemund & H. Wickham's R for Data Science
See BERD workshop Data Wrangling Part 1 slides for more info.
ggplot(data = penguins, aes(x = flipper_length_mm, y = bill_length_mm)) + geom_point()
ggplot(data = penguins, aes(x = flipper_length_mm)) + geom_histogram()
Start simple, then slowly add layers, colors, etc.
You are building a plot layer by layer, joining each layer with +
At the beginning, just copy and paste examples that you want to edit until you understand what each function does
It will take some trial and error!
Watch BERD ggplot video for more instruction, and many customizations
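The layer-by-layer idea can be made explicit by saving each stage of a plot as its own object. This is a sketch on a hypothetical five-row data frame, assuming ggplot2 is installed; the object names p1 to p3 are illustrative.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(flipper = c(181, 186, 195, 210, 230),
                  bill    = c(39.1, 39.5, 40.3, 46.1, 50.0))

# Stage 1: data and aesthetics only (axes, but nothing drawn)
p1 <- ggplot(data = toy, aes(x = flipper, y = bill))

# Stage 2: add a geom layer to draw the points
p2 <- p1 + geom_point()

# Stage 3: add labels and a theme; these adjust the plot but are not geom layers
p3 <- p2 +
  labs(title = "Flipper & bill length",
       x = "Flipper length (mm)",
       y = "Bill length (mm)") +
  theme_bw()

p3
```

Printing p1, p2 and p3 in turn shows what each + contributes, which helps when a copied example does something unexpected.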
ggplot(data = penguins,
       aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(title = "Flipper & bill length",
       subtitle = "Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Bill length (mm)") +
  scale_color_viridis_d(name = "Penguin species") +
  theme_bw()
ggplot(data = penguins,
       aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(alpha = 0.5, position = "identity") +
  labs(title = "Flipper length",
       x = "Flipper length (mm)",
       y = "Frequency") +
  scale_fill_viridis_d(name = "Penguin species") +
  theme_minimal()
ggplot(data = penguins,
       aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(color = "darkgrey", width = 0.3, show.legend = FALSE) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
              position = position_jitter(width = 0.2, seed = 0)) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  theme_minimal() +
  labs(x = "Species", y = "Flipper length (mm)")
ggplot(data = penguins, aes(x = species, fill = sex)) + geom_bar()
ggplot(data = penguins, aes(x = species, fill = sex)) + geom_bar(position = "dodge")
pct_data <- penguins %>%
  count(species, sex) %>%
  # filter(!is.na(sex)) %>%
  group_by(species) %>%
  mutate(pct = 100*n/sum(n))
pct_data
## # A tibble: 8 x 4
## # Groups:   species [3]
##   species   sex        n   pct
##   <chr>     <chr>  <int> <dbl>
## 1 Adelie    female    73 48.3
## 2 Adelie    male      73 48.3
## 3 Adelie    <NA>       5  3.31
## 4 Chinstrap female    34 50
## 5 Chinstrap male      34 50
## 6 Gentoo    female    58 47.2
## 7 Gentoo    male      61 49.6
## 8 Gentoo    <NA>       4  3.25
ggplot(data = pct_data, aes(x = species, y = pct, fill = sex)) + geom_col()+ ylab("Percent")
ggplot(data = pct_data, aes(x = species, y = pct, fill = sex)) + geom_col(position = "dodge") + ylab("Percent")
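The percentages plotted above come from a grouped mutate(): inside group_by(species), sum(n) is the species total, so each row is divided by its own group's total. A sketch on hypothetical data makes the arithmetic easy to check.

```r
library(dplyr)

# Hypothetical toy data: group "A" has three rows, group "B" has one
toy <- tibble(species = c("A", "A", "A", "B"),
              sex     = c("f", "m", "m", "f"))

pct <- toy %>%
  count(species, sex) %>%
  group_by(species) %>%
  mutate(pct = 100 * n / sum(n)) %>%   # sum(n) is computed within each species
  ungroup()                            # drop grouping so later steps use all rows

pct
```

Unlike summarize(), a grouped mutate() keeps every row, which is what a bar chart with one bar segment per species and sex combination needs.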
Continue adding code chunks to your Rmd (or, start a new one! But remember to load the libraries and data at the top.)
Make a scatter plot of bill depth vs bill length.
Add + geom_smooth(method="lm") to the plot. What is this saying about the association between bill depth and length?
Now add color = species to the aesthetic aes(). Keep geom_smooth. How do the associations look now?
factor is a data type that saves character variables as categories (factor levels)
Factor data types are useful for making plots and necessary for some statistical modeling functions
We recommend using commands from the forcats package to work with factor data
See forcats cheatsheet and forcats vignette
factor()
penguins <- penguins %>% mutate(sex_fac = factor(sex))
levels(penguins$sex_fac) # factor levels are in alphanumeric order by default
## [1] "female" "male"
penguins %>% select(sex, sex_fac) %>% summary() # character vs. factor types
##      sex              sex_fac
##  Length:342         female:165
##  Class :character   male  :168
##  Mode  :character   NA's  :  9
penguins %>% select(sex, sex_fac) %>% str() # str for structure
## tibble [342 × 2] (S3: tbl_df/tbl/data.frame)
##  $ sex    : chr [1:342] "male" "female" "female" "female" ...
##  $ sex_fac: Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 NA NA NA ...
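The str() output above shows that a factor stores integer codes pointing at its levels. The base R sketch below, on a hypothetical four-element vector, shows those codes and how to set the level order explicitly instead of relying on the alphanumeric default.

```r
# Hypothetical toy vector with one missing value
sex <- c("male", "female", NA, "female")

# Default: levels sorted alphanumerically
sex_fac <- factor(sex)
levels(sex_fac)
codes_default <- as.integer(sex_fac)   # position of each value in levels(); NA stays NA

# Explicit level order chosen by the user
sex_fac2 <- factor(sex, levels = c("male", "female"))
codes_custom <- as.integer(sex_fac2)
```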
fct_relevel()
penguins <- penguins %>% mutate(species_fac = factor(species))
summary(penguins$species_fac) # levels are in alphanumeric order by default
##    Adelie Chinstrap    Gentoo
##       151        68       123
penguins <- penguins %>% mutate(species_fac = fct_relevel(species_fac, c("Adelie", "Gentoo", "Chinstrap")))
summary(penguins$species_fac) # levels are specified order
##    Adelie    Gentoo Chinstrap
##       151       123        68
penguins <- penguins %>%
  mutate(species_fac2 = fct_collapse(species_fac, # collapse levels
                                     Adelie = c("Adelie"),
                                     Other = c("Gentoo", "Chinstrap")))
penguins %>% select(species_fac, species_fac2) %>% summary()
##     species_fac  species_fac2
##  Adelie   :151   Adelie:151
##  Gentoo   :123   Other :191
##  Chinstrap: 68
penguins %>% tabyl(species_fac, species_fac2)
##  species_fac Adelie Other
##       Adelie    151     0
##       Gentoo      0   123
##    Chinstrap      0    68
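The two forcats functions used above also work on a standalone factor, which makes their effect easy to inspect. A sketch on a hypothetical four-element factor, assuming forcats is installed:

```r
library(forcats)

# Hypothetical toy factor; default levels are Adelie, Chinstrap, Gentoo
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Gentoo"))

# fct_relevel() moves the named level(s) to the front and keeps the rest in order
moved <- fct_relevel(species, "Gentoo")
levels(moved)

# fct_collapse() merges several old levels into one new level
collapsed <- fct_collapse(species, Other = c("Gentoo", "Chinstrap"))
levels(collapsed)
table(collapsed)
```

Only the level labels and their order change; the underlying observations stay in the same positions.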
save(penguins, file = "penguins.RData") # saves the penguins object to penguins.RData in the working directory
You can load .RData files using the load() command:
load("penguins.RData")
write_csv(penguins, file = "my_penguin_data.csv") # readr 1.4.0+ uses file =; the older path = argument is deprecated
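A round trip through both file types shows the practical difference: load() restores an object under its original name, while a CSV must be read back into a name of your choosing. This sketch uses hypothetical data and writes to a temporary directory; it assumes readr 2.0 or later for the show_col_types argument.

```r
library(readr)

# Hypothetical toy data
toy <- data.frame(id = 1:2, species = c("Adelie", "Gentoo"))

# .RData: saves the object together with its name
rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)
rm(toy)            # remove it from the workspace
load(rdata_path)   # restores an object called toy

# CSV: plain text that other software can also open
csv_path <- file.path(tempdir(), "toy.csv")
write_csv(toy, file = csv_path)
toy_csv <- read_csv(csv_path, show_col_types = FALSE)
```

Factor columns do not survive a CSV round trip as factors, so .RData is the safer choice when data types need to be preserved.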
Use ? in front of function name in console. Try this:
Use ?? (e.g. ??dplyr or ??read_csv) for searching all documentation in installed packages (including unloaded packages)
"Evaluation error: invalid type (closure) for variable '***'"
Recommended viewing order of BERD workshops:
Getting started:
Basic help with installation and using Rstudio
Some of this is drawn from materials in online books/lessons:
Jessica Minnier: minnier@ohsu.edu
Meike Niederhausen: niederha@ohsu.edu