Data Wrangling in R with the Tidyverse (Part 2)

Jessica Minnier, PhD & Meike Niederhausen, PhD
OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop

2019/04/18 (Part 1) & 2019/04/25 (Part 2)
and again! 2019/05/16 (Part 1) & 2019/05/23 (Part 2)

slides: bit.ly/berd_tidy2
pdf: bit.ly/berd_tidy2_pdf

1 / 59

Load files for today's workshopOpen the slides of this workshop: bit.ly/berd_tidy2
If you haven't already,Open the pre-workshop homework
Follow steps 1-5
Download zip folder
Open berd_tidyverse_project.Rproj

Open a new R script and run the commands to the right

# install.packages("tidyverse","janitor","glue")  
library(tidyverse)
library(lubridate)  
library(janitor)
library(glue)
demo_data <- read_csv("data/yrbss_demo.csv")
qn_data <- read_csv("data/yrbss_qn.csv")

Allison Horst
2 / 59

Learning objectives

Previously, in Part 1:

What is data wrangling?
A few good practices in R/RStudio
What is tidy data?
What is tidyverse?
Manipulate data

Part 2:

Summarizing data
Combine (join) data sets
Reshaping data: wide vs. long formats
Data cleaning:
- Missing data
- Strings/character vectors
- Dates
- Factors
- Messy names

3 / 59

More data wrangling

Alison Horst

4 / 59

Review Part 1: manipulating data

Columns (part 1 slides)

select() to subset columns
rename() to rename columns
mutate() to add new columns or change values within existing columns
- separate() and unite() are shortcuts for specific mutate type operations

Rows (part 1 slides)

filter() to subset rows
- na.omit() and distinct() are shortcuts for specific filter type operations
arrange() to order the data

5 / 59

Changing all columns at once (part 1 slides)

mutate_all(), rename_all(): applies a function to all columns:
mutate_all(FUNCTION, FUNCTION_ARGUMENTS)

# mutate_all changes the data in all columns
demo_data %>% mutate_all(as.character) %>% head(2)

# A tibble: 2 x 8
  record age          sex    grade race4 race7 bmi     stweight
  <chr>  <chr>        <chr>  <chr> <chr> <chr> <chr>   <chr>   
1 931897 15 years old Female 10th  White White 17.179  54.43   
2 333862 17 years old Female 12th  White White 20.2487 57.15

# rename_all changes all column names
demo_data %>% rename_all(str_sub, end = 2) %>% head(3)

# A tibble: 3 x 8
      re ag               se     gr    ra           ra             bm    st
   <dbl> <chr>            <chr>  <chr> <chr>        <chr>       <dbl> <dbl>
1 931897 15 years old     Female 10th  White        White        17.2  54.4
2 333862 17 years old     Female 12th  White        White        20.2  57.2
3  36253 18 years old or… Male   11th  Hispanic/La… Hispanic/L…  NA    NA

6 / 59

Changing some columns at once (part 1 slides)

mutate_at(), rename_at(): uses vars() to select specific variables to apply a function to i.e. mutate_at(vars(SELECT), FUNCTION, FUNCTION_ARGUMENTS)

# mutate_at changes the data in specified columns
demo_data %>% mutate_at(vars(contains("race"), sex), as.factor) %>% head(2)

# A tibble: 2 x 8
  record age          sex    grade race4 race7   bmi stweight
   <dbl> <chr>        <fct>  <chr> <fct> <fct> <dbl>    <dbl>
1 931897 15 years old Female 10th  White White  17.2     54.4
2 333862 17 years old Female 12th  White White  20.2     57.2

# rename_at changes specified column names
demo_data %>% rename_at(vars(record:grade),toupper) %>% head(3)

# A tibble: 3 x 8
  RECORD AGE             SEX    GRADE race4       race7        bmi stweight
   <dbl> <chr>           <chr>  <chr> <chr>       <chr>      <dbl>    <dbl>
1 931897 15 years old    Female 10th  White       White       17.2     54.4
2 333862 17 years old    Female 12th  White       White       20.2     57.2
3  36253 18 years old o… Male   11th  Hispanic/L… Hispanic/…  NA       NA

7 / 59

Changing some columns at once (part 1 slides)

mutate_if(), rename_if(), select_if(): uses a function that returns TRUE/FALSE to select columns and applies function on the TRUE columns:
mutate_if(BOOLEAN, FUNCTION, FUNCTION_ARGUMENTS)

demo_data %>% mutate_if(is.numeric, round, digits = 0) %>% head(3)

# A tibble: 3 x 8
  record age             sex    grade race4       race7        bmi stweight
   <dbl> <chr>           <chr>  <chr> <chr>       <chr>      <dbl>    <dbl>
1 931897 15 years old    Female 10th  White       White         17       54
2 333862 17 years old    Female 12th  White       White         20       57
3  36253 18 years old o… Male   11th  Hispanic/L… Hispanic/…    NA       NA

demo_data %>% rename_if(is.character, str_sub, end = 2) %>% head(3)

# A tibble: 3 x 8
  record ag              se     gr    ra          ra           bmi stweight
   <dbl> <chr>           <chr>  <chr> <chr>       <chr>      <dbl>    <dbl>
1 931897 15 years old    Female 10th  White       White       17.2     54.4
2 333862 17 years old    Female 12th  White       White       20.2     57.2
3  36253 18 years old o… Male   11th  Hispanic/L… Hispanic/…  NA       NA

8 / 59

Add one or more rows: `add_row()`

demo_data %>% add_row(record=100, age=NA, sex="Female", grade="9th") %>%
  arrange(record) %>% head(3)

# A tibble: 3 x 8
  record age   sex    grade race4           race7             bmi stweight
   <dbl> <chr> <chr>  <chr> <chr>           <chr>           <dbl>    <dbl>
1    100 <NA>  Female 9th   <NA>            <NA>               NA       NA
2  30592 <NA>  <NA>   <NA>  <NA>            <NA>               NA       NA
3  30593 <NA>  <NA>   9th   Hispanic/Latino Hispanic/Latino    NA       NA

demo_data %>% add_row(record=100:102, bmi=c(25,30,18)) %>%
  arrange(record) %>% head(3)

# A tibble: 3 x 8
  record age   sex   grade race4 race7   bmi stweight
   <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>    <dbl>
1    100 <NA>  <NA>  <NA>  <NA>  <NA>     25       NA
2    101 <NA>  <NA>  <NA>  <NA>  <NA>     30       NA
3    102 <NA>  <NA>  <NA>  <NA>  <NA>     18       NA

9 / 59

Add one or more columns: `add_column()`

demo_data %>% add_column(study_date = "2019-04-10", .after="record") %>%
  head(3)

# A tibble: 3 x 9
  record study_date age        sex    grade race4    race7     bmi stweight
   <dbl> <chr>      <chr>      <chr>  <chr> <chr>    <chr>   <dbl>    <dbl>
1 931897 2019-04-10 15 years … Female 10th  White    White    17.2     54.4
2 333862 2019-04-10 17 years … Female 12th  White    White    20.2     57.2
3  36253 2019-04-10 18 years … Male   11th  Hispani… Hispan…  NA       NA

demo_data %>% add_column(id = 1:nrow(demo_data), .before="record") %>%
  head(3)

# A tibble: 3 x 9
     id record age          sex    grade race4     race7       bmi stweight
  <int>  <dbl> <chr>        <chr>  <chr> <chr>     <chr>     <dbl>    <dbl>
1     1 931897 15 years old Female 10th  White     White      17.2     54.4
2     2 333862 17 years old Female 12th  White     White      20.2     57.2
3     3  36253 18 years ol… Male   11th  Hispanic… Hispanic…  NA       NA

10 / 59

Quick tips on summarizing data

categorical data

numerical data

janitor, dplyr

11 / 59

Frequency tables: `janitor` package's `tabyl` function

# default table
demo_data %>% tabyl(grade)

 grade    n percent valid_percent
  10th 4907 0.24535     0.2504338
  11th 4891 0.24455     0.2496172
  12th 4577 0.22885     0.2335919
   9th 5219 0.26095     0.2663570
  <NA>  406 0.02030            NA

# output can be treated as tibble
demo_data %>% tabyl(grade) %>% select(-n)

 grade percent valid_percent
  10th 0.24535     0.2504338
  11th 0.24455     0.2496172
  12th 0.22885     0.2335919
   9th 0.26095     0.2663570
  <NA> 0.02030            NA

adorn_ your table!

demo_data %>%
  tabyl(grade) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits=2)

 grade     n percent valid_percent
  10th  4907  24.54%        25.04%
  11th  4891  24.45%        24.96%
  12th  4577  22.88%        23.36%
   9th  5219  26.10%        26.64%
  <NA>   406   2.03%             -
 Total 20000 100.00%       100.00%

12 / 59

2x2 `tabyl`s

# default 2x2 table
demo_data %>% tabyl(grade, sex)

 grade Female Male NA_
  10th   2332 2539  36
  11th   2365 2496  30
  12th   2277 2263  37
   9th   2492 2684  43
  <NA>    126  195  85

What adornments does the tabyl to right have?

demo_data %>% tabyl(grade, sex) %>%
  adorn_percentages(denominator = "col") %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

 grade        Female           Male          NA_
  10th  24.3% (2332)  24.9%  (2539)  15.6%  (36)
  11th  24.7% (2365)  24.5%  (2496)  13.0%  (30)
  12th  23.7% (2277)  22.2%  (2263)  16.0%  (37)
   9th  26.0% (2492)  26.4%  (2684)  18.6%  (43)
  <NA>   1.3%  (126)   1.9%   (195)  36.8%  (85)
 Total 100.0% (9592) 100.0% (10177) 100.0% (231)

Notice grade is not sorted in a pleasing way. We will learn how to deal with this when we discuss factors as a data type in R.
Base R has a table function, but it is clunkier and the output is not a data frame.
See the tabyl vignette for more information, adorn options, & 3-way tabyls

13 / 59

Numerical data summaries: `summarize()`

We can summarize data as a whole, or in groups with group_by()
group_by() is very powerful, see data wrangling cheatsheet
Can also use summarize_at(), summarize_if(), summarize_all()

# summary of all data as a whole
demo_data %>% 
  summarize(bmi_mean =mean(bmi,na.rm=TRUE),
            bmi_sd = sd(bmi,na.rm=TRUE))

# A tibble: 1 x 2
  bmi_mean bmi_sd
     <dbl>  <dbl>
1     23.5   4.99

What does na.rm=TRUE do and what happens if we leave it out?

# summary by group variable
demo_data %>% 
  group_by(grade) %>%
  summarize(n_per_group = n(), 
            bmi_mean =mean(bmi,na.rm=TRUE),
            bmi_sd = sd(bmi,na.rm=TRUE))

# A tibble: 5 x 4
  grade n_per_group bmi_mean bmi_sd
  <chr>       <int>    <dbl>  <dbl>
1 <NA>          406     23.5   6.45
2 10th         4907     23.2   4.76
3 11th         4891     23.8   4.89
4 12th         4577     24.2   5.20
5 9th          5219     22.8   4.92

14 / 59

Combining data sets

be careful!!!
dplyr data transformation cheatsheet

15 / 59

Rows (cases): paste data below each other

bind_rows() combines rows from different data sets
& accounts for different column names

data1

# A tibble: 2 x 4
     id name  height   age
  <int> <chr>  <dbl> <dbl>
1     1 Nina       2     4
2     2 Yi         1     2

data2

# A tibble: 3 x 4
     id name  height years
  <int> <chr>  <dbl> <dbl>
1     7 Bo       2       3
2     8 Al       1.7     1
3     9 Juan     1.8     2

bind_rows(data1,data2, .id = "group")

# A tibble: 5 x 6
  group    id name  height   age years
  <chr> <int> <chr>  <dbl> <dbl> <dbl>
1 1         1 Nina     2       4    NA
2 1         2 Yi       1       2    NA
3 2         7 Bo       2      NA     3
4 2         8 Al       1.7    NA     1
5 2         9 Juan     1.8    NA     2

dplyr data transformation cheatsheet

16 / 59

Columns (variables): DO NOT USE `bind_cols()`!!

bind_cols() blindly pastes columns next to each other without preserving order of variables that they have in common
- Use join to preserve ordering - see next slides

# datasets must have same number of rows to use bind_cols()
demo_sub <- demo_data %>% slice(1:20) # first 20 rows of demo_data
qn_sub <- qn_data %>% slice(1:20)     # first 20 rows of qn_data
bind_cols(demo_sub, qn_sub)           # blindly bind columns; 2nd record column got renamed

# A tibble: 20 x 13
   record age   sex   grade race4 race7   bmi stweight record1 q8    q12  
    <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>    <dbl>   <dbl> <chr> <chr>
 1 9.32e5 15 y… Fema… 10th  White White  17.2     54.4  922375 Neve… <NA> 
 2 3.34e5 17 y… Fema… 12th  White White  20.2     57.2   36718 Neve… <NA> 
 3 3.63e4 18 y… Male  11th  Hisp… Hisp…  NA       NA    187667 Did … <NA> 
 4 1.10e6 15 y… Male  10th  Blac… Blac…  28.0     85.7  109951 Did … <NA> 
 5 1.30e6 14 y… Male  9th   All … Mult…  24.5     66.7 1109733 Neve… <NA> 
 6 2.62e5 17 y… Male  9th   All … <NA>   NA       NA    761127 Most… <NA> 
 7 9.27e5 16 y… Male  11th  All … Asian  20.5     70.3  930811 Neve… <NA> 
 8 1.31e6 17 y… Male  12th  White White  19.3     59.0  401921 Rare… <NA> 
 9 5.06e5 18 y… Male  12th  Hisp… Hisp…  33.1    123.    42244 Rare… <NA> 
10 1.80e5 14 y… Male  10th  Blac… Blac…  NA       NA    271375 Alwa… <NA> 
11 1.10e6 15 y… Male  9th   <NA>  <NA>   17.1     45.4  639521 Rare… <NA> 
12 1.31e6 16 y… Male  10th  Hisp… Hisp…  21.8     66.7 1309762 Did … 6 to…
13 3.12e4 15 y… Male  10th  All … <NA>   NA       NA   1313278 Did … All …
14 1.09e5 16 y… Male  11th  White White  NA       NA    640176 Did … <NA> 
15 1.81e5 15 y… Male  9th   White White  NA       NA    925706 Did … <NA> 
16 1.31e6 15 y… Fema… 10th  White White  22.0     65.8 1306447 Neve… I di…
17 7.70e5 16 y… Fema… 10th  White White  32.4     86.2   31079 Neve… <NA> 
18 9.38e5 18 y… Fema… 12th  White White  21.7     64.9  185891 Some… <NA> 
19 1.31e6 16 y… Male  11th  White White  28.3    102.   270777 Alwa… <NA> 
20 2.72e5 18 y… Fema… 11th  White White  NA       NA   1313665 Neve… 0 da…
# … with 2 more variables: q31 <chr>, qn24 <lgl>

17 / 59

`join`ing your data sets

Join uses overlapping or selected columns to combine two or more data sets.
Also called "merging" or "mutating join".
Function names are based off of SQL operations for databases.

dplyr data transformation cheatsheet

18 / 59

`join` options visually

Hiroaki Yutani

19 / 59

Most commonly used: left_join()left_join(x,y) includes all observations in x, regardless of whether they match ones in y or not.
It includes all columns in y, but only rows that match x's observations.
df1 <- tibble(a = c(1, 2), b = 2:1)
df2 <- tibble(a = c(1, 3), c = 10:11)
df1
df2

# A tibble: 2 x 2
      a     b
  <dbl> <int>
1     1     2
2     2     1
# A tibble: 2 x 2
      a     c
  <dbl> <int>
1     1    10
2     3    11
left_join(df1, df2)

# A tibble: 2 x 3
      a     b     c
  <dbl> <int> <int>
1     1     2    10
2     2     1    NA
Which common column(s) were used to merge the datasets?  
What if we want to specify which columns to join by when merging?
see next slide...

20 / 59

Which columns will be used to join?

If no columns are specified to join by, then all overlapping (intersecting) column names will be used
Often we want to specify which columns to use,
- and also how to rename duplicated columns that were not merged

dplyr data transformation cheatsheet

21 / 59

Check for overlapping column names

Goal: merge the demographics (demo_data) and questionnaire (qn_data) together.

What column names do these datasets have in common?

colnames(demo_data)

[1] "record"   "age"      "sex"      "grade"    "race4"    "race7"   
[7] "bmi"      "stweight"

colnames(qn_data)

[1] "record" "q8"     "q12"    "q31"    "qn24"

intersect(colnames(demo_data), colnames(qn_data))

[1] "record"

22 / 59

Merge `demo_data` and `qn_data` together

Let's do a full join so that we keep all data from both datasets

merged_data <- 
  full_join(demo_data, qn_data, 
            by = "record")
# Check dimensions of original and new datasets

dim(demo_data); dim(qn_data); dim(merged_data)

[1] 20000     8

[1] 10000     5

[1] 20000    12

merged_data

# A tibble: 20,000 x 12
   record age   sex   grade race4 race7   bmi stweight q8    q12   q31  
    <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>    <dbl> <chr> <chr> <chr>
 1 9.32e5 15 y… Fema… 10th  White White  17.2     54.4 Neve… <NA>  Yes  
 2 3.34e5 17 y… Fema… 12th  White White  20.2     57.2 <NA>  <NA>  <NA> 
 3 3.63e4 18 y… Male  11th  Hisp… Hisp…  NA       NA   Neve… <NA>  <NA> 
 4 1.10e6 15 y… Male  10th  Blac… Blac…  28.0     85.7 Neve… <NA>  Yes  
 5 1.30e6 14 y… Male  9th   All … Mult…  24.5     66.7 <NA>  <NA>  <NA> 
 6 2.62e5 17 y… Male  9th   All … <NA>   NA       NA   <NA>  <NA>  <NA> 
 7 9.27e5 16 y… Male  11th  All … Asian  20.5     70.3 <NA>  <NA>  <NA> 
 8 1.31e6 17 y… Male  12th  White White  19.3     59.0 Neve… 3 to… No   
 9 5.06e5 18 y… Male  12th  Hisp… Hisp…  33.1    123.  <NA>  <NA>  <NA> 
10 1.80e5 14 y… Male  10th  Blac… Blac…  NA       NA   Neve… <NA>  No   
# … with 19,990 more rows, and 1 more variable: qn24 <lgl>

23 / 59

Learn more about `join`ing data

24 / 59

Practice

Add a column of 1's to qn_data called qn_yes and save the resulting data as qn_data2.
Join demo_data and qn_data2 by column record. Keep all rows from demo_data and only rows from qn_data2 that match records in demo_data. Call the resulting data all_data.
Create a tabyl() of qn_yes for the data all_data.
Create a 2x2 table of qn_yes vs grade.

Note about the data:

q8 = How often wear bicycle helmet
q12 = Texted while driving
q31 = Ever smoked
qn24 = Bullied past 12 months

25 / 59

Reshaping data

wide vs. long

Allison Horst

tidyr

26 / 59

Wide vs. long data

Wide data has one row per subject, with multiple columns for their repeated measurements
Long data has multiple rows per subject, with one column for the measurement variable and another indicating from when/where the repeated measures are from

wide

long

27 / 59

Example wide dataset

Copy and paste the code below into R to create this example dataset

BP_wide <- tibble(id = letters[1:4],
                     sex = c("F", "M", "M", "F"),
                     SBP_v1 = c(130, 120, 130, 119),
                     SBP_v2 = c(110, 116, 136, 106),
                     SBP_v3 = c(112, 122, 138, 118))
BP_wide

# A tibble: 4 x 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

What do you think the data in the table are measures of?
How can we tell the data are wide?

28 / 59

Wide to long: `gather()`

BP_wide

# A tibble: 4 x 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

gather columns into rows to make the data long. Need to specify:

new column names
- key: stores row names of wide data's gathered columns
- value: stores data values
which columns to gather

BP_long <- BP_wide %>% 
  gather(key = "visit", value = "SBP", 
         SBP_v1:SBP_v3)
BP_long

# A tibble: 12 x 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 b     M     SBP_v1   120
 3 c     M     SBP_v1   130
 4 d     F     SBP_v1   119
 5 a     F     SBP_v2   110
 6 b     M     SBP_v2   116
 7 c     M     SBP_v2   136
 8 d     F     SBP_v2   106
 9 a     F     SBP_v3   112
10 b     M     SBP_v3   122
11 c     M     SBP_v3   138
12 d     F     SBP_v3   118

29 / 59

Long to wide: `spread()`

BP_long

# A tibble: 12 x 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 b     M     SBP_v1   120
 3 c     M     SBP_v1   130
 4 d     F     SBP_v1   119
 5 a     F     SBP_v2   110
 6 b     M     SBP_v2   116
 7 c     M     SBP_v2   136
 8 d     F     SBP_v2   106
 9 a     F     SBP_v3   112
10 b     M     SBP_v3   122
11 c     M     SBP_v3   138
12 d     F     SBP_v3   118

spread rows into columns to make the data wide. Need to specify which columns in the long data to use:

key column: has the variable names
value column: has the data values

BP_wide2 <- BP_long %>% 
  spread(key = "visit", value = "SBP")
BP_wide2

# A tibble: 4 x 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

30 / 59

Clean up long data's visit column (key column)

BP_long

# A tibble: 12 x 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 b     M     SBP_v1   120
 3 c     M     SBP_v1   130
 4 d     F     SBP_v1   119
 5 a     F     SBP_v2   110
 6 b     M     SBP_v2   116
 7 c     M     SBP_v2   136
 8 d     F     SBP_v2   106
 9 a     F     SBP_v3   112
10 b     M     SBP_v3   122
11 c     M     SBP_v3   138
12 d     F     SBP_v3   118

Goal: remove the string SBP_v from the visit variable's values.

BP_long2 <- BP_long %>% 
  mutate(visit = 
           str_replace(visit,"SBP_v","")) 
BP_long2

# A tibble: 12 x 4
   id    sex   visit   SBP
   <chr> <chr> <chr> <dbl>
 1 a     F     1       130
 2 b     M     1       120
 3 c     M     1       130
 4 d     F     1       119
 5 a     F     2       110
 6 b     M     2       116
 7 c     M     2       136
 8 d     F     2       106
 9 a     F     3       112
10 b     M     3       122
11 c     M     3       138
12 d     F     3       118

31 / 59

Make cleaned-up long data wide

head(BP_long2, 2)

# A tibble: 2 x 4
  id    sex   visit   SBP
  <chr> <chr> <chr> <dbl>
1 a     F     1       130
2 b     M     1       120

BP_wide3 <- BP_long2 %>% 
  spread(key = "visit", value = "SBP")
BP_wide3

# A tibble: 4 x 5
  id    sex     `1`   `2`   `3`
  <chr> <chr> <dbl> <dbl> <dbl>
1 a     F       130   110   112
2 b     M       120   116   122
3 c     M       130   136   138
4 d     F       119   106   118

Problem: have numbers as column names, since spread's default is to use the levels of the key as the new row names.

Solution: have row names start with the key column's name separated by a character

BP_wide4 <- BP_long2 %>% 
  spread(key = "visit", value = "SBP",
         sep="_") # specify separating character
BP_wide4

# A tibble: 4 x 5
  id    sex   visit_1 visit_2 visit_3
  <chr> <chr>   <dbl>   <dbl>   <dbl>
1 a     F         130     110     112
2 b     M         120     116     122
3 c     M         130     136     138
4 d     F         119     106     118

32 / 59

Practice

Copy and paste the code below into R to create the dataset DBP_wide

DBP_wide <- tibble(id = letters[1:4],
                  sex = c("F", "M", "M", "F"),
                  v1.DBP = c(88, 84, 102, 70),
                  v2.DBP = c(78, 78, 96, 76),
                  v3.DBP = c(94, 82, 94, 74),
                  age=c(23, 56, 41, 38)
                  )

Make DBP_wide into a long dataframe based on the repeated DBP columns and save it as DBP_long.
Clean up the visit column of DBP_long so that the values are 1, 2, 3, and save it as DBP_long.
Make DBP_long wide with column names visit.1, visit.2, visit.3 for the DBP values, and save it as DBP_wide2.
Join DBP_long with BP_long2 so that we have one data frame with columns id, sex, visit, SBP, DBP, and age. Save this as BP_both_long.

33 / 59

Data cleaning

(messy NAs, names, strings, dates, factors)

34 / 59

Removing missing data: `drop_na()`

A small data example:

mydata <- tibble(id = 7:9, 
                 name = c("Bo","Al","Juan"), 
                 height = c(2, NA, 1.8), 
                 years = c(51,35,NA))
mydata

# A tibble: 3 x 4
     id name  height years
  <int> <chr>  <dbl> <dbl>
1     7 Bo       2      51
2     8 Al      NA      35
3     9 Juan     1.8    NA

Remove all rows with any missing data

mydata %>% drop_na()

# A tibble: 1 x 4
     id name  height years
  <int> <chr>  <dbl> <dbl>
1     7 Bo         2    51

Remove rows with NA in selected columns

mydata %>% drop_na(height)

# A tibble: 2 x 4
     id name  height years
  <int> <chr>  <dbl> <dbl>
1     7 Bo       2      51
2     9 Juan     1.8    NA

35 / 59

Replace `NA`s with another value: `replace_na()`

Use with mutate()

mydata

# A tibble: 3 x 4
     id name  height years
  <int> <chr>  <dbl> <dbl>
1     7 Bo       2      51
2     8 Al      NA      35
3     9 Juan     1.8    NA

mydata %>% 
  mutate(height = replace_na(height, "Unknown"),
         years = replace_na(years, 0) )

# A tibble: 3 x 4
     id name  height  years
  <int> <chr> <chr>   <dbl>
1     7 Bo    2          51
2     8 Al    Unknown    35
3     9 Juan  1.8         0

36 / 59

`replace_na()` advanced example

Replaces NAs in all columns starting with "q" with the string "No answer"

qn_data %>% 
  mutate_at(vars(starts_with("q")),
            .funs = list(~replace_na(.,"No answer"))) %>%
  tabyl(q8, q31)

                             q8   No No answer  Yes
           Always wore a helmet  411        58  216
         Did not ride a bicycle 1318       343 1223
 Most of the time wore a helmet  320        31  173
            Never wore a helmet 1429       418 1672
                      No answer  267       212  300
           Rarely wore a helmet  481        73  405
        Sometimes wore a helmet  351        53  246

37 / 59

Convert (i.e. "No answer", 9999, etc) to `NA`: `na_if()`

all_data %>% tabyl(race4)

                     race4    n percent valid_percent
           All other races 4713 0.23565     0.2443235
 Black or African American 4093 0.20465     0.2121825
           Hispanic/Latino 4670 0.23350     0.2420943
                     White 5814 0.29070     0.3013997
                      <NA>  710 0.03550            NA

all_data %>%
  mutate(race4 = na_if(race4, "All other races")) %>%
  tabyl(race4)

                     race4    n percent valid_percent
 Black or African American 4093 0.20465     0.2807848
           Hispanic/Latino 4670 0.23350     0.3203677
                     White 5814 0.29070     0.3988475
                      <NA> 5423 0.27115            NA

38 / 59

`na_if()` for all your data

Avoid this by reading in your data correctly:

smalldata <- read_csv("data/small_data.csv",
                      na = c("","9999","NA")) # specify your own missing values

Otherwise na_if() everything:

# replace all "" with NA
all_data %>%
  mutate_if(is.character, .funs = na_if(.,""))
# replace all 9999's with NA 
all_data %>%
  mutate_if(is.numeric, .funs = na_if(.,9999))

39 / 59

Working with character strings

Use the package stringr (loaded with tidyverse)
Paste strings or values together with package glue (installed, not loaded w/ tidyverse)
advanced tip: learn "regular expressions" (regex) for pattern matching (see cheatsheet) and matching multiple characters/strings at once

stringr cheatsheet

40 / 59

`str_detect()` find strings

mydata <- tibble(name = c("J.M.","Ella","Jay"), state = c("New Mexico","New York","Oregon"))

Filter based on string detection

mydata %>% filter(str_detect(name,"J"))

# A tibble: 2 x 2
  name  state     
  <chr> <chr>     
1 J.M.  New Mexico
2 Jay   Oregon

Creates a column of TRUE/FALSE if detected

mydata %>% mutate(
  new_state = str_detect(state,"New"))

# A tibble: 3 x 3
  name  state      new_state
  <chr> <chr>      <lgl>    
1 J.M.  New Mexico TRUE     
2 Ella  New York   TRUE     
3 Jay   Oregon     FALSE

41 / 59

`str_replace_all()`, `str_replace()`

mydata %>% mutate(state_old = str_replace_all(state, "New", "Old"))

# A tibble: 3 x 3
  name  state      state_old 
  <chr> <chr>      <chr>     
1 J.M.  New Mexico Old Mexico
2 Ella  New York   Old York  
3 Jay   Oregon     Oregon

mydata %>% mutate(
  name2 = str_replace(name, "l", "-"),           # first instance
  name3 = str_replace_all(name, "l", "-"),       # all instances
  name4 = str_replace_all(name, fixed("."), "")) # special characters with fixed()

# A tibble: 3 x 5
  name  state      name2 name3 name4
  <chr> <chr>      <chr> <chr> <chr>
1 J.M.  New Mexico J.M.  J.M.  JM   
2 Ella  New York   E-la  E--a  Ella 
3 Jay   Oregon     Jay   Jay   Jay

42 / 59

`str_sub()`: shorten strings

Based on position 1 (start = 1) to length of string (end = -1)

mydata %>% mutate(
  short_name  = str_sub(name, start = 1, end = 2),   # specify start to end
  short_name2 = str_sub(name, end = 2),              # specify only end
  short_state = str_sub(state, end = -3)             # negative endices, from end
  )

# A tibble: 3 x 5
  name  state      short_name short_name2 short_state
  <chr> <chr>      <chr>      <chr>       <chr>      
1 J.M.  New Mexico J.         J.          New Mexi   
2 Ella  New York   El         El          New Yo     
3 Jay   Oregon     Ja         Ja          Oreg

43 / 59

Paste strings together with `glue()`

paste() is the base R way of pasting strings (surprise, it's hard to use)
glue() is most useful when pasting data columns together
column names or function operations go inside {}
See the glue vignette

all_data %>%
  mutate(info = glue("Student {record} is {age} with BMI = {round(bmi,1)}")) %>%
  select(record, info) %>% head(5)

# A tibble: 5 x 2
   record info                                                
    <dbl> <S3: glue>                                          
1  931897 Student 931897 is 15 years old with BMI = 17.2      
2  333862 Student 333862 is 17 years old with BMI = 20.2      
3   36253 Student 36253 is 18 years old or older with BMI = NA
4 1095530 Student 1095530 is 15 years old with BMI = 28       
5 1303997 Student 1303997 is 14 years old with BMI = 24.5

44 / 59

Using `glue` to summarize data

Useful for tables (will cover this more in another session)
Example, calculate the S.E. of the mean and create a column with "mean (SE)" of bmi:

demo_data %>% 
  group_by(sex) %>%
  summarize(n_sex = n(),
            bmi_mean = mean(bmi,na.rm=TRUE),
            bmi_sd = sd(bmi,na.rm=TRUE)) %>%
  mutate(bmi_mean_se = glue("{round(bmi_mean,1)} ({signif(bmi_sd/sqrt(n_sex),2)})"))

# A tibble: 3 x 5
  sex    n_sex bmi_mean bmi_sd bmi_mean_se 
  <chr>  <int>    <dbl>  <dbl> <S3: glue>  
1 <NA>     231    NaN   NaN    NaN (NaN)   
2 Female  9592     23.3   4.96 23.3 (0.051)
3 Male   10177     23.7   5.01 23.7 (0.05)

45 / 59

Wrangle dates with `lubridate`

Convert characters to special "Date" type
Convert terrible excel date formats into workable data
Easy date magic examples:
- add and subtract dates
- convert to minutes/years/etc
- change timezones
- add 1 month to a date...
lubridate cheat sheet
read_csv and read_excel etc automatically import dates correctly

Allison Horst

46 / 59

What kind of date do you have?
lubridate cheat sheet
timedata <- 
 tibble(name = c("Yi","Bo","DJ"), 
        dob=c("10/31/1952","1/12/1984","2/02/2002"))
timedata %>% 
  mutate(dob_date = mdy(dob),
         dob_wrong = dmy(dob)) # wrong order

# A tibble: 3 x 4
  name  dob        dob_date   dob_wrong 
  <chr> <chr>      <date>     <date>    
1 Yi    10/31/1952 1952-10-31 NA        
2 Bo    1/12/1984  1984-01-12 1984-12-01
3 DJ    2/02/2002  2002-02-02 2002-02-02
47 / 59

Math with dates

timedata %>% mutate(
  dob = mdy(dob),                            # convert to a date
  dob_year = year(dob),                      # extract the year
  time_since_birth = dob %--% today(),       # create an "interval"
  age = time_since_birth %/% years(1),       # modulus on "years"
  dobplus = dob + days(10)                   # add 10 days
  )

# A tibble: 3 x 6
  name  dob        dob_year time_since_birth                 age dobplus   
  <chr> <date>        <dbl> <S4: Interval>                 <dbl> <date>    
1 Yi    1952-10-31     1952 1952-10-31 UTC--2019-05-23 UTC    66 1952-11-10
2 Bo    1984-01-12     1984 1984-01-12 UTC--2019-05-23 UTC    35 1984-01-22
3 DJ    2002-02-02     2002 2002-02-02 UTC--2019-05-23 UTC    17 2002-02-12

48 / 59

Factors - categorical data

Clean and order factors with forcats package
Will go over this for ggplot2 (visualization), statistical modeling (i.e. for lm()), and probably a workshop for creating tables
See forcats cheatsheet and forcats vignette

49 / 59

forcats examples - specify levels fct_relevel()mydata <- tibble(
  id = 1:4, 
  grade=c("9th","10th","11th","9th")) %>%
  mutate(grade_fac = factor(grade))
levels(mydata$grade_fac)

[1] "10th" "11th" "9th"
mydata %>% arrange(grade_fac)

# A tibble: 4 x 3
     id grade grade_fac
  <int> <chr> <fct>    
1     2 10th  10th     
2     3 11th  11th     
3     1 9th   9th      
4     4 9th   9th
mydata <- mydata %>% 
  mutate(
    grade_fac =
      fct_relevel(grade_fac,
                  c("9th","10th","11th")))
levels(mydata$grade_fac)

[1] "9th"  "10th" "11th"
mydata %>% arrange(grade_fac)

# A tibble: 4 x 3
     id grade grade_fac
  <int> <chr> <fct>    
1     1 9th   9th      
2     4 9th   9th      
3     2 10th  10th     
4     3 11th  11th
50 / 59

`forcats` examples - collapse levels

mydata <- tibble(loc = c("SW","NW","NW","NE","SE","SE"))
mydata %>% mutate(
  loc_fac = factor(loc),
  loc2 = fct_collapse(loc_fac,                         # collapse levels
                      south = c("SW","SE"),
                      north = c("NE","NW")),
  loc3 = fct_lump(loc_fac, n=2, other_level = "other") # most common 2 levels + other
  )

# A tibble: 6 x 4
  loc   loc_fac loc2  loc3 
  <chr> <fct>   <fct> <fct>
1 SW    SW      south other
2 NW    NW      north NW   
3 NW    NW      north NW   
4 NE    NE      north other
5 SE    SE      south SE   
6 SE    SE      south SE

51 / 59

Other "janitor" work

52 / 59

Clean messy column names with `clean_names()`

mydata <- tibble("First Name"= c("Yi","DJ"), "last init" = c("C","R"),
                 "% in" = c(0.1, 0.5), "ñ$$$"= 1:2, " "=3:2,"     hi"=c("a","b"), 
                 "null"=c(NA,NA))
mydata

# A tibble: 2 x 7
  `First Name` `last init` `% in` `ñ$$$`   ` ` `     hi` null 
  <chr>        <chr>        <dbl>  <int> <int> <chr>     <lgl>
1 Yi           C              0.1      1     3 a         NA   
2 DJ           R              0.5      2     2 b         NA

mydata %>% clean_names() %>%        # in the janitor package
  remove_empty(c("rows","cols"))    # also useful

# A tibble: 2 x 6
  first_name last_init percent_in     n     x hi   
  <chr>      <chr>          <dbl> <int> <int> <chr>
1 Yi         C                0.1     1     3 a    
2 DJ         R                0.5     2     2 b

53 / 59

Clean names of your excel sheet

library(readxl)
read_excel("data/messy_names.xlsx", .name_repair = janitor::make_clean_names)

# A tibble: 20,000 x 9
    record age   grade_string race  race_2 bmi_kg_m_2 weight_kg x     x_2  
     <dbl> <chr> <chr>        <chr> <chr>       <dbl>     <dbl> <chr> <chr>
 1  931897 15 y… 10th         White White        17.2      54.4 some… <NA> 
 2  333862 17 y… 12th         White White        20.2      57.2 notes <NA> 
 3   36253 18 y… 11th         Hisp… Hispa…       NA        NA   I th… <NA> 
 4 1095530 15 y… 10th         Blac… Black…       28.0      85.7 why   WHY? 
 5 1303997 14 y… 9th          All … Multi…       24.5      66.7 <NA>  <NA> 
 6  261619 17 y… 9th          All … <NA>         NA        NA   <NA>  <NA> 
 7  926649 16 y… 11th         All … Asian        20.5      70.3 <NA>  <NA> 
 8 1309082 17 y… 12th         White White        19.3      59.0 <NA>  <NA> 
 9  506337 18 y… 12th         Hisp… Hispa…       33.1     123.  <NA>  <NA> 
10  180494 14 y… 10th         Blac… Black…       NA        NA   <NA>  <NA> 
# … with 19,990 more rows

54 / 59

Practice

Copy and paste the code below into R to create the dataset messy_data

messy_data <- tibble(NAME = c("J N","A C","D E"), 
                     `months follow up` = c("", 10, 11), 
                     `Date of visit` = c("July 31, 2003", "Nov 12, 2005", "Aug 3, 2007"))

Clean column names with clean_names().
Replace missing ("") data in months_follow_up with NA.
Convert months_follow_up to a numeric variable.
Convert date_of_visit to a date.
Create a column called date_last_visit that is the date of visit plus months of follow up.
Remove rows (cases) with missing data in months_follow_up.
Remove the spaces in name.

55 / 59

This was a lot, but learning R gets easier!

Allison Horst

56 / 59

Resources - `tidyverse` & data wrangling

Links

Some of this is drawn from materials in online books/lessons:

R for Data Science - by Garrett Grolemund & Hadley Wickham
Modern Dive - An Introduction to Statistical and Data Sciences via R by Chester Ismay & Albert Kim
A gRadual intRoduction to the tidyverse - Workshop for Cascadia R 2017 by Chester Ismay and Ted Laderas
"Tidy Data" by Hadley Wickham

57 / 59

Possible future workshop topics:

reproducible reports in R (probably next)
- R Markdown and knitr
- Create dynamic Word, html, pdf documents with code + output
tables
ggplot2 visualization
advanced tidyverse: functions, purrr
statistical modeling in R

Fill out feedback forms to suggest more or help us prioritize!

58 / 59

Contact info:

Jessica Minnier: minnier@ohsu.edu

Meike Niederhausen: niederha@ohsu.edu

This workshop info:

Code for these slides on github: jminnier/berd_r_courses
all the R code in an R script
answers to practice problems can be found here: html

59 / 59

Load files for today's workshop

Open the slides of this workshop: bit.ly/berd_tidy2
If you haven't already,
- Open the pre-workshop homework
- Follow steps 1-5
- Download zip folder
- Open berd_tidyverse_project.Rproj
Open a new R script and run the commands to the right

# install.packages("tidyverse","janitor","glue")  
library(tidyverse)
library(lubridate)  
library(janitor)
library(glue)
demo_data <- read_csv("data/yrbss_demo.csv")
qn_data <- read_csv("data/yrbss_qn.csv")

Allison Horst

2 / 59

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Data Wrangling in R with the Tidyverse (Part 2)

Jessica Minnier, PhD & Meike Niederhausen, PhDOCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop

2019/04/18 (Part 1) & 2019/04/25 (Part 2) and again! 2019/05/16 (Part 1) & 2019/05/23 (Part 2) slides: bit.ly/berd_tidy2 pdf: bit.ly/berd_tidy2_pdf

Load files for today's workshop

Learning objectives

More data wrangling

Review Part 1: manipulating data

Columns (part 1 slides)

Rows (part 1 slides)

Changing all columns at once (part 1 slides)

Changing some columns at once (part 1 slides)

Changing some columns at once (part 1 slides)

Add one or more rows: add_row()

Add one or more columns: add_column()

Quick tips on summarizing data

categorical data

numerical data

Frequency tables: janitor package's tabyl function

2x2 tabyls

Numerical data summaries: summarize()

Combining data sets

Rows (cases): paste data below each other

Columns (variables): DO NOT USE bind_cols()!!

joining your data sets

join options visually

Most commonly used: left_join()

Which columns will be used to join?

Check for overlapping column names

Merge demo_data and qn_data together

Learn more about joining data

Practice

Reshaping data

Wide vs. long data

Example wide dataset

Wide to long: gather()

Long to wide: spread()

Clean up long data's visit column (key column)

Make cleaned-up long data wide

Practice

Data cleaning

(messy NAs, names, strings, dates, factors)

Removing missing data: drop_na()

Replace NAs with another value: replace_na()

replace_na() advanced example

Convert (i.e. "No answer", 9999, etc) to NA: na_if()

na_if() for all your data

Working with character strings

str_detect() find strings

str_replace_all(), str_replace()

str_sub(): shorten strings

Paste strings together with glue()

Using glue to summarize data

Wrangle dates with lubridate

What kind of date do you have?

Math with dates

Factors - categorical data

forcats examples - specify levels fct_relevel()

forcats examples - collapse levels

Other "janitor" work

Clean messy column names with clean_names()

Clean names of your excel sheet

Practice

This was a lot, but learning R gets easier!

Resources - tidyverse & data wrangling

Possible future workshop topics:

Contact info:

This workshop info:

Load files for today's workshop

Help

Jessica Minnier, PhD & Meike Niederhausen, PhD
OCTRI Biostatistics, Epidemiology, Research & Design (BERD) Workshop

2019/04/18 (Part 1) & 2019/04/25 (Part 2)
and again! 2019/05/16 (Part 1) & 2019/05/23 (Part 2)

slides: bit.ly/berd_tidy2
pdf: bit.ly/berd_tidy2_pdf

Add one or more rows: `add_row()`

Add one or more columns: `add_column()`

Frequency tables: `janitor` package's `tabyl` function

2x2 `tabyl`s

Numerical data summaries: `summarize()`

Columns (variables): DO NOT USE `bind_cols()`!!

`join`ing your data sets

`join` options visually

Most commonly used: `left_join()`

Merge `demo_data` and `qn_data` together

Learn more about `join`ing data

Wide to long: `gather()`

Long to wide: `spread()`

Removing missing data: `drop_na()`

Replace `NA`s with another value: `replace_na()`

`replace_na()` advanced example

Convert (i.e. "No answer", 9999, etc) to `NA`: `na_if()`

`na_if()` for all your data

`str_detect()` find strings

`str_replace_all()`, `str_replace()`

`str_sub()`: shorten strings

Paste strings together with `glue()`

Using `glue` to summarize data

Wrangle dates with `lubridate`

`forcats` examples - specify levels `fct_relevel()`

`forcats` examples - collapse levels

Clean messy column names with `clean_names()`

Resources - `tidyverse` & data wrangling