| 1 | +% Managing Data Frames with `dplyr` |
| 2 | +% Biostatistics 140.776 |
| 3 | +% |
| 4 | + |
| 5 | +```{r, echo=FALSE, results="hide"} |
| 6 | +options(width = 50) |
| 7 | +``` |
| 8 | + |
| 9 | +# dplyr |
| 10 | + |
| 11 | +The data frame is a key data structure in statistics and in R. |
| 12 | + |
| 13 | +* There is one observation per row |
| 14 | + |
| 15 | +* Each column represents a variable or measure or characteristic |
| 16 | + |
| 17 | +* The primary implementation you will use is the default R |
| 18 | + implementation |
| 19 | + |
| 20 | +* Other implementations exist, particularly in relational database systems |
| 21 | + |
| 22 | + |
| 23 | +# dplyr |
| 24 | + |
| 25 | +* Developed by Hadley Wickham of RStudio |
| 26 | + |
| 27 | +* An optimized and distilled version of the `plyr` package (also by Hadley) |
| 28 | + |
| 29 | +* Does not provide any "new" functionality per se, but **greatly** |
| 30 | + simplifies existing functionality in R |
| 31 | + |
| 32 | +* Provides a "grammar" (in particular, verbs) for data manipulation |
| 33 | + |
| 34 | +* Is **very** fast, as many key operations are coded in C++ |
| 35 | + |
| 36 | + |
| 37 | +# dplyr Verbs |
| 38 | + |
| 39 | +* `select`: return a subset of the columns of a data frame |
| 40 | + |
| 41 | +* `filter`: extract a subset of rows from a data frame based on |
| 42 | + logical conditions |
| 43 | + |
| 44 | +* `arrange`: reorder rows of a data frame |
| 45 | + |
| 46 | + |
| 47 | +* `rename`: rename variables in a data frame |
| 48 | + |
| 49 | +* `mutate`: add new variables/columns or transform existing variables |
| 50 | + |
| 51 | +* `summarise` / `summarize`: generate summary statistics of different |
| 52 | + variables in the data frame, possibly within strata |
| 53 | + |
| 54 | +There is also a handy `print` method that prevents you from printing a |
| 55 | +lot of data to the console. |
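
As a rough illustration of that print method: in current dplyr versions the compact printing applies to tibbles rather than to plain data frames. The sketch below assumes `as_tibble()` (re-exported by dplyr from the tibble package) and uses the built-in `mtcars` data, which is not part of these slides.

```r
library(dplyr)

# mtcars is a built-in data set, used here only for illustration
cars_tbl <- as_tibble(mtcars)

# A tibble prints only its first rows and the columns that fit,
# instead of flooding the console
print(cars_tbl)

# The data are unchanged; only the class (and therefore printing) differs
dim(cars_tbl)
class(cars_tbl)
```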
| 56 | + |
| 57 | + |
| 58 | + |
| 59 | +# dplyr Properties |
| 60 | + |
| 61 | +* The first argument is a data frame. |
| 62 | + |
| 63 | +* The subsequent arguments describe what to do with it, and you can |
| 64 | + refer to columns in the data frame directly without using the $ |
| 65 | + operator (just use the names). |
| 66 | + |
| 67 | +* The result is a new data frame |
| 68 | + |
| 69 | +* Data frames must be properly formatted and annotated for this to all |
| 70 | + be useful |
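
The properties above can be sketched with a small, made-up data frame (the names `df`, `id`, `temp`, and `hot` are hypothetical and do not come from the Chicago data used later):

```r
library(dplyr)

# Hypothetical toy data frame, not part of the original slides
df <- data.frame(id = 1:4, temp = c(61, 85, 92, 70))

# The first argument is the data frame; `temp` is referenced by its
# bare name, with no need for df$temp
hot <- filter(df, temp > 80)

# The result is a new data frame; the input is left unchanged
nrow(df)
nrow(hot)
class(hot)
```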
| 71 | + |
| 72 | + |
| 73 | +# Load the `dplyr` package |
| 74 | + |
| 75 | + |
| 76 | +This step is important! |
| 77 | + |
| 78 | +```{r} |
| 79 | +library(dplyr) |
| 80 | +``` |
| 81 | + |
| 82 | + |
| 83 | +# `select` |
| 84 | + |
| 85 | +```{r} |
| 86 | +chicago <- readRDS("chicago.rds") |
| 87 | +dim(chicago) |
| 88 | +head(select(chicago, 1:5)) |
| 89 | +``` |
| 90 | + |
| 91 | + |
| 92 | +# `select` |
| 93 | + |
| 94 | +```{r} |
| 95 | +names(chicago)[1:3] |
| 96 | +head(select(chicago, city:dptp)) |
| 97 | +``` |
| 98 | + |
| 99 | +# `select` |
| 100 | + |
| 101 | +In dplyr you can do |
| 102 | + |
| 103 | +```{r,eval=FALSE} |
| 104 | +head(select(chicago, -(city:dptp))) |
| 105 | +``` |
| 106 | + |
| 107 | +Equivalent base R |
| 108 | + |
| 109 | +```{r,eval=FALSE} |
| 110 | +i <- match("city", names(chicago)) |
| 111 | +j <- match("dptp", names(chicago)) |
| 112 | +head(chicago[, -(i:j)]) |
| 113 | +``` |
| 114 | + |
| 115 | + |
| 116 | + |
| 117 | +# `filter` |
| 118 | + |
| 119 | +```{r} |
| 120 | +chic.f <- filter(chicago, pm25tmean2 > 30) |
| 121 | +head(select(chic.f, 1:3, pm25tmean2), 10) |
| 122 | +``` |
| 123 | + |
| 124 | +# `filter` |
| 125 | + |
| 126 | +```{r} |
| 127 | +chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80) |
| 128 | +head(select(chic.f, 1:3, pm25tmean2, tmpd), 10) |
| 129 | +``` |
| 130 | + |
| 131 | + |
| 132 | +# `arrange` |
| 133 | + |
| 134 | +Reordering rows of a data frame (while preserving corresponding order |
| 135 | +of other columns) is normally a pain to do in R. |
| 136 | + |
| 137 | +```{r} |
| 138 | +chicago <- arrange(chicago, date) |
| 139 | +head(select(chicago, date, pm25tmean2), 3) |
| 140 | +tail(select(chicago, date, pm25tmean2), 3) |
| 141 | +``` |
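
For comparison, here is a minimal sketch of the same kind of reordering in base R, using `order()` on a made-up data frame (the `df` object and its values are hypothetical, not the Chicago data):

```r
library(dplyr)

# Hypothetical toy data with dates out of order
df <- data.frame(date = as.Date(c("2005-01-03", "2005-01-01", "2005-01-02")),
                 pm25 = c(12.1, 15.0, 9.8))

# Base R: compute the row permutation, then index all columns with it
base_sorted <- df[order(df$date), ]

# dplyr: name the column to sort by
dplyr_sorted <- arrange(df, date)

# Both approaches reorder every column consistently
all.equal(base_sorted$pm25, dplyr_sorted$pm25)
```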
| 142 | + |
| 143 | +# `arrange` |
| 144 | + |
| 145 | +Rows can be arranged in descending order too. |
| 146 | + |
| 147 | +```{r} |
| 148 | +chicago <- arrange(chicago, desc(date)) |
| 149 | +head(select(chicago, date, pm25tmean2), 3) |
| 150 | +tail(select(chicago, date, pm25tmean2), 3) |
| 151 | +``` |
| 152 | + |
| 153 | + |
| 154 | +# `rename` |
| 155 | + |
| 156 | +Renaming a variable in a data frame in R is surprisingly hard to do! |
| 157 | + |
| 158 | +```{r,tidy=FALSE} |
| 159 | +head(chicago[, 1:5], 3) |
| 160 | +chicago <- rename(chicago, dewpoint = dptp, |
| 161 | + pm25 = pm25tmean2) |
| 162 | +head(chicago[, 1:5], 3) |
| 163 | +``` |
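
To make the comparison concrete, one common base R idiom for renaming a single column is sketched below on a made-up data frame (the `df` object and its values are hypothetical); `rename()` expresses the same change as `new_name = old_name`.

```r
library(dplyr)

# Hypothetical toy data frame reusing two column names from the slides
df <- data.frame(city = "chic", tmpd = c(31.5, 33.0), dptp = c(31.5, 29.9))

# Base R: locate the column by name, then assign the new name
base_renamed <- df
names(base_renamed)[names(base_renamed) == "dptp"] <- "dewpoint"

# dplyr: new name on the left, old name on the right
dplyr_renamed <- rename(df, dewpoint = dptp)

identical(names(base_renamed), names(dplyr_renamed))
```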
| 164 | + |
| 165 | + |
| 166 | +# `mutate` |
| 167 | + |
| 168 | +```{r, tidy=FALSE} |
| 169 | +chicago <- mutate(chicago, |
| 170 | + pm25detrend=pm25-mean(pm25, na.rm=TRUE)) |
| 171 | +head(select(chicago, pm25, pm25detrend)) |
| 172 | +``` |
| 173 | + |
| 174 | +# `group_by` |
| 175 | + |
| 176 | +Generating summary statistics by stratum |
| 177 | + |
| 178 | +```{r, tidy=FALSE} |
| 179 | +chicago <- mutate(chicago, |
| 180 | + tempcat = factor(1 * (tmpd > 90), |
| 181 | + labels = c("cold", "hot"))) |
| 182 | +hotcold <- group_by(chicago, tempcat) |
| 183 | +summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE), |
| 184 | + o3 = max(o3tmean2, na.rm = TRUE), |
| 185 | + no2 = median(no2tmean2, na.rm = TRUE)) |
| 186 | +``` |
| 187 | + |
| 188 | + |
| 189 | +# `group_by` |
| 190 | + |
| 191 | +Generating summary statistics by stratum |
| 192 | + |
| 193 | +```{r, tidy=FALSE} |
| 194 | +chicago <- mutate(chicago, |
| 195 | + year = as.POSIXlt(date)$year + 1900) |
| 196 | +years <- group_by(chicago, year) |
| 197 | +summarize(years, pm25 = mean(pm25, na.rm = TRUE), |
| 198 | + o3 = max(o3tmean2, na.rm = TRUE), |
| 199 | + no2 = median(no2tmean2, na.rm = TRUE)) |
| 200 | +``` |
| 201 | + |
| 202 | +```{r,echo=FALSE} |
| 203 | +chicago$year <- NULL ## Can't use mutate to create an existing variable |
| 204 | +``` |
| 205 | + |
| 206 | + |
| 207 | +# `%>%` |
| 208 | + |
| 209 | +```{r,tidy=FALSE,eval=FALSE} |
| 210 | +chicago %>% mutate(year = as.POSIXlt(date)$year + 1900) %>% |
| 211 | + group_by(year) %>% |
| 212 | + summarize(pm25 = mean(pm25, na.rm = TRUE), |
| 213 | + o3 = max(o3tmean2, na.rm = TRUE), |
| 214 | + no2 = median(no2tmean2, na.rm = TRUE)) |
| 215 | +``` |
| 216 | + |
| 217 | +```{r,echo=FALSE} |
| 218 | +chicago %>% mutate(year = as.POSIXlt(date)$year + 1900) %>% group_by(year) %>% |
| 219 | +summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2, na.rm = TRUE), no2 = median(no2tmean2, na.rm = TRUE)) |
| 220 | + |
| 221 | +``` |
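
As a small sketch of what the pipe does, the nested and piped forms below should give the same result on a made-up data frame (the `df` object and its values are hypothetical). `x %>% f(y)` is evaluated as `f(x, y)`, so each step's output becomes the first argument of the next verb.

```r
library(dplyr)

# Hypothetical toy data, not the Chicago data set
df <- data.frame(year = c(2004, 2004, 2005, 2005),
                 pm25 = c(10, 14, NA, 20))

# Nested calls: read from the inside out
nested <- summarize(group_by(df, year), pm25 = mean(pm25, na.rm = TRUE))

# Piped calls: read top to bottom; %>% must end a line, not start one
piped <- df %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE))

all.equal(nested, piped)
```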
| 222 | + |
| 223 | + |
| 224 | +# dplyr |
| 225 | + |
| 226 | +Once you learn the dplyr "grammar" there are a few additional benefits |
| 227 | + |
| 228 | +* dplyr can work with other data frame "backends" |
| 229 | + |
| 230 | +* `data.table` for large fast tables |
| 231 | + |
| 232 | +* SQL interface for relational databases via the DBI package |
| 233 | + |
| 234 | + |