Commit 1301a45

Add dplyr material
1 parent 493a62f commit 1301a45

4 files changed: +668 -0 lines changed

03_GettingData/dplyr/chicago.rds

127 KB
Binary file not shown.

03_GettingData/dplyr/dplyr.Rmd

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
% Managing Data Frames with `dplyr`
% Biostatistics 140.776
%

```{r, echo=FALSE, results="hide"}
options(width = 50)
```

# dplyr

The data frame is a key data structure in statistics and in R.

* There is one observation per row

* Each column represents a variable or measure or characteristic

* The primary implementation that you will use is the default R
  implementation

* Other implementations exist, particularly in relational database systems


# dplyr

* Developed by Hadley Wickham of RStudio

* An optimized and distilled version of the `plyr` package (also by Hadley)

* Does not provide any "new" functionality per se, but **greatly**
  simplifies existing functionality in R

* Provides a "grammar" (in particular, verbs) for data manipulation

* Is **very** fast, as many key operations are coded in C++


# dplyr Verbs

* `select`: return a subset of the columns of a data frame

* `filter`: extract a subset of rows from a data frame based on
  logical conditions

* `arrange`: reorder rows of a data frame

* `rename`: rename variables in a data frame

* `mutate`: add new variables/columns or transform existing variables

* `summarise` / `summarize`: generate summary statistics of different
  variables in the data frame, possibly within strata

There is also a handy `print` method that prevents you from printing a
lot of data to the console.


# dplyr Properties

* The first argument is a data frame.

* The subsequent arguments describe what to do with it, and you can
  refer to columns in the data frame directly without using the `$`
  operator (just use the names).

* The result is a new data frame.

* Data frames must be properly formatted and annotated for all of this
  to be useful.


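
The properties listed above can be seen in a minimal sketch. The data frame `toy` and its columns `id` and `score` are made up for illustration and are not part of the course data; the chunk loads `dplyr` itself so that it can run on its own.

```{r}
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(id = 1:4, score = c(10, 25, 7, 31))

# The first argument is the data frame; `score` is used without `toy$`
high <- filter(toy, score > 20)

# The result is a new data frame; `toy` itself is unchanged
nrow(toy)
nrow(high)
class(high)
```
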
# Load the `dplyr` package

This step is important!

```{r}
library(dplyr)
```


# `select`

```{r}
chicago <- readRDS("chicago.rds")
dim(chicago)
head(select(chicago, 1:5))
```


# `select`

```{r}
names(chicago)[1:3]
head(select(chicago, city:dptp))
```

# `select`

In dplyr you can do

```{r,eval=FALSE}
head(select(chicago, -(city:dptp)))
```

Equivalent base R

```{r,eval=FALSE}
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])
```


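
For readers without `chicago.rds` at hand, the same selection patterns can be tried on a small stand-in. This is only a sketch: `toy` and its values are invented, and the column names merely mimic a few of those in the Chicago data. It also shows `starts_with()`, one of the selection helpers that `dplyr` makes available inside `select`.

```{r}
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(city = "chic", tmpd = c(31, 33), dptp = c(30, 29),
                  pm10 = c(34, NA), o3 = c(4.2, 3.3))

names(select(toy, city:dptp))          # a contiguous range of columns by name
names(select(toy, -(city:dptp)))       # everything except that range
names(select(toy, starts_with("pm")))  # helper matching on a name prefix
```
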
# `filter`

```{r}
chic.f <- filter(chicago, pm25tmean2 > 30)
head(select(chic.f, 1:3, pm25tmean2), 10)
```

# `filter`

```{r}
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)
```


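
One detail worth knowing when filtering pollution data with gaps: `filter` keeps only rows where the condition is `TRUE`, so rows where it evaluates to `NA` are dropped. The sketch below uses an invented data frame `toy` (not the Chicago data) to show this, and to show that comma-separated conditions behave like `&`.

```{r}
library(dplyr)

# Hypothetical toy data with some missing values
toy <- data.frame(day  = 1:5,
                  pm25 = c(35, NA, 12, 41, 33),
                  tmpd = c(85, 90, 88, 70, NA))

# Both conditions must be TRUE; rows where the test is NA are dropped
hot_and_dirty <- filter(toy, pm25 > 30 & tmpd > 80)
hot_and_dirty

# Separating conditions with a comma combines them in the same way as &
same_rows <- filter(toy, pm25 > 30, tmpd > 80)
identical(hot_and_dirty, same_rows)
```
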
# `arrange`

Reordering rows of a data frame (while preserving corresponding order
of other columns) is normally a pain to do in R.

```{r}
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)
```

# `arrange`

Rows can be arranged in descending order of a column too.

```{r}
chicago <- arrange(chicago, desc(date))
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)
```


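
A small sketch of what "preserving corresponding order of other columns" means in practice, using an invented three-row data frame `toy`. The base R line at the end is one common equivalent of the ascending sort, shown only for comparison.

```{r}
library(dplyr)

# Hypothetical toy data, deliberately out of date order
toy <- data.frame(date = as.Date(c("2005-03-01", "2005-01-15", "2005-02-10")),
                  pm25 = c(12, 30, 18))

asc <- arrange(toy, date)         # earliest date first; pm25 moves with its row
dsc <- arrange(toy, desc(date))   # latest date first
asc
dsc

# A base R equivalent of the ascending sort
toy[order(toy$date), ]
```
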
# `rename`

Renaming a variable in a data frame in R is surprisingly hard to do!

```{r,tidy=FALSE}
head(chicago[, 1:5], 3)
chicago <- rename(chicago, dewpoint = dptp,
                  pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
```


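
The argument order in `rename` is easy to get backwards: each pair is written `new_name = old_name`. The sketch below repeats the renaming on an invented two-column data frame `toy` and contrasts it with one way of doing the same thing in base R.

```{r}
library(dplyr)

# Hypothetical toy data reusing two of the original column names
toy <- data.frame(dptp = c(30, 29), pm25tmean2 = c(12, 30))

# rename() takes new_name = old_name pairs and returns a new data frame
renamed <- rename(toy, dewpoint = dptp, pm25 = pm25tmean2)
names(renamed)

# A base R equivalent for a single column, modifying `toy` in place
names(toy)[names(toy) == "dptp"] <- "dewpoint"
names(toy)
```
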
# `mutate`

```{r, tidy=FALSE}
chicago <- mutate(chicago,
                  pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(select(chicago, pm25, pm25detrend))
```

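
To see what the detrending step computes, here is a sketch on an invented four-value column, where the arithmetic can be checked by hand. The data frame `toy` is hypothetical; only the `mutate` expression mirrors the one used for the Chicago data.

```{r}
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(pm25 = c(10, NA, 20, 30))

# mean(..., na.rm = TRUE) ignores the NA: (10 + 20 + 30) / 3 = 20
toy <- mutate(toy, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))

# The missing observation stays missing after subtracting the mean
toy$pm25detrend
```
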
# `group_by`

Generating summary statistics by stratum

```{r, tidy=FALSE}
chicago <- mutate(chicago,
                  tempcat = factor(1 * (tmpd > 90),
                                   labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)
summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))
```


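
The `factor(1 * (...), labels = ...)` idiom is compact but a little opaque, so here is a sketch of it on an invented data frame `toy`. Note the assumption stated in the comment: the two labels only line up if both categories actually occur in the data.

```{r}
library(dplyr)

# Hypothetical toy data
toy <- data.frame(tmpd = c(60, 95, 70, 92),
                  pm25 = c(10, 30, 14, 26))

# 1 * (tmpd > 90) turns FALSE/TRUE into 0/1; factor() then labels 0 as
# "cold" and 1 as "hot" (this assumes both values occur in the data)
toy <- mutate(toy, tempcat = factor(1 * (tmpd > 90),
                                    labels = c("cold", "hot")))

# summarize() on grouped data returns one row per level of tempcat
by_temp <- summarize(group_by(toy, tempcat), pm25 = mean(pm25))
by_temp
```
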
# `group_by`

Generating summary statistics by stratum

```{r, tidy=FALSE}
chicago <- mutate(chicago,
                  year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))
```

```{r,echo=FALSE}
chicago$year <- NULL  ## Can't use mutate to create an existing variable
```


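
The `+ 1900` in the year calculation comes from how `POSIXlt` stores dates: its `year` component counts years since 1900. A short base R sketch on two invented dates makes this visible; the `format()` line is an alternative way to get the same year, not something the slides use.

```{r}
# Two hypothetical dates, for illustration only
d <- as.Date(c("1987-01-01", "2005-12-31"))

# POSIXlt stores the year as an offset from 1900 (87 and 105 here)
as.POSIXlt(d)$year
yr <- as.POSIXlt(d)$year + 1900
yr

# format() with "%Y" gives the four-digit year as a string
yr_alt <- as.integer(format(d, "%Y"))
yr_alt
```
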
# `%>%`

```{r,tidy=FALSE,eval=FALSE}
chicago %>%
  mutate(year = as.POSIXlt(date)$year + 1900) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE),
            o3 = max(o3tmean2, na.rm = TRUE),
            no2 = median(no2tmean2, na.rm = TRUE))
```

```{r,echo=FALSE}
chicago %>% mutate(year = as.POSIXlt(date)$year + 1900) %>% group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2, na.rm = TRUE), no2 = median(no2tmean2, na.rm = TRUE))

```


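
As a sketch of what the pipe buys you, the chunk below computes the same grouped summary twice on an invented data frame `toy`: once with nested calls and once with `%>%`. When a pipeline is split over several lines, each `%>%` should end its line; a line that starts with `%>%` is a syntax error because R treats the previous line as already complete.

```{r}
library(dplyr)

# Hypothetical toy data
toy <- data.frame(year = c(2004, 2004, 2005, 2005),
                  pm25 = c(10, 20, NA, 30))

# Nested form: read from the inside out
nested <- summarize(group_by(toy, year), pm25 = mean(pm25, na.rm = TRUE))

# Piped form: read top to bottom, with %>% at the end of each line
piped <- toy %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE))

# Both forms give the same result
identical(as.data.frame(nested), as.data.frame(piped))
```
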
# dplyr

Once you learn the dplyr "grammar", there are a few additional benefits:

* dplyr can work with other data frame "backends"

* `data.table` for large, fast tables

* SQL interface for relational databases via the DBI package
