DataScienceSpecialization
diff --git a/03_GettingData/dplyr/chicago.rds b/03_GettingData/dplyr/chicago.rds
127 KB
diff --git a/03_GettingData/dplyr/dplyr.Rmd b/03_GettingData/dplyr/dplyr.Rmd
Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
+% Managing Data Frames with `dplyr`
+% Biostatistics 140.776
+%
+
+```{r, echo=FALSE, results="hide"}
+options(width = 50)
+```
+
+# dplyr
+
+The data frame is a key data structure in statistics and in R.
+
+* There is one observation per row
+
+* Each column represents a variable or measure or characteristic
+
+* The primary implementation that you will use is the default R
+  implementation
+
+* Other implementations exist, particularly in relational database systems
+
+
+# dplyr
+
+* Developed by Hadley Wickham of RStudio
+
+* An optimized and distilled version of the `plyr` package (also by Hadley)
+
+* Does not provide any "new" functionality per se, but **greatly**
+  simplifies existing functionality in R
+
+* Provides a "grammar" (in particular, verbs) for data manipulation
+
+* Is **very** fast, as many key operations are coded in C++
+
+
+# dplyr Verbs
+
+* `select`: return a subset of the columns of a data frame
+
+* `filter`: extract a subset of rows from a data frame based on
+  logical conditions
+
+* `arrange`: reorder rows of a data frame
+
+
+* `rename`: rename variables in a data frame
+
+* `mutate`: add new variables/columns or transform existing variables
+
+* `summarise` / `summarize`: generate summary statistics of different
+  variables in the data frame, possibly within strata
+
+There is also a handy `print` method that prevents you from printing a
+lot of data to the console.
+
+
+
+# dplyr Properties
+
+* The first argument is a data frame.
+
+* The subsequent arguments describe what to do with it, and you can
+  refer to columns in the data frame directly without using the `$`
+  operator (just use the names).
+
+* The result is a new data frame
+
+* Data frames must be properly formatted and annotated for this to all
+  be useful
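+
+A minimal sketch of these properties, using the built-in `mtcars` data
+(an assumption made here so the example runs without `chicago.rds`):
+
+```{r}
+library(dplyr)
+
+## First argument is a data frame; columns are named directly, no `$`
+heavy <- filter(mtcars, wt > 5)
+
+## The result is a new data frame; the input is left unchanged
+class(heavy)
+nrow(heavy)
+nrow(mtcars)
+```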
+
+
+# Load the `dplyr` package
+
+
+This step is important!
+
+```{r}
+library(dplyr)
+```
+
+
+# `select`
+
+```{r}
+chicago <- readRDS("chicago.rds")
+dim(chicago)
+head(select(chicago, 1:5))
+```
+
+
+# `select`
+
+```{r}
+names(chicago)[1:3]
+head(select(chicago, city:dptp))
+```
+
+# `select`
+
+In dplyr you can do
+
+```{r,eval=FALSE}
+head(select(chicago, -(city:dptp)))
+```
+
+Equivalent base R
+
+```{r,eval=FALSE}
+i <- match("city", names(chicago))
+j <- match("dptp", names(chicago))
+head(chicago[, -(i:j)])
+```
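+
+As a quick check that the two forms drop the same columns, here is a
+sketch on the built-in `mtcars` data (substituted here so it runs
+without `chicago.rds`; the column range `cyl:hp` is only illustrative):
+
+```{r}
+library(dplyr)
+
+## dplyr: drop a contiguous range of columns by name
+a <- select(mtcars, -(cyl:hp))
+
+## base R: look up the positions first, then drop by index
+i <- match("cyl", names(mtcars))
+j <- match("hp", names(mtcars))
+b <- mtcars[, -(i:j)]
+
+identical(names(a), names(b))
+```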
+
+
+
+# `filter`
+
+```{r}
+chic.f <- filter(chicago, pm25tmean2 > 30)
+head(select(chic.f, 1:3, pm25tmean2), 10)
+```
+
+# `filter`
+
+```{r}
+chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
+head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)
+```
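+
+One detail matters when filtering data with missing values: `filter`
+keeps only rows where the condition is `TRUE`, so rows where it
+evaluates to `NA` are dropped, whereas base `[` subsetting returns them
+as all-`NA` rows. A small sketch with a made-up data frame (`toy` and
+its values are hypothetical):
+
+```{r}
+library(dplyr)
+
+toy <- data.frame(id = 1:4, pm25 = c(35, NA, 10, 42))
+
+## filter() drops the row where the comparison is NA
+f <- filter(toy, pm25 > 30)
+
+## base subsetting returns an all-NA row for the missing comparison
+b <- toy[toy$pm25 > 30, ]
+
+nrow(f)
+nrow(b)
+```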
+
+
+# `arrange`
+
+Reordering rows of a data frame (while preserving the corresponding
+order of the other columns) is normally a pain to do in R.
+
+```{r}
+chicago <- arrange(chicago, date)
+head(select(chicago, date, pm25tmean2), 3)
+tail(select(chicago, date, pm25tmean2), 3)
+```
+
+# `arrange`
+
+Rows can be arranged in descending order too, using `desc()`.
+
+```{r}
+chicago <- arrange(chicago, desc(date))
+head(select(chicago, date, pm25tmean2), 3)
+tail(select(chicago, date, pm25tmean2), 3)
+```
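+
+For comparison, the base R version uses `order()` to build a row index
+and then subsets with it. A sketch on a small made-up data frame (`toy`
+is hypothetical):
+
+```{r}
+library(dplyr)
+
+toy <- data.frame(
+    date = as.Date(c("2005-03-01", "2005-01-01", "2005-02-01")),
+    pm25 = c(12, 30, 21))
+
+## dplyr: sort rows by date, descending
+a <- arrange(toy, desc(date))
+
+## base R: compute the permutation, then index the rows with it
+b <- toy[order(toy$date, decreasing = TRUE), ]
+
+a$pm25
+b$pm25
+```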
+
+
+# `rename`
+
+Renaming a variable in a data frame in R is surprisingly hard to do!
+
+```{r,tidy=FALSE}
+head(chicago[, 1:5], 3)
+chicago <- rename(chicago, dewpoint = dptp, 
+                  pm25 = pm25tmean2)
+head(chicago[, 1:5], 3)
+```
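+
+The syntax is `new_name = old_name`. For comparison, a sketch of the
+base R route on a made-up data frame (`toy` is hypothetical):
+
+```{r}
+library(dplyr)
+
+toy <- data.frame(dptp = c(31.5, 33.0), pm25tmean2 = c(12, 30))
+
+## dplyr: new name on the left, old name on the right
+a <- rename(toy, dewpoint = dptp, pm25 = pm25tmean2)
+
+## base R: find each old name and overwrite it in place
+b <- toy
+names(b)[names(b) == "dptp"] <- "dewpoint"
+names(b)[names(b) == "pm25tmean2"] <- "pm25"
+
+identical(names(a), names(b))
+```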
+
+
+# `mutate`
+
+```{r, tidy=FALSE}
+chicago <- mutate(chicago, 
+                  pm25detrend=pm25-mean(pm25, na.rm=TRUE))
+head(select(chicago, pm25, pm25detrend))
+```
+
+# `group_by`
+
+Generating summary statistics by stratum
+
+```{r, tidy=FALSE}
+chicago <- mutate(chicago, 
+                  tempcat = factor(1 * (tmpd > 90), 
+                                   labels = c("cold", "hot")))
+hotcold <- group_by(chicago, tempcat)
+summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE), 
+          o3 = max(o3tmean2, na.rm = TRUE), 
+          no2 = median(no2tmean2, na.rm = TRUE))
+```
+
+
+# `group_by`
+
+Generating summary statistics by stratum
+
+```{r, tidy=FALSE}
+chicago <- mutate(chicago, 
+                  year = as.POSIXlt(date)$year + 1900)
+years <- group_by(chicago, year)
+summarize(years, pm25 = mean(pm25, na.rm = TRUE), 
+          o3 = max(o3tmean2, na.rm = TRUE), 
+          no2 = median(no2tmean2, na.rm = TRUE))
+```
+
+```{r,echo=FALSE}
+chicago$year <- NULL  ## Can't use mutate to create an existing variable
+```
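+
+`group_by` does not change the data itself; it records the grouping so
+that `summarize` returns one row per group. A sketch on the built-in
+`mtcars` data (used here as a stand-in so the example runs without
+`chicago.rds`), with a base R `tapply` call for comparison:
+
+```{r}
+library(dplyr)
+
+by_cyl <- group_by(mtcars, cyl)
+
+## Same rows and columns as before; only the grouping is recorded
+dim(by_cyl)
+
+## One row per distinct value of cyl
+s <- summarize(by_cyl, mpg = mean(mpg), hp = max(hp))
+s
+
+## Base R equivalent for a single statistic
+tapply(mtcars$mpg, mtcars$cyl, mean)
+```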
+
+
+# `%>%`
+
+```{r,tidy=FALSE,eval=FALSE}
+chicago %>% mutate(year = as.POSIXlt(date)$year + 1900) %>%
+        group_by(year) %>%
+        summarize(pm25 = mean(pm25, na.rm = TRUE), 
+          o3 = max(o3tmean2, na.rm = TRUE), 
+          no2 = median(no2tmean2, na.rm = TRUE))
+```
+
+```{r,echo=FALSE}
+chicago %>% mutate(year = as.POSIXlt(date)$year + 1900) %>% group_by(year) %>% 
+summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2, na.rm = TRUE), no2 = median(no2tmean2, na.rm = TRUE))
+
+```
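+
+The pipe passes its left-hand side as the first argument of the call on
+its right, so `x %>% f(y)` is `f(x, y)`. Each `%>%` has to end a line
+rather than start the next one, otherwise R treats the previous line as
+a complete expression. A sketch on the built-in `mtcars` data (a
+stand-in for `chicago`) comparing nested and piped forms:
+
+```{r}
+library(dplyr)
+
+## Nested calls read inside-out
+nested <- summarize(group_by(filter(mtcars, wt > 3), cyl),
+                    mpg = mean(mpg))
+
+## The same steps, piped, read top to bottom
+piped <- mtcars %>%
+    filter(wt > 3) %>%
+    group_by(cyl) %>%
+    summarize(mpg = mean(mpg))
+
+identical(as.data.frame(nested), as.data.frame(piped))
+```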
+
+
+# dplyr
+
+Once you learn the dplyr "grammar" there are a few additional benefits:
+
+* dplyr can work with other data frame "backends"
+
+* `data.table` for large fast tables
+
+* SQL interface for relational databases via the DBI package
+
+