01-data-hello.Rmd
3 additions & 3 deletions
@@ -132,7 +132,7 @@ It is possible that the 8% difference in the stent study is due to this natural
However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.
So, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?

-While we don't yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.
+While we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.

**Be careful:** Do not generalize the results of this study to all patients and all stents.
This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.
@@ -288,7 +288,7 @@ Examine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables
Each of these variables is inherently different from the other three, yet some share certain characteristics.

First consider `unemployment_rate`, which is said to be a \index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.
-On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes doesn't have any clear meaning.
+On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes does not have any clear meaning.
Instead, we would consider area codes as a categorical variable.
Researchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.
For instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.
In each of these situations, researchers merely observe the data that arise.
-In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they don't offer a mechanism for controlling for confounding variables.
+In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.
Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics.
For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years.
-Instead of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
+Instead, of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.

### Sampling from a population

@@ -420,7 +420,7 @@ par(par_og) # restore original par
```

Sometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.
-Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another.
+Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves do not look very different from one another.
For example, if neighborhoods represented clusters, then cluster or multistage sampling work best when the populations inside each neighborhood are very diverse.
A downside of these methods is that more advanced techniques are typically required to analyze the data, though the methods in this book can be extended to handle such data.
Put yourself in the place of a person in the study.
If you are in the treatment group, you are given a fancy new drug that you anticipate will help you.
-On the other hand, a person in the other group doesn't receive the drug and sits idly, hoping her participation doesn't increase her risk of death.
+On the other hand, a person in the other group does not receive the drug and sits idly, hoping her participation does not increase her risk of death.
These perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.

Researchers aren't usually interested in the emotional effect, which might bias the study.
To circumvent this problem, researchers do not want patients to know which group they are in.
When researchers keep the patients uninformed about their treatment, the study is said to be **blind**.
-But there is one problem: if a patient doesn't receive a treatment, they will know they're in the control group.
+But there is one problem: if a patient does not receive a treatment, they will know they're in the control group.
A solution to this problem is to give a fake treatment to patients in the control group.
This is called a **placebo**, and an effective placebo is the key to making a study truly blind.
A classic example of a placebo is a sugar pill that is made to look like the actual treatment pill.
@@ -583,7 +583,7 @@ These questions may have even arisen in your mind when in the general experiment

There are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is ethically "correct".
For instance, is it ethical to use a sham surgery when it creates a risk to the patient?
-However, if we don't use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
+However, if we do not use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
Ultimately, this is a difficult situation where we cannot perfectly protect both the patients who have volunteered for the study and the patients who may benefit (or not) from the treatment in the future.
04-explore-categorical.Rmd
11 additions & 11 deletions
@@ -83,7 +83,7 @@ A bar plot is a common way to display a single categorical variable.
The left panel of Figure \@ref(fig:loan-homeownership-bar-plot) shows a **bar plot** for the `homeownership` variable.
In the right panel, the counts are converted into proportions, showing the proportion of observations that are in each level.

-```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
+```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts, and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
-The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`), since we are effectively grouping by one variable first and then breaking it down by the others.
+The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`) since we are effectively grouping by one variable first and then breaking it down by the others.

Dodged bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable.
It is also easy to discern the number of cases in each of the six different group combinations.
@@ -196,7 +196,7 @@ p_mosaic_1 + p_mosaic_2 +

In Figure \@ref(fig:loan-homeownership-type-mosaic-plot), we chose to first split by the homeowner status of the borrower.
However, we could have instead first split by the application type, as in Figure \@ref(fig:loan-app-type-mosaic-plot).
-Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable, if these labels are reasonable to attach to the variables under consideration.
+Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable if these labels are reasonable to attach to the variables under consideration.

```{r loan-app-type-mosaic-plot, fig.cap = "Mosaic plot where loans are grouped by homeownership after they have been divided into individual and joint application types."}
ggplot(loans) +
@@ -213,7 +213,7 @@ However, we have not discussed how the values in the bar and mosaic plots that s
In this section we will investigate fractional breakdown of one variable in another and we can modify our contingency table to provide such a view.
Table \@ref(tab:loan-home-app-type-row-proportions) shows **row proportions** for Table \@ref(tab:loan-home-app-type-totals), which are computed as the counts divided by their row totals.
The value 3496 at the intersection of individual and rent is replaced by $3496 / 8505 = 0.411,$ i.e., 3496 divided by its row total, 8505.
-So what does 0.411 represent?
+So, what does 0.411 represent?
It corresponds to the proportion of individual applicants who rent.
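
Editorial illustration, not part of the commit: the row-proportion arithmetic in the hunk above can be reproduced in R. This is a minimal sketch that uses only the two counts quoted in the text (3496 and 8505). The variable names and the collapsed `other` category (the 8505 - 3496 individual applicants who do not rent) are assumptions made for illustration; they do not come from the book's source.

```r
# Counts quoted in the text: 3496 individual applicants rent,
# out of a row total of 8505 individual applicants.
individual_rent  <- 3496
individual_total <- 8505

# A row proportion is the cell count divided by its row total.
row_prop_rent <- individual_rent / individual_total
round(row_prop_rent, 3)

# The same idea for a whole row at once, using base R's prop.table().
# "other" is a hypothetical collapsed category (all non-renters),
# derived here as the row total minus the rent count.
individual_row <- matrix(
  c(individual_rent, individual_total - individual_rent),
  nrow = 1,
  dimnames = list("individual", c("rent", "other"))
)

# margin = 1 divides each cell by its row total, so each row sums to 1.
row_props <- prop.table(individual_row, margin = 1)
round(row_props, 3)
```

Passing `margin = 2` instead would divide by column totals, which is the column-proportion view the chapter goes on to discuss.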
@@ -280,7 +280,7 @@ What does 0.135 represent in Table \@ref(tab:loan-home-app-type-column-proportio
Data scientists use statistics to build email spam filters.
By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy.
One such characteristic is whether the email contains no numbers, small numbers, or big numbers.
-Another characteristic is the email format, which indicates whether or not an email has any HTML content, such as bolded text.
+Another characteristic is the email format, which indicates whether an email has any HTML content, such as bolded text.
We'll focus on email format and spam status using the dataset; these variables are summarized in a contingency table in Table \@ref(tab:email-count-table).
Which would be more helpful to someone hoping to classify email as spam or regular email for this table: row or column proportions?

@@ -322,7 +322,7 @@ Are there any obvious scenarios where one might be more useful than the other?
-What is distinct about the email example is that the two loan variables don't have a clear explanatory-response variable relationship that we might hypothesize.
+What is distinct about the email example is that the two loan variables do not have a clear explanatory-response variable relationship that we might hypothesize.
Usually it is most useful to "condition" on the explanatory variable.
For instance, in the email example, the email format was seen as a possible explanatory variable of whether the message was spam, so we would find it more interesting to compute the relative frequencies (proportions) for each email format.
:::
@@ -358,7 +358,7 @@ p_pie + p_bar
```

Pie charts can work well when the goal is to visualize a categorical variable with very few levels, and especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.).
-However they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
+However, they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
For example, the pie chart and the bar plot in Figure \@ref(fig:loan-grade-pie-chart) both represent the distribution of loan grades (A through G).
In this case, it is far easier to compare the counts of each loan grade using the bar plot than the pie chart.

@@ -391,7 +391,7 @@ Just like with pie charts, they work best when the number of levels represented
However, unlike pie charts, they can make it easier to compare proportions that represent non-simple fractions.
Figure \@ref(fig:loan-waffle) displays two examples of waffle charts: one for the distribution of homeownership and the other for the distribution of loan status.

-```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, morgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.", fig.asp = 0.5, fig.width=8}
+```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.", fig.asp = 0.5, fig.width=8}
p_waffle_homeownership <- loans %>%
count(homeownership) %>%
ggplot(aes(fill = homeownership, values = n)) +
@@ -421,7 +421,7 @@ p_waffle_homeownership +
## Comparing numerical data across groups

Some of the more interesting investigations can be considered by examining numerical data across groups.
-In this section we will expand on a few methods we've already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
+In this section we will expand on a few methods we have already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.

We will revisit the `county` dataset and compare the median household income for counties that gained population from 2010 to 2017 versus counties that had no gain.
While we might like to make a causal connection between income and population growth, remember that these are observational data and so such an interpretation would be, at best, half-baked.
@@ -639,7 +639,7 @@ Based on Figure \@ref(fig:countyIncomeRidgeMulti), what can you say about how me
### Summary

Fluently working with categorical variables is an important skill for data analysts.
-In this chapter we've introduced different visualizations and numerical summaries applied to categorical variables.
+In this chapter we have introduced different visualizations and numerical summaries applied to categorical variables.
The graphical visualizations are even more descriptive when two variables are presented simultaneously.
We presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.

@@ -648,7 +648,7 @@ We presented bar plots, mosaic plots, pie charts, and estimations of conditional
We introduced the following terms in the chapter.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.
-However you should be able to easily spot them as **bolded text**.
+However, you should be able to easily spot them as **bolded text**.