Commit 4b01996

Stefán Örvar Sigmundsson and mine-cetinkaya-rundel authored
Various spelling/grammar corrections and conciseness recommendations. (#228)
* Update preface.Rmd Spelling/Grammar.
* Update 01-data-hello.Rmd Spelling/Grammar/Conciseness.
* Update 01-ex-data-hello.Rmd Spelling/Grammar/Conciseness.
* Update 02-data-design.Rmd Spelling/Grammar/Conciseness.
* Update 02-ex-data-design.Rmd Spelling/Grammar/Conciseness.
* Update 03-data-applications.Rmd Spelling/Grammar/Conciseness.
* Various spelling/grammar corrections and conciseness recommendations.

Co-authored-by: Mine Çetinkaya-Rundel <[email protected]>
1 parent 87cf557 commit 4b01996

Some content is hidden: large commits have some content hidden by default.

49 files changed: 199 additions and 199 deletions

01-data-hello.Rmd

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -132,7 +132,7 @@ It is possible that the 8% difference in the stent study is due to this natural
132132
However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.
133133
So, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?
134134

135-
While we don't yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.
135+
While we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.
136136

137137
**Be careful:** Do not generalize the results of this study to all patients and all stents.
138138
This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.
@@ -288,7 +288,7 @@ Examine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables
288288
Each of these variables is inherently different from the other three, yet some share certain characteristics.
289289

290290
First consider `unemployment_rate`, which is said to be a \index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.
291-
On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes doesn't have any clear meaning.
291+
On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes does not have any clear meaning.
292292
Instead, we would consider area codes as a categorical variable.
293293

294294
```{r include=FALSE}
@@ -550,7 +550,7 @@ terms_chp_1 <- c(terms_chp_1, "experiment", "randomized experiment", "placebo")
550550
Researchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.
551551
For instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.
552552
In each of these situations, researchers merely observe the data that arise.
553-
In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they don't offer a mechanism for controlling for confounding variables.
553+
In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.
554554

555555
```{r include=FALSE}
556556
terms_chp_1 <- c(terms_chp_1, "observational study", "cohort")

02-data-design.Rmd

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -102,7 +102,7 @@ include_graphics("images/mn-winter/mn-winter.jpg")
102102

103103
Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics.
104104
For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years.
105-
Instead of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
105+
Instead, of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
106106

107107
### Sampling from a population
108108

@@ -420,7 +420,7 @@ par(par_og) # restore original par
420420
```
421421

422422
Sometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.
423-
Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another.
423+
Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves do not look very different from one another.
424424
For example, if neighborhoods represented clusters, then cluster or multistage sampling work best when the populations inside each neighborhood are very diverse.
425425
A downside of these methods is that more advanced techniques are typically required to analyze the data, though the methods in this book can be extended to handle such data.
426426

@@ -528,13 +528,13 @@ terms_chp_2 <- c(terms_chp_2, "treatment group", "control group")
528528

529529
Put yourself in the place of a person in the study.
530530
If you are in the treatment group, you are given a fancy new drug that you anticipate will help you.
531-
On the other hand, a person in the other group doesn't receive the drug and sits idly, hoping her participation doesn't increase her risk of death.
531+
On the other hand, a person in the other group does not receive the drug and sits idly, hoping her participation does not increase her risk of death.
532532
These perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.
533533

534534
Researchers aren't usually interested in the emotional effect, which might bias the study.
535535
To circumvent this problem, researchers do not want patients to know which group they are in.
536536
When researchers keep the patients uninformed about their treatment, the study is said to be **blind**.
537-
But there is one problem: if a patient doesn't receive a treatment, they will know they're in the control group.
537+
But there is one problem: if a patient does not receive a treatment, they will know they're in the control group.
538538
A solution to this problem is to give a fake treatment to patients in the control group.
539539
This is called a **placebo**, and an effective placebo is the key to making a study truly blind.
540540
A classic example of a placebo is a sugar pill that is made to look like the actual treatment pill.
@@ -583,7 +583,7 @@ These questions may have even arisen in your mind when in the general experiment
583583

584584
There are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is ethically "correct".
585585
For instance, is it ethical to use a sham surgery when it creates a risk to the patient?
586-
However, if we don't use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
586+
However, if we do not use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
587587
Ultimately, this is a difficult situation where we cannot perfectly protect both the patients who have volunteered for the study and the patients who may benefit (or not) from the treatment in the future.
588588

589589
## Observational studies

03-data-applications.Rmd

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,7 @@ passwords_var_def %>%
9393
column_spec(2, width = "30em")
9494
```
9595

96-
We now have a better sense of what each column represents, but we don't yet know much about the characteristics of each of the variables.
96+
We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.
9797

9898
::: {.workedexample data-latex=""}
9999
Determine whether each variable in the passwords dataset is numerical or categorical.

04-explore-categorical.Rmd

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,7 @@ A bar plot is a common way to display a single categorical variable.
8383
The left panel of Figure \@ref(fig:loan-homeownership-bar-plot) shows a **bar plot** for the `homeownership` variable.
8484
In the right panel, the counts are converted into proportions, showing the proportion of observations that are in each level.
8585

86-
```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
86+
```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts, and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
8787
p_count <- ggplot(loans, aes(x = homeownership)) +
8888
geom_bar(fill = IMSCOL["green", "full"]) +
8989
labs(x = "Homeownership", y = "Count")
@@ -146,7 +146,7 @@ When is the stacked, dodged, or standardized bar plot the most useful?
146146

147147
------------------------------------------------------------------------
148148

149-
The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`), since we are effectively grouping by one variable first and then breaking it down by the others.
149+
The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`) since we are effectively grouping by one variable first and then breaking it down by the others.
150150

151151
Dodged bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable.
152152
It is also easy to discern the number of cases in each of the six different group combinations.
@@ -196,7 +196,7 @@ p_mosaic_1 + p_mosaic_2 +
196196

197197
In Figure \@ref(fig:loan-homeownership-type-mosaic-plot), we chose to first split by the homeowner status of the borrower.
198198
However, we could have instead first split by the application type, as in Figure \@ref(fig:loan-app-type-mosaic-plot).
199-
Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable, if these labels are reasonable to attach to the variables under consideration.
199+
Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable if these labels are reasonable to attach to the variables under consideration.
200200

201201
```{r loan-app-type-mosaic-plot, fig.cap = "Mosaic plot where loans are grouped by homeownership after they have been divided into individual and joint application types."}
202202
ggplot(loans) +
@@ -213,7 +213,7 @@ However, we have not discussed how the values in the bar and mosaic plots that s
213213
In this section we will investigate fractional breakdown of one variable in another and we can modify our contingency table to provide such a view.
214214
Table \@ref(tab:loan-home-app-type-row-proportions) shows **row proportions** for Table \@ref(tab:loan-home-app-type-totals), which are computed as the counts divided by their row totals.
215215
The value 3496 at the intersection of individual and rent is replaced by $3496 / 8505 = 0.411,$ i.e., 3496 divided by its row total, 8505.
216-
So what does 0.411 represent?
216+
So, what does 0.411 represent?
217217
It corresponds to the proportion of individual applicants who rent.
218218

219219
```{r loan-home-app-type-row-proportions, out.width = "70%"}
@@ -280,7 +280,7 @@ What does 0.135 represent in Table \@ref(tab:loan-home-app-type-column-proportio
280280
Data scientists use statistics to build email spam filters.
281281
By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy.
282282
One such characteristic is whether the email contains no numbers, small numbers, or big numbers.
283-
Another characteristic is the email format, which indicates whether or not an email has any HTML content, such as bolded text.
283+
Another characteristic is the email format, which indicates whether an email has any HTML content, such as bolded text.
284284
We'll focus on email format and spam status using the dataset; these variables are summarized in a contingency table in Table \@ref(tab:email-count-table).
285285
Which would be more helpful to someone hoping to classify email as spam or regular email for this table: row or column proportions?
286286

@@ -322,7 +322,7 @@ Are there any obvious scenarios where one might be more useful than the other?
322322
------------------------------------------------------------------------
323323

324324
None that we think are obvious!
325-
What is distinct about the email example is that the two loan variables don't have a clear explanatory-response variable relationship that we might hypothesize.
325+
What is distinct about the email example is that the two loan variables do not have a clear explanatory-response variable relationship that we might hypothesize.
326326
Usually it is most useful to "condition" on the explanatory variable.
327327
For instance, in the email example, the email format was seen as a possible explanatory variable of whether the message was spam, so we would find it more interesting to compute the relative frequencies (proportions) for each email format.
328328
:::
@@ -358,7 +358,7 @@ p_pie + p_bar
358358
```
359359

360360
Pie charts can work well when the goal is to visualize a categorical variable with very few levels, and especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.).
361-
However they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
361+
However, they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
362362
For example, the pie chart and the bar plot in Figure \@ref(fig:loan-grade-pie-chart) both represent the distribution of loan grades (A through G).
363363
In this case, it is far easier to compare the counts of each loan grade using the bar plot than the pie chart.
364364

@@ -391,7 +391,7 @@ Just like with pie charts, they work best when the number of levels represented
391391
However, unlike pie charts, they can make it easier to compare proportions that represent non-simple fractions.
392392
Figure \@ref(fig:loan-waffle) displays two examples of waffle charts: one for the distribution of homeownership and the other for the distribution of loan status.
393393

394-
```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, morgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.", fig.asp = 0.5, fig.width=8}
394+
```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.", fig.asp = 0.5, fig.width=8}
395395
p_waffle_homeownership <- loans %>%
396396
count(homeownership) %>%
397397
ggplot(aes(fill = homeownership, values = n)) +
@@ -421,7 +421,7 @@ p_waffle_homeownership +
421421
## Comparing numerical data across groups
422422

423423
Some of the more interesting investigations can be considered by examining numerical data across groups.
424-
In this section we will expand on a few methods we've already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
424+
In this section we will expand on a few methods we have already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
425425

426426
We will revisit the `county` dataset and compare the median household income for counties that gained population from 2010 to 2017 versus counties that had no gain.
427427
While we might like to make a causal connection between income and population growth, remember that these are observational data and so such an interpretation would be, at best, half-baked.
@@ -639,7 +639,7 @@ Based on Figure \@ref(fig:countyIncomeRidgeMulti), what can you say about how me
639639
### Summary
640640

641641
Fluently working with categorical variables is an important skill for data analysts.
642-
In this chapter we've introduced different visualizations and numerical summaries applied to categorical variables.
642+
In this chapter we have introduced different visualizations and numerical summaries applied to categorical variables.
643643
The graphical visualizations are even more descriptive when two variables are presented simultaneously.
644644
We presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.
645645

@@ -648,7 +648,7 @@ We presented bar plots, mosaic plots, pie charts, and estimations of conditional
648648
We introduced the following terms in the chapter.
649649
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
650650
We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.
651-
However you should be able to easily spot them as **bolded text**.
651+
However, you should be able to easily spot them as **bolded text**.
652652

653653
```{r}
654654
make_terms_table(terms_chp_4)
