OpenIntroStat
diff --git a/01-data-hello.Rmd b/01-data-hello.Rmd
Lines changed: 3 additions & 3 deletions
diff --git a/02-data-design.Rmd b/02-data-design.Rmd
Lines changed: 5 additions & 5 deletions
diff --git a/03-data-applications.Rmd b/03-data-applications.Rmd
Lines changed: 1 addition & 1 deletion
diff --git a/04-explore-categorical.Rmd b/04-explore-categorical.Rmd
Lines changed: 11 additions & 11 deletions
@@ -132,7 +132,7 @@ It is possible that the 8% difference in the stent study is due to this natural
 However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.
 So, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?
 
-While we don't yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.
+While we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.
 
 **Be careful:** Do not generalize the results of this study to all patients and all stents.
 This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.
@@ -288,7 +288,7 @@ Examine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables
 Each of these variables is inherently different from the other three, yet some share certain characteristics.
 
 First consider `unemployment_rate`, which is said to be a \index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.
-On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes doesn't have any clear meaning.
+On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes do not have any clear meaning.
 Instead, we would consider area codes as a categorical variable.
 
 ```{r include=FALSE}
@@ -550,7 +550,7 @@ terms_chp_1 <- c(terms_chp_1, "experiment", "randomized experiment", "placebo")
 Researchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.
 For instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.
 In each of these situations, researchers merely observe the data that arise.
-In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they don't offer a mechanism for controlling for confounding variables.
+In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.
 
 ```{r include=FALSE}
 terms_chp_1 <- c(terms_chp_1, "observational study", "cohort")
 
@@ -102,7 +102,7 @@ include_graphics("images/mn-winter/mn-winter.jpg")
 
 Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics.
 For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years.
-Instead of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
+Instead of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.
 
 ### Sampling from a population
 
@@ -420,7 +420,7 @@ par(par_og)                           # restore original par
 ```
 
 Sometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.
-Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another.
+Also, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves do not look very different from one another.
 For example, if neighborhoods represented clusters, then cluster or multistage sampling work best when the populations inside each neighborhood are very diverse.
 A downside of these methods is that more advanced techniques are typically required to analyze the data, though the methods in this book can be extended to handle such data.
 
@@ -528,13 +528,13 @@ terms_chp_2 <- c(terms_chp_2, "treatment group", "control group")
 
 Put yourself in the place of a person in the study.
 If you are in the treatment group, you are given a fancy new drug that you anticipate will help you.
-On the other hand, a person in the other group doesn't receive the drug and sits idly, hoping her participation doesn't increase her risk of death.
+On the other hand, a person in the other group does not receive the drug and sits idly, hoping her participation does not increase her risk of death.
 These perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.
 
 Researchers aren't usually interested in the emotional effect, which might bias the study.
 To circumvent this problem, researchers do not want patients to know which group they are in.
 When researchers keep the patients uninformed about their treatment, the study is said to be **blind**.
-But there is one problem: if a patient doesn't receive a treatment, they will know they're in the control group.
+But there is one problem: if a patient does not receive a treatment, they will know they're in the control group.
 A solution to this problem is to give a fake treatment to patients in the control group.
 This is called a **placebo**, and an effective placebo is the key to making a study truly blind.
 A classic example of a placebo is a sugar pill that is made to look like the actual treatment pill.
@@ -583,7 +583,7 @@ These questions may have even arisen in your mind when in the general experiment
 
 There are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is ethically "correct".
 For instance, is it ethical to use a sham surgery when it creates a risk to the patient?
-However, if we don't use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
+However, if we do not use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.
 Ultimately, this is a difficult situation where we cannot perfectly protect both the patients who have volunteered for the study and the patients who may benefit (or not) from the treatment in the future.
 
 ## Observational studies
 
@@ -93,7 +93,7 @@ passwords_var_def %>%
   column_spec(2, width = "30em")
 ```
 
-We now have a better sense of what each column represents, but we don't yet know much about the characteristics of each of the variables.
+We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.
 
 ::: {.workedexample data-latex=""}
 Determine whether each variable in the passwords dataset is numerical or categorical.
 
@@ -83,7 +83,7 @@ A bar plot is a common way to display a single categorical variable.
 The left panel of Figure \@ref(fig:loan-homeownership-bar-plot) shows a **bar plot** for the `homeownership` variable.
 In the right panel, the counts are converted into proportions, showing the proportion of observations that are in each level.
 
-```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
+```{r loan-homeownership-bar-plot, fig.cap = "Two bar plots: the left panel shows the counts, and the right panel shows the proportions of values of the homeownership variable.", fig.asp=0.5}
 p_count <- ggplot(loans, aes(x = homeownership)) +
   geom_bar(fill = IMSCOL["green", "full"]) + 
   labs(x = "Homeownership", y = "Count")
@@ -146,7 +146,7 @@ When is the stacked, dodged, or standardized bar plot the most useful?
 
 ------------------------------------------------------------------------
 
-The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`), since we are effectively grouping by one variable first and then breaking it down by the others.
+The stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`) since we are effectively grouping by one variable first and then breaking it down by the others.
 
 Dodged bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable.
 It is also easy to discern the number of cases in each of the six different group combinations.
@@ -196,7 +196,7 @@ p_mosaic_1 + p_mosaic_2 +
 
 In Figure \@ref(fig:loan-homeownership-type-mosaic-plot), we chose to first split by the homeowner status of the borrower.
 However, we could have instead first split by the application type, as in Figure \@ref(fig:loan-app-type-mosaic-plot).
-Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable, if these labels are reasonable to attach to the variables under consideration.
+Like with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable if these labels are reasonable to attach to the variables under consideration.
 
 ```{r loan-app-type-mosaic-plot, fig.cap = "Mosaic plot where loans are grouped by homeownership after they have been divided into individual and joint application types."}
 ggplot(loans) +
@@ -213,7 +213,7 @@ However, we have not discussed how the values in the bar and mosaic plots that s
 In this section we will investigate fractional breakdown of one variable in another and we can modify our contingency table to provide such a view.
 Table \@ref(tab:loan-home-app-type-row-proportions) shows **row proportions** for Table \@ref(tab:loan-home-app-type-totals), which are computed as the counts divided by their row totals.
 The value 3496 at the intersection of individual and rent is replaced by $3496 / 8505 = 0.411,$ i.e., 3496 divided by its row total, 8505.
-So what does 0.411 represent?
+So, what does 0.411 represent?
 It corresponds to the proportion of individual applicants who rent.
 
 ```{r loan-home-app-type-row-proportions, out.width = "70%"}
@@ -280,7 +280,7 @@ What does 0.135 represent in Table \@ref(tab:loan-home-app-type-column-proportio
 Data scientists use statistics to build email spam filters.
 By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy.
 One such characteristic is whether the email contains no numbers, small numbers, or big numbers.
-Another characteristic is the email format, which indicates whether or not an email has any HTML content, such as bolded text.
+Another characteristic is the email format, which indicates whether an email has any HTML content, such as bolded text.
 We'll focus on email format and spam status using the dataset; these variables are summarized in a contingency table in Table \@ref(tab:email-count-table).
 Which would be more helpful to someone hoping to classify email as spam or regular email for this table: row or column proportions?
 
@@ -322,7 +322,7 @@ Are there any obvious scenarios where one might be more useful than the other?
 ------------------------------------------------------------------------
 
 None that we think are obvious!
-What is distinct about the email example is that the two loan variables don't have a clear explanatory-response variable relationship that we might hypothesize.
+What is distinct about the email example is that the two loan variables do not have a clear explanatory-response variable relationship that we might hypothesize.
 Usually it is most useful to "condition" on the explanatory variable.
 For instance, in the email example, the email format was seen as a possible explanatory variable of whether the message was spam, so we would find it more interesting to compute the relative frequencies (proportions) for each email format.
 :::
@@ -358,7 +358,7 @@ p_pie + p_bar
 ```
 
 Pie charts can work well when the goal is to visualize a categorical variable with very few levels, and especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.).
-However they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
+However, they can be quite difficult to read when they are used to visualize a categorical variable with many levels.
 For example, the pie chart and the bar plot in Figure \@ref(fig:loan-grade-pie-chart) both represent the distribution of loan grades (A through G).
 In this case, it is far easier to compare the counts of each loan grade using the bar plot than the pie chart.
 
@@ -391,7 +391,7 @@ Just like with pie charts, they work best when the number of levels represented
 However, unlike pie charts, they can make it easier to compare proportions that represent non-simple fractions.
 Figure \@ref(fig:loan-waffle) displays two examples of waffle charts: one for the distribution of homeownership and the other for the distribution of loan status.
 
-```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, morgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.", fig.asp = 0.5, fig.width=8}
+```{r loan-waffle, fig.cap = "Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. Plot B: Waffle chart of loan status, with levels current, fully paid, in grace period, and late.", fig.asp = 0.5, fig.width=8}
 p_waffle_homeownership <- loans %>%
   count(homeownership) %>%
   ggplot(aes(fill = homeownership, values = n)) +
@@ -421,7 +421,7 @@ p_waffle_homeownership +
 ## Comparing numerical data across groups
 
 Some of the more interesting investigations can be considered by examining numerical data across groups.
-In this section we will expand on a few methods we've already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
+In this section we will expand on a few methods we have already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.
 
 We will revisit the `county` dataset and compare the median household income for counties that gained population from 2010 to 2017 versus counties that had no gain.
 While we might like to make a causal connection between income and population growth, remember that these are observational data and so such an interpretation would be, at best, half-baked.
@@ -639,7 +639,7 @@ Based on Figure \@ref(fig:countyIncomeRidgeMulti), what can you say about how me
 ### Summary
 
 Fluently working with categorical variables is an important skill for data analysts.
-In this chapter we've introduced different visualizations and numerical summaries applied to categorical variables.
+In this chapter we have introduced different visualizations and numerical summaries applied to categorical variables.
 The graphical visualizations are even more descriptive when two variables are presented simultaneously.
 We presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.
 
@@ -648,7 +648,7 @@ We presented bar plots, mosaic plots, pie charts, and estimations of conditional
 We introduced the following terms in the chapter.
 If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
 We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.
-However you should be able to easily spot them as **bolded text**.
+However, you should be able to easily spot them as **bolded text**.
 
 ```{r}
 make_terms_table(terms_chp_4)