diff --git a/_freeze/learn/statistics/survival-metrics-details/index/execute-results/html.json b/_freeze/learn/statistics/survival-metrics-details/index/execute-results/html.json new file mode 100644 index 00000000..983ddc8e --- /dev/null +++ b/_freeze/learn/statistics/survival-metrics-details/index/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "216c3cc46e265e38741b92572640f385", + "result": { + "markdown": "---\ntitle: \"Accounting for Censoring in Performance Metrics for Event Time Data\"\ncategories:\n - statistical analysis\n - survival analysis\ntype: learn-subsection\nweight: 9\ndescription: | \n Learn how tidymodels uses causal inference tools to measure performance of \n survival models.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You'll need the development versions of censored and parsnip. To install these, use\n\n```r\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\n```\n\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.\n\nMany dynamic metrics are similar to those used for binary classification models. Examples include the Brier score and ROC curves (see the [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article for details). The basic idea is that, for a given time $t$ for model evaluation, we try to encode the observed event time data into a binary \"has there been an event at time $t$?\" version. We also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\n\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. This article covers both those aspects in detail to complement the [main article](../survival-metrics/) on performance metrics for event time data.\n\nTo start, let's define the various types of times that will be mentioned:\n\n- Observed time: time recorded in the data\n- Event time: observed times for actual events\n- Evaluation time: the time, specified by the analyst, that the model is evaluated. \n\n## Example data\n\nAs an example, we'll simulate some data with the prodlim package, using the methods of [Bender _et al_ (2005)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Generating+survival+times+to+simulate+Cox+proportional+hazards+models.%22&btnG=). A training and a validation set are simulated. 
We'll also load the censored package so that we can fit a model to these time-to-event data:\n\n\n::: {.cell layout-align=\"center\" hash='cache/data_ff9ec6e0c8954db2cbaef260b25dc772'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(censored)\n#> Loading required package: survival\nlibrary(prodlim)\n\nset.seed(5882)\nsim_dat <- SimSurv(2000) %>%\n mutate(event_time = Surv(time, event)) %>%\n select(event_time, X1, X2)\n\nset.seed(2)\nsplit <- initial_split(sim_dat)\nsim_tr <- training(split)\nsim_val <- testing(split)\n```\n:::\n\n\nWe'll need a model to illustrate the code and concepts. Let's fit a bagged survival tree model to the training set:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bag-tree-fit_5a6c3b5b8b2141206912aa7a2fe8f384'}\n\n```{.r .cell-code}\nset.seed(17)\nbag_tree_fit <- \n bag_tree() %>% \n set_mode(\"censored regression\") %>% \n set_engine(\"rpart\") %>% \n fit(event_time ~ ., data = sim_tr)\nbag_tree_fit\n#> parsnip model object\n#> \n#> \n#> Bagging survival trees with 25 bootstrap replications \n#> \n#> Call: bagging.data.frame(formula = event_time ~ ., data = data)\n```\n:::\n\n\nUsing this model, we can make predictions of different types and `augment()` provides us with a version of the data augmented with the various predictions. Here we are interested in the predicted probability of survival at different evaluation time points. The largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21. \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred_71be0c59aa8304196d00fda8a33005d7'}\n\n```{.r .cell-code}\ntime_points <- seq(0, 21, by = 0.25)\n\nval_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)\nval_pred\n#> # A tibble: 500 × 5\n#> .pred .pred_time event_time X1 X2\n#> \n#> 1 6.66 4.831957 1 -0.630 \n#> 2 6.66 6.110031 1 -0.606 \n#> 3 7.47 6.597774+ 1 -1.03 \n#> 4 3.29 2.717484 1 0.811 \n#> 5 5.10 4.727042+ 1 -0.376 \n#> 6 4.99 8.699061 0 1.18 \n#> 7 7.23 10.818670 1 -0.851 \n#> 8 6.46 6.886378 0 0.493 \n#> 9 4.75 2.451893+ 1 0.0207\n#> 10 13.4 8.231911+ 0 -1.52 \n#> # ℹ 490 more rows\n```\n:::\n\n\nThe observed data are in the `event_time` column. The predicted survival probabilities are in the `.pred` column. This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column `.pred_survival`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred-dynamic_4711b5e193450241734314b22e92c8e2'}\n\n```{.r .cell-code}\nval_pred$.pred[[1]]\n#> # A tibble: 85 × 5\n#> .eval_time .pred_survival .weight_time .pred_censored .weight_censored\n#> \n#> 1 0 1 0 1 1 \n#> 2 0.25 1 0.250 0.999 1.00\n#> 3 0.5 0.999 0.500 0.996 1.00\n#> 4 0.75 0.992 0.750 0.993 1.01\n#> 5 1 0.988 1.00 0.991 1.01\n#> 6 1.25 0.980 1.25 0.987 1.01\n#> 7 1.5 0.972 1.50 0.981 1.02\n#> 8 1.75 0.959 1.75 0.971 1.03\n#> 9 2 0.938 2.00 0.966 1.04\n#> 10 2.25 0.925 2.25 0.959 1.04\n#> # ℹ 75 more rows\n```\n:::\n\n\nFirst, let's dive into how to convert the observed event time in `event_time` to a binary version. Then we will discuss the remaining columns as part of generating the required weights for the dynamic performance metrics.\n\n\n## Converting censored data to binary data\n\nTo assess model performance at evaluation time $t$, we turn the observed event time data into a binary “was there an event at time $t$?” version. 
For this, we follow the process described by [Graf _et al_ (1999)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Assessment+and+Comparison+of+Prognostic+Classification+Schemes+for+Survival+Data.%22&btnG=) where observations are categorized into three groups, at evaluation time $t$.\n\n - **Category 1 - Events**: Evaluation time is greater than or equal to the event time (\"it has already happened\").\n - **Category 2 - Non-events**: Evaluation time is less than the observed time, censored or not (\"nothing has happened yet\"). \n - **Category 3 - Ambiguous outcomes**: Evaluation time is greater than or equal to the observed censored time (\"we don't know if anything might have happened by now\"). \n\nWe can use binary versions of the observations in the first two categories to compute binary performance metrics, but the observations in the third category are not used directly in these calculations. (They do influence the calculation of the weights, see next section.) So our usable sample size changes with the evaluation time.\n\n\n::: {.cell layout-align=\"center\" hash='cache/plot-graf-categories_dfa05307f158925bfef894bc8b6ca0ae'}\n::: {.cell-output-display}\n![](figs/plot-graf-categories-1.svg){fig-align='center' width=864}\n:::\n:::\n\n\n## Censoring weights\n\nUnfortunately, this categorization scheme alone is not sufficient to compute metrics. Graf _et al_ took a page from the causal inference literature and added a propensity-type score based on the likelihood that each data point would be censored (regardless of the observed event status). This is not the probability than the original time-to-event data point is censored but rather the probability that at evaluation time, we have not observed an event (or a censoring) yet, i.e., that the data point falls into category 2.\n\nHow do we compute this probability? The standard approach is to compute a \"reverse Kaplan-Meier\" (RKM) curve. Ordinarily, the Kaplan-Meier (KM) curve models the probability of survival. For the reverse Kaplan-Meier curve, the meaning of the status indicator is flipped, i.e., the event of interest changes from \"death\" to \"censoring\". This should give us a reasonably reliable non-parametric model for estimating the probability of being censored at a given time. \n\nEvery time a censored regression model is created using tidymodels, the RKM is estimated on the same data used to fit the model and attached to the parsnip object. \n\nFor our simulated data, here is what the RKM curve looks like: \n\n\n::: {.cell layout-align=\"center\" hash='cache/RKM_51d587d88b6e7aeb39b5884836692a31'}\n::: {.cell-output-display}\n![](figs/RKM-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThe red rug on the bottom shows the training point event times and the blue values at the top represent the times for the censored training set observations. As (evaluation) time increases, we pass more and more observed time points, and the probability of being censored, i.e., the probability of an observation to fall into category 2, decreases.\n\nThe weights used in the calculation of the dynamic performance metrics are the inverse of these probabilities, hence the name \"inverse probability of censoring weights\" (IPCW). These weights should theoretically balance the exposure/effect/influence that the definitive observations have on performance calculations. \n\n### The finer details\n\nFirst, when do we evaluate the probability of censoring? 
There are different approaches used in the literature, and we follow what Graf _et al_ suggest (as it seems most appropriate):\n\n- If the evaluation time is less than the observed time (like in category 2), the evaluation time is used to predict the probability of censoring.\n- If the evaluation is greater than or equal to the event time (like in category 1), the event time is used to predict the probability of censoring.\n- If the evaluation time is greater than or equal to the observed censoring time, the observation falls into category 3 and is not used, i.e., also no weight is needed.\n\nWe call this time at which to predict the probability of censoring the _weight time_. Here's an example using the first data point in the validation set: \n\n\n::: {.cell layout-align=\"center\" hash='cache/eval-time-censored_65e017db80a7df2ba90f97b6ff4fc509'}\n\n```{.r .cell-code}\ndyn_val_pred <- \n val_pred %>% \n select(.pred, event_time) %>% \n add_rowindex() %>% \n unnest(.pred) \n\ndyn_val_pred %>% \n filter(.row == 1 & .eval_time %in% c(1, 2, 4, 5, 10)) %>% \n select(event_time, .eval_time, .weight_time, .pred_censored, .weight_censored)\n#> # A tibble: 5 × 5\n#> event_time .eval_time .weight_time .pred_censored .weight_censored\n#> \n#> 1 4.831957 1 1.00 0.991 1.01\n#> 2 4.831957 2 2.00 0.966 1.04\n#> 3 4.831957 4 4.00 0.848 1.18\n#> 4 4.831957 5 4.83 0.786 1.27\n#> 5 4.831957 10 4.83 0.786 1.27\n```\n:::\n\n\n\n\nThis observation was an event, observed at time 4.832 The column `.weight_time` captures at which time the probability of censoring was calculated. Up until that event time, the probability of being censored is computed at the evaluation time. After that, it is based on the event time. \n\nWe add a slight modification to the weight time: If our evaluation time is today, we don't have today's data yet. In tidymodels, we calculate the probability of censoring just before the requested weight time. We are basically subtracting a small numerical value from the weight time used in the RKM model. You'll only really see a difference if there is a bulk of censored observations at the original evaluation time.\n\nFinally, we use a simple RKM curve (i.e., no covariates or strata). This implies that there is non-informative censoring. Other applications of IPCW try to mitigate the effects of informative censoring. In the future, we may allow the censoring model to include covariates (as well as models beyond the RKM). \n\n## Illustration: Confusion matrix\n\nTo illustrate how these two tools for accounting for censoring are used in calculating dynamic performance metrics, we'll take a look here at the 2x2 confusion matrices at a few evaluation time points. 
More details on performance metrics for censored data are in the aforementioned [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article.\n\nFirst, let's turn the observed event time data and the predictions into their binary versions.\n\n\n::: {.cell layout-align=\"center\" hash='cache/binary-encoding_0b44b3bcd60032b1307944cf32ed87ba'}\n\n```{.r .cell-code}\ntime_as_binary_event <- function(surv, eval_time) {\n event_time <- .extract_surv_time(surv)\n status <- .extract_surv_status(surv)\n is_event_before_t <- event_time <= eval_time & status == 1\n \n # Three possible contributions to the statistic from Graf 1999\n # Censoring time before eval_time, no contribution (Graf category 3)\n binary_res <- rep(NA_character_, length(event_time))\n \n # A real event prior to eval_time (Graf category 1)\n binary_res <- if_else(is_event_before_t, \"event\", binary_res)\n \n # Observed time greater than eval_time (Graf category 2)\n binary_res <- if_else(event_time > eval_time, \"non-event\", binary_res)\n factor(binary_res, levels = c(\"event\", \"non-event\"))\n}\n\nbinary_encoding <- \n dyn_val_pred %>% \n mutate(\n obs_class = time_as_binary_event(event_time, .eval_time),\n pred_class = if_else(.pred_survival >= 1 / 2, \"non-event\", \"event\"),\n pred_class = factor(pred_class, levels = c(\"event\", \"non-event\")),\n )\n```\n:::\n\n\nRemember how observations falling into category 3 are removed from the analysis? This means we'll likely have fewer data points to evaluate as the evaluation time increases. This implies that the variation in the metrics will be considerable as evaluation time goes on. For our simulated training set: \n\n\n::: {.cell layout-align=\"center\" hash='cache/usable-data_f60360b9dc0bf6ae1daa097b9239db67'}\n\n```{.r .cell-code}\ndyn_val_pred %>% \n summarize(num_usable = sum(!is.na(.weight_censored)), .by = c(.eval_time)) %>% \n ggplot() + \n geom_step(aes(.eval_time, num_usable)) +\n labs(x = \"time\", y = \"number of usable observations\") +\n lims(y = c(0, nrow(sim_val))) +\n theme_bw()\n```\n\n::: {.cell-output-display}\n![](figs/usable-data-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nKeeping this in mind, let's look at what happens with the data points we can use. Let's start with an evaluation time of 1.00. To compute the confusion matrix for a classification problem, we would simply use:\n\n```r\nbinary_encoding %>% \n filter(.eval_time == 1.00) %>% \n conf_mat(truth = obs_class, estimate = pred_class)\n```\n\nFor censored regression problems, we need to additionally use the censoring weights so we'll include them via the `case_weights` argument:\n\n\n::: {.cell layout-align=\"center\" hash='cache/conf-mat-01_e7df6bde139b7cfe153201d59d66da45'}\n\n```{.r .cell-code}\nbinary_encoding %>%\n filter(.eval_time == 1.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 0.00000 0.00000\n#> non-event 14.11046 482.54963\n```\n:::\n\n\n\n\nThe values in the cells are the sum of the censoring weights, There are 14 actual events (out of 492 usable observations) before this evaluation time, so there are empty cells. Also note that the cell values are close to the actual counts. This early, the predicted censoring probabilities are very close to one so their inverse values are also. \n\nThis early, performance looks very good but that is mostly because there are few events.\n\nLet's shift to an evaluation time of 5.0. 
\n\n\n::: {.cell layout-align=\"center\" hash='cache/conf-mat-05_2c03063a21b13db700d3dbda9f6c526f'}\n\n```{.r .cell-code}\nbinary_encoding %>%\n filter(.eval_time == 5.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 112.98854 54.36531\n#> non-event 56.14133 252.41039\n```\n:::\n\n\n\n\nNow we have fewer total observations to consider (391 instead of 492 usable values) and more events (154 this time). Performance is fairly good; the sensitivity is 66.8% and the specificty is 82.3%.\n\nWhat happends when the evaluation time is 17?\n\n\n::: {.cell layout-align=\"center\" hash='cache/conf-mat-17_b81a8365793c0e05c0fed48ebeed0565'}\n\n```{.r .cell-code}\nbinary_encoding %>%\n filter(.eval_time == 17.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 429.9920 123.4458\n#> non-event 0.0000 0.0000\n```\n:::\n\n\n\n\nThe data are overwhelmingly events. Also, the censoring weights are larger now since the probability of being censored is very low. The mean censoring weight is 1.96.\n\nThis concludes the illustration of how to account for censoring when using a confusion matrix for performance assessment. There's more on dynamic performance metrics in the [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article.\n\n## Summary\n\nWhen accounting for censoring in dynamic performance metrics, the main points to remember are:\n\n* Event time data can be converted to a binary format.\n* Some data points cannot be used in the calculations. \n* To properly estimate statistical quantities, we weight the computations by the inverse of the probability of being censored. \n* tidymodels currently assumes non-informative censoring. 
\n\n\n## Session information {#session-info}\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> censored * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)\n#> prodlim * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/learn/statistics/survival-metrics/index/execute-results/html.json b/_freeze/learn/statistics/survival-metrics/index/execute-results/html.json new file mode 100644 index 00000000..bb54b360 --- /dev/null +++ b/_freeze/learn/statistics/survival-metrics/index/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "159d5113ce7d9d0ce713e685f247d52a", + "result": { + "markdown": "---\ntitle: \"Dynamic Performance Metrics for Event Time Data\"\ncategories:\n - statistical analysis\n - survival analysis\ntype: learn-subsection\nweight: 9\ndescription: | \n Let's discuss how to compute modern performance metrics for time-to-event models.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You'll need the development versions of censored and parsnip. To install these, use\n\n```r\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\n```\n\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics. \n\nMany dynamic metrics are similar to those used for binary classification models. 
The basic idea is that, for a given time $t$ for model evaluation, we try to encode the observed event time data into a binary \"has there been an event at time $t$?\" version. We can also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\n\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. For details on both these aspects, see the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article.\n\nTo start, let's define the various types of times that will be mentioned:\n\n- Observed time: time recorded in the data\n- Event time: observed times for actual events\n- Evaluation time: the time, specified by the analyst, that the model is evaluated. \n\n## Example data\n\nAs an example, we'll simulate some data with the prodlim package, using the methods of [Bender _et al_ (2005)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Generating+survival+times+to+simulate+Cox+proportional+hazards+models.%22&btnG=). A training and a validation set are simulated. We'll also load the censored package so that we can fit a model to these time-to-event data:\n\n\n::: {.cell layout-align=\"center\" hash='cache/data_ea43f6471536c88f5d0c74d179663a45'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(censored)\n#> Loading required package: survival\nlibrary(prodlim)\n\nset.seed(5882)\nsim_dat <- SimSurv(2000) %>%\n mutate(event_time = Surv(time, event)) %>%\n select(event_time, X1, X2)\n\nset.seed(2)\nsplit <- initial_split(sim_dat)\nsim_tr <- training(split)\nsim_val <- testing(split)\n\n## Resampling object\nsim_rs <- vfold_cv(sim_tr)\n```\n:::\n\n\nWe'll need a model to illustrate the code and concepts. Let's fit a bagged survival tree model to the training set:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bag-tree-fit_5a6c3b5b8b2141206912aa7a2fe8f384'}\n\n```{.r .cell-code}\nset.seed(17)\nbag_tree_fit <- \n bag_tree() %>% \n set_mode(\"censored regression\") %>% \n set_engine(\"rpart\") %>% \n fit(event_time ~ ., data = sim_tr)\nbag_tree_fit\n#> parsnip model object\n#> \n#> \n#> Bagging survival trees with 25 bootstrap replications \n#> \n#> Call: bagging.data.frame(formula = event_time ~ ., data = data)\n```\n:::\n\n\nUsing this model, we'll make predictions of different types. \n\n## Survival Probability Prediction\n\nThis censored regression model can make static predictions via the predicted event time using `predict(object, type = \"time\")`. It can also create dynamic predictions regarding the probability of survival for each data point at specific times. The syntax for this is \n\n```r\npredict(object, new_data, type = \"survival\", eval_time = numeric())\n```\n\nwhere `eval_time` is a vector of time points at which we want the corresponding survivor function estimates. Alternatively, we can use the `augment()` function to get both types of prediction and automatically attach them to the data. \n\nThe largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21. 
\n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred_71be0c59aa8304196d00fda8a33005d7'}\n\n```{.r .cell-code}\ntime_points <- seq(0, 21, by = 0.25)\n\nval_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)\nval_pred\n#> # A tibble: 500 × 5\n#> .pred .pred_time event_time X1 X2\n#> \n#> 1 6.66 4.831957 1 -0.630 \n#> 2 6.66 6.110031 1 -0.606 \n#> 3 7.47 6.597774+ 1 -1.03 \n#> 4 3.29 2.717484 1 0.811 \n#> 5 5.10 4.727042+ 1 -0.376 \n#> 6 4.99 8.699061 0 1.18 \n#> 7 7.23 10.818670 1 -0.851 \n#> 8 6.46 6.886378 0 0.493 \n#> 9 4.75 2.451893+ 1 0.0207\n#> 10 13.4 8.231911+ 0 -1.52 \n#> # ℹ 490 more rows\n```\n:::\n\n\nThe observed data are in the `event_time` column. The predicted survival probabilities are in the `.pred` column. This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column `.pred_survival`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred-dynamic_4711b5e193450241734314b22e92c8e2'}\n\n```{.r .cell-code}\nval_pred$.pred[[1]]\n#> # A tibble: 85 × 5\n#> .eval_time .pred_survival .weight_time .pred_censored .weight_censored\n#> \n#> 1 0 1 0 1 1 \n#> 2 0.25 1 0.250 0.999 1.00\n#> 3 0.5 0.999 0.500 0.996 1.00\n#> 4 0.75 0.992 0.750 0.993 1.01\n#> 5 1 0.988 1.00 0.991 1.01\n#> 6 1.25 0.980 1.25 0.987 1.01\n#> 7 1.5 0.972 1.50 0.981 1.02\n#> 8 1.75 0.959 1.75 0.971 1.03\n#> 9 2 0.938 2.00 0.966 1.04\n#> 10 2.25 0.925 2.25 0.959 1.04\n#> # ℹ 75 more rows\n```\n:::\n\n\nThe yardstick package currently has two dynamic metrics. Each is described below.\n\n## Brier Score\n\nThe Brier score is a metric that can be used with both classification and event-time models. For classification models, it computes the squared error between the observed outcome (encoded as 0/1) and the corresponding predicted probability for the class. \n\nA little math: suppose that the value $y_{ik}$ is a 0/1 indicator for whether the observed outcome $i$ corresponds to class $k$, and $\\hat{p}_{ik}$ is the estimated class probability. The classification score is then:\n\n$$\nBrier_{class} = \\frac{1}{N}\\sum_{i=1}^N\\sum_{k=1}^C (y_{ik} - \\hat{p}_{ik})^2\n$$\n\nFor survival models, we transform the event time data into a binary version $y_{it}$: is there an event at evaluation time $t$^[Again, see the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article for more on this.]. The survival function estimate $\\hat{p}_{it}$ is the probability that corresponds to non-events at time $t$. For example, if there has not been an event at the current evaluation time, our best model should estimate the survival probability to be near one. For observations that are events, the probability estimate is just one minus the survivor estimate. To account for censoring, we also weight each observation with $w_{it}$. The [time-dependent Brier score](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Assessment+and+Comparison+of+Prognostic+Classification+Schemes+for+Survival+Data.%22&btnG=) is: \n\n$$\nBrier_{surv}(t) = \\frac{1}{N}\\sum_{i=1}^N w_{it}\\left[\\underbrace{I(y_{it} = 0)(y_{it} - \\hat{p}_{it})^2}_\\text{non-events} + \\underbrace{I(y_{it} = 1)(y_{it} - (1 - \\hat{p}_{it}))^2}_\\text{events}\\right]\n$$\n\nFor this score, a perfect model has a score of zero, while an uninformative model would have a score of around 1/4. \n\nHow do we compute this using the yardstick package? 
The function [`brier_survival()`](https://yardstick.tidymodels.org/reference/brier_survival.html) follows the same convention as the other metric functions. The main arguments are:\n\n- `data`: the data frame with the predictions (structured as the output produced by `augment()`, as shown above).\n- `truth`: the name of the column with the `Surv` object.\n- `...`: the name of the column with the dynamic predictions. Within tidymodels, this column is always called `.pred`. In other words, `.pred` should be passed without an argument name. \n\nSince the evaluation times and the case weights are within the `.pred` column, there is no need to specify these. Here are the results of our validation set: \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred-brier_0391afe98b2bf9fa705d559462925d9f'}\n\n```{.r .cell-code}\nbrier_scores <-\n val_pred %>% \n brier_survival(truth = event_time, .pred)\nbrier_scores\n#> # A tibble: 85 × 4\n#> .metric .estimator .eval_time .estimate\n#> \n#> 1 brier_survival standard 0 0 \n#> 2 brier_survival standard 0.25 0 \n#> 3 brier_survival standard 0.5 0.00202\n#> 4 brier_survival standard 0.75 0.00796\n#> 5 brier_survival standard 1 0.0266 \n#> 6 brier_survival standard 1.25 0.0402 \n#> 7 brier_survival standard 1.5 0.0563 \n#> 8 brier_survival standard 1.75 0.0785 \n#> 9 brier_survival standard 2 0.0895 \n#> 10 brier_survival standard 2.25 0.0951 \n#> # ℹ 75 more rows\n```\n:::\n\n\nOver time:\n\n\n::: {.cell layout-align=\"center\" hash='cache/brier-scores_0ba183e42477511b0983e526f92c8327'}\n\n```{.r .cell-code}\nbrier_scores %>% \n ggplot(aes(.eval_time, .estimate)) + \n geom_hline(yintercept = 1 / 4, col = \"red\", lty = 3) +\n geom_line() +\n geom_point() + \n labs(x = \"time\", y = \"Brier score\")\n```\n\n::: {.cell-output-display}\n![](figs/brier-scores-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThere is also an _integrated_ Brier score. This required evaluation times as inputs but instead of returning each result, it takes the area under the plot shown above. The syntax is the same but the result has a single row: \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred-brier-int_0d396825f603e072c6e54819480a4647'}\n\n```{.r .cell-code}\nval_pred %>% brier_survival_integrated(truth = event_time, .pred)\n#> # A tibble: 1 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 brier_survival_integrated standard 0.113\n```\n:::\n\n\nAgain, smaller values are better. \n\n## Receiver Operating Characteristic (ROC) Curves\n\nWhen we not only turn the event time data into a binary representation but also the predicted probabilities, we are in well-chartered classification metrics territory. Sensitivity and specificity are common quantities to compute, we do so here in their weighted version to account for censoring:\n\n- Sensitivity: How well do we predict the events? This is analogous to the true positive rate.\n- Specificity: How well do we predict the non-events? One minus specificity is the false positive rate. \n\nThese depend on the threshold used to turn predicted probabilities into predicted events/non-events. Let's take a look at the distribution of the survival probabilities for our example data at an evaluation time of 5.00. The distributions are separated by the observed class and weighted by the censoring weights. 
Details of both aspects are the same as for the Brier score and can be found in the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article.\n\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/surv-hist-05_54b7c12287ff4cc6a372c72aeeeb32f8'}\n::: {.cell-output-display}\n![](figs/surv-hist-05-1.svg){fig-align='center' width=70%}\n:::\n:::\n\n\n\n\nMore probability values are to the right of the 50% cutoff for the true non-events. Conversely, true events tend to have smaller probabilities. Using this cutoff, the sensitivity would be 66.8% and the specificity would be 82.3%. There are other possible cutoffs for the survival probabilities. Maybe one of these would have better statistics. \n\nROC curves were designed as a general method that, given a collection of continuous predictions, determines an effective threshold such that values above the threshold indicate a specific event. For our purposes, the ROC curve will compute the sensitivity and specificity for _all possible_ probability thresholds. It then plots the true positive rate versus the false positive rate. Generally, we use the area under the ROC curve to quantify it with a single number. Values near one indicate a perfect model, while values near 1/2 occur with non-informative models. \n\n[Blanche _et al_ (2013)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Review+and+comparison+of+ROC+curve+estimators+for+a+time-dependent+outcome+with+marker-dependent+censoring%22&btnG=) gives a good overview of ROC curves for survival analysis and their Section 4.3 is most relevant here. \n\nFor our example at evaluation time $t = 5.00$, the ROC curve is: \n\n\n::: {.cell layout-align=\"center\" hash='cache/roc-5_2af790d64c21bcecd736afc35cf90091'}\n::: {.cell-output-display}\n![](figs/roc-5-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThe area under this curve is 0.807. \n\nSince this is a dynamic metric, we compute the AUC for each evaluation time. The syntax is very similar to the Brier code shown above: \n\n\n::: {.cell layout-align=\"center\" hash='cache/val-pred-roc_a13af258a54f3b92eb17af9dc3b5140e'}\n\n```{.r .cell-code}\nroc_scores <-\n val_pred %>% \n roc_auc_survival(truth = event_time, .pred)\nroc_scores\n#> # A tibble: 85 × 4\n#> .metric .estimator .eval_time .estimate\n#> \n#> 1 roc_auc_survival standard 0 0.5 \n#> 2 roc_auc_survival standard 0.25 0.5 \n#> 3 roc_auc_survival standard 0.5 0.869\n#> 4 roc_auc_survival standard 0.75 0.852\n#> 5 roc_auc_survival standard 1 0.734\n#> 6 roc_auc_survival standard 1.25 0.768\n#> 7 roc_auc_survival standard 1.5 0.792\n#> 8 roc_auc_survival standard 1.75 0.777\n#> 9 roc_auc_survival standard 2 0.771\n#> 10 roc_auc_survival standard 2.25 0.778\n#> # ℹ 75 more rows\n```\n:::\n\n\nOver time:\n\n\n::: {.cell layout-align=\"center\" hash='cache/roc-scores_2854d1506efb429887eefb1bf75e89be'}\n\n```{.r .cell-code}\nroc_scores %>% \n ggplot(aes(.eval_time, .estimate)) + \n geom_hline(yintercept = 1 / 2, col = \"red\", lty = 3) +\n geom_line() +\n geom_point() + \n labs(x = \"time\", y = \"ROC AUC\")\n```\n\n::: {.cell-output-display}\n![](figs/roc-scores-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThe initial variation is due to so few events at the early stages of analysis. \n\nThe ROC measures the separation between classes and the Brier score focuses more on accurate and well-calibrated predictions. It should not be surprising that each metric's results over time differ. 
\n\n## Tuning these metrics\n\nMany of the event time models available in tidymodels have tuning parameters. The `tune_*()` functions and `fit_resamples()` have an `eval_time` argument used to pass the evaluation times. The statistics are computed for these time points using out-of-sample data. \n\nIn some cases, such as [iterative search](https://www.tmwr.org/iterative-search.html) or [racing methods](https://www.tmwr.org/grid-search.html#racing), the functions need a single value to optimize. If a dynamic metric is used to guide the optimization, _the first evaluation time given by the user_ will be used. \n\nFor example, if a model for these data was being optimized, and we wanted a time of 5.00 to guide the process, we would need to use that value of 5.00 as the first element `time_points`, the vector given to the `eval_time` argument in our example above.\n\n\n## Summary\n\ntidymodels has two time-dependent metrics for characterizing the performance of event-time models:\n\n* The Brier score measures the distance between the observed class result and the predicted probabilities. \n* ROC curves try to measure the separation between the two classes based on the survival probabilities. \n\n\n## Session information {#session-info}\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> censored * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)\n#> prodlim * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/docs/learn/index.html b/docs/learn/index.html index 81d23658..9b8a0fdc 100644 --- a/docs/learn/index.html +++ b/docs/learn/index.html @@ -355,7 +355,7 @@ +
diff --git a/docs/learn/statistics/survival-metrics-details/figs/RKM-1.svg b/docs/learn/statistics/survival-metrics-details/figs/RKM-1.svg new file mode 100644
[SVG figure added: reverse Kaplan-Meier curve; x-axis "time", y-axis "probability of being censored"]
diff --git a/docs/learn/statistics/survival-metrics-details/figs/plot-graf-categories-1.svg b/docs/learn/statistics/survival-metrics-details/figs/plot-graf-categories-1.svg new file mode 100644
[SVG figure added: Graf category illustration; x-axis "Time", y-axis "Sample"; legend "Status" with Observation: censored, Observation: event, Evaluation: non-event, Evaluation: event]
diff --git a/docs/learn/statistics/survival-metrics-details/figs/usable-data-1.svg b/docs/learn/statistics/survival-metrics-details/figs/usable-data-1.svg new file mode 100644
[SVG figure added: step plot; x-axis "time", y-axis "number of usable observations"]
diff --git a/docs/learn/statistics/survival-metrics-details/index.html b/docs/learn/statistics/survival-metrics-details/index.html new file mode 100644
tidymodels - Accounting for Censoring in Performance Metrics for Event Time Data

Accounting for Censoring in Performance Metrics for Event Time Data

+
+
statistical analysis
+
survival analysis
+
+
+ +
+
+

Learn how tidymodels uses causal inference tools to measure performance of survival models.

+
+
+ + +
+ + + + +
+ + +
+ +
+

Introduction

+

To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You’ll need the development versions of censored and parsnip. To install these, use

+
#install.packages("pak")
+
+pak::pak(c("tidymodels/censored", "tidymodels/parsnip"))
+

One trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.

+

Many dynamic metrics are similar to those used for binary classification models. Examples include the Brier score and ROC curves (see the Dynamic Performance Metrics for Event Time Data article for details). The basic idea is that, for a given time \(t\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \(t\)?” version. We also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.

+

Censoring plays into the details of the conversion and is additionally captured in the form of weights. This article covers both those aspects in detail to complement the main article on performance metrics for event time data.

+

To start, let’s define the various types of times that will be mentioned:

+
    +
  • Observed time: time recorded in the data
  • +
  • Event time: observed times for actual events
  • +
  • Evaluation time: the time, specified by the analyst, at which the model is evaluated.
  • +
+
+
+

Example data

+

As an example, we’ll simulate some data with the prodlim package, using the methods of Bender et al (2005). A training and a validation set are simulated. We’ll also load the censored package so that we can fit a model to these time-to-event data:

+
+
library(tidymodels)
+library(censored)
+#> Loading required package: survival
+library(prodlim)
+
+set.seed(5882)
+sim_dat <- SimSurv(2000) %>%
+  mutate(event_time = Surv(time, event)) %>%
+  select(event_time, X1, X2)
+
+set.seed(2)
+split   <- initial_split(sim_dat)
+sim_tr  <- training(split)
+sim_val <- testing(split)
+
+

We’ll need a model to illustrate the code and concepts. Let’s fit a bagged survival tree model to the training set:

+
+
set.seed(17)
+bag_tree_fit <- 
+  bag_tree() %>% 
+  set_mode("censored regression") %>% 
+  set_engine("rpart") %>% 
+  fit(event_time ~ ., data = sim_tr)
+bag_tree_fit
+#> parsnip model object
+#> 
+#> 
+#> Bagging survival trees with 25 bootstrap replications 
+#> 
+#> Call: bagging.data.frame(formula = event_time ~ ., data = data)
+
+

Using this model, we can make predictions of different types and augment() provides us with a version of the data augmented with the various predictions. Here we are interested in the predicted probability of survival at different evaluation time points. The largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21.

+
+
time_points <- seq(0, 21, by = 0.25)
+
+val_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)
+val_pred
+#> # A tibble: 500 × 5
+#>    .pred             .pred_time event_time    X1      X2
+#>    <list>                 <dbl>     <Surv> <dbl>   <dbl>
+#>  1 <tibble [85 × 5]>       6.66  4.831957      1 -0.630 
+#>  2 <tibble [85 × 5]>       6.66  6.110031      1 -0.606 
+#>  3 <tibble [85 × 5]>       7.47  6.597774+     1 -1.03  
+#>  4 <tibble [85 × 5]>       3.29  2.717484      1  0.811 
+#>  5 <tibble [85 × 5]>       5.10  4.727042+     1 -0.376 
+#>  6 <tibble [85 × 5]>       4.99  8.699061      0  1.18  
+#>  7 <tibble [85 × 5]>       7.23 10.818670      1 -0.851 
+#>  8 <tibble [85 × 5]>       6.46  6.886378      0  0.493 
+#>  9 <tibble [85 × 5]>       4.75  2.451893+     1  0.0207
+#> 10 <tibble [85 × 5]>      13.4   8.231911+     0 -1.52  
+#> # ℹ 490 more rows
+
+

The observed data are in the event_time column. The predicted survival probabilities are in the .pred column. This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column .pred_survival.

+
+
val_pred$.pred[[1]]
+#> # A tibble: 85 × 5
+#>    .eval_time .pred_survival .weight_time .pred_censored .weight_censored
+#>         <dbl>          <dbl>        <dbl>          <dbl>            <dbl>
+#>  1       0             1            0              1                 1   
+#>  2       0.25          1            0.250          0.999             1.00
+#>  3       0.5           0.999        0.500          0.996             1.00
+#>  4       0.75          0.992        0.750          0.993             1.01
+#>  5       1             0.988        1.00           0.991             1.01
+#>  6       1.25          0.980        1.25           0.987             1.01
+#>  7       1.5           0.972        1.50           0.981             1.02
+#>  8       1.75          0.959        1.75           0.971             1.03
+#>  9       2             0.938        2.00           0.966             1.04
+#> 10       2.25          0.925        2.25           0.959             1.04
+#> # ℹ 75 more rows
+
+

First, let’s dive into how to convert the observed event time in event_time to a binary version. Then we will discuss the remaining columns as part of generating the required weights for the dynamic performance metrics.

+
+
+

Converting censored data to binary data

+

To assess model performance at evaluation time \(t\), we turn the observed event time data into a binary “was there an event at time \(t\)?” version. For this, we follow the process described by Graf et al (1999), where observations are categorized into three groups at evaluation time \(t\).

+
    +
  • Category 1 - Events: Evaluation time is greater than or equal to the event time (“it has already happened”).
  • +
  • Category 2 - Non-events: Evaluation time is less than the observed time, censored or not (“nothing has happened yet”).
  • +
  • Category 3 - Ambiguous outcomes: Evaluation time is greater than or equal to the observed censored time (“we don’t know if anything might have happened by now”).
  • +
+

We can use binary versions of the observations in the first two categories to compute binary performance metrics, but the observations in the third category are not used directly in these calculations. (They do influence the calculation of the weights, see next section.) So our usable sample size changes with the evaluation time.

+
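To make the three categories concrete, here is a small sketch that tabulates them for the validation set at a single (arbitrarily chosen) evaluation time. It uses the same parsnip helpers that appear in the conversion function later in this article; the category labels are ours, not tidymodels output.

# Sketch only: tabulate the Graf categories at one evaluation time (t = 5, chosen arbitrarily)
graf_time <- 5
sim_val %>% 
  mutate(
    obs_time = .extract_surv_time(event_time),
    status   = .extract_surv_status(event_time),
    category = case_when(
      obs_time <= graf_time & status == 1 ~ "1: event",
      obs_time >  graf_time               ~ "2: non-event",
      TRUE                                ~ "3: ambiguous"
    )
  ) %>% 
  count(category)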
+
+
+
+

+
+
+
+
+
+
+

Censoring weights

+

Unfortunately, this categorization scheme alone is not sufficient to compute metrics. Graf et al took a page from the causal inference literature and added a propensity-type score based on the likelihood that each data point would be censored (regardless of the observed event status). This is not the probability that the original time-to-event data point is censored but rather the probability that, at the evaluation time, we have not yet observed an event (or a censoring), i.e., that the data point falls into category 2.

+

How do we compute this probability? The standard approach is to compute a “reverse Kaplan-Meier” (RKM) curve. Ordinarily, the Kaplan-Meier (KM) curve models the probability of survival. For the reverse Kaplan-Meier curve, the meaning of the status indicator is flipped, i.e., the event of interest changes from “death” to “censoring”. This should give us a reasonably reliable non-parametric model for estimating the probability of being censored at a given time.

+
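Conceptually, this is the ordinary Kaplan-Meier machinery applied to the flipped status indicator. Here is a minimal sketch of that idea with the survival package; it illustrates the flip but is not necessarily the exact estimator tidymodels uses internally.

# Sketch: a "reverse" Kaplan-Meier fit where censoring is treated as the event
rkm_dat <- sim_tr %>% 
  mutate(
    obs_time = .extract_surv_time(event_time),
    censored = 1 - .extract_surv_status(event_time)
  )
rkm_fit <- survfit(Surv(obs_time, censored) ~ 1, data = rkm_dat)
# summary(rkm_fit) gives the estimated probability of remaining uncensored over time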

Every time a censored regression model is created using tidymodels, the RKM is estimated on the same data used to fit the model and attached to the parsnip object.

+

For our simulated data, here is what the RKM curve looks like:

+
+
+
+
+

+
+
+
+
+

The red rug on the bottom shows the event times of the training points and the blue marks at the top represent the times of the censored training set observations. As (evaluation) time increases, we pass more and more observed time points, and the probability of being censored, i.e., the probability of an observation falling into category 2, decreases.

+

The weights used in the calculation of the dynamic performance metrics are the inverse of these probabilities, hence the name “inverse probability of censoring weights” (IPCW). These weights should theoretically balance the exposure/effect/influence that the definitive observations have on performance calculations.

+
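For any single observation, the censoring weight in the nested predictions should simply be the inverse of the corresponding censoring probability. A quick check on the first validation-set observation (the new column name is ours):

# Sketch: the weights are the inverse of the predicted censoring probabilities
val_pred$.pred[[1]] %>% 
  mutate(manual_ipcw = 1 / .pred_censored) %>% 
  select(.eval_time, .pred_censored, .weight_censored, manual_ipcw)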
+

The finer details

+

First, when do we evaluate the probability of censoring? There are different approaches used in the literature, and we follow what Graf et al suggest (as it seems most appropriate):

+
    +
  • If the evaluation time is less than the observed time (like in category 2), the evaluation time is used to predict the probability of censoring.
  • +
  • If the evaluation time is greater than or equal to the event time (like in category 1), the event time is used to predict the probability of censoring.
  • +
  • If the evaluation time is greater than or equal to the observed censoring time, the observation falls into category 3 and is not used, so no weight is needed (see the short sketch after this list).
  • +
+
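Written out as a small sketch (a hypothetical helper, not the tidymodels implementation), the three rules amount to:

# Sketch of the weight-time rule: obs_time and status are the observed time and
# event indicator for an observation, eval_time is the evaluation time
weight_time_for <- function(obs_time, status, eval_time) {
  case_when(
    eval_time < obs_time ~ eval_time,   # category 2: use the evaluation time
    status == 1          ~ obs_time,    # category 1: use the event time
    TRUE                 ~ NA_real_     # category 3: unused, no weight needed
  )
}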

We call this time at which to predict the probability of censoring the weight time. Here’s an example using the first data point in the validation set:

+
+
dyn_val_pred <- 
+  val_pred %>% 
+  select(.pred, event_time) %>% 
+  add_rowindex() %>% 
+  unnest(.pred) 
+
+dyn_val_pred %>% 
+  filter(.row == 1 & .eval_time %in% c(1, 2, 4, 5, 10)) %>% 
+  select(event_time, .eval_time, .weight_time, .pred_censored, .weight_censored)
+#> # A tibble: 5 × 5
+#>   event_time .eval_time .weight_time .pred_censored .weight_censored
+#>       <Surv>      <dbl>        <dbl>          <dbl>            <dbl>
+#> 1   4.831957          1         1.00          0.991             1.01
+#> 2   4.831957          2         2.00          0.966             1.04
+#> 3   4.831957          4         4.00          0.848             1.18
+#> 4   4.831957          5         4.83          0.786             1.27
+#> 5   4.831957         10         4.83          0.786             1.27
+
+

This observation was an event, observed at time 4.832. The column .weight_time captures the time at which the probability of censoring was calculated. Up until that event time, the probability of being censored is computed at the evaluation time. After that, it is based on the event time.

+
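To see this capping behavior across the full set of evaluation times, we could plot the weight time against the evaluation time for this observation:

# Sketch: for observation 1, the weight time follows the evaluation time and then
# stays at the event time (about 4.83) once that time has passed
dyn_val_pred %>% 
  filter(.row == 1) %>% 
  ggplot(aes(.eval_time, .weight_time)) + 
  geom_step() + 
  labs(x = "evaluation time", y = "weight time")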

We add a slight modification to the weight time: If our evaluation time is today, we don’t have today’s data yet. In tidymodels, we calculate the probability of censoring just before the requested weight time. We are basically subtracting a small numerical value from the weight time used in the RKM model. You’ll only really see a difference if there is a bulk of censored observations at the original evaluation time.

+

Finally, we use a simple RKM curve (i.e., no covariates or strata). This implies that there is non-informative censoring. Other applications of IPCW try to mitigate the effects of informative censoring. In the future, we may allow the censoring model to include covariates (as well as models beyond the RKM).

+
+
+
+

Illustration: Confusion matrix

+

To illustrate how these two tools for accounting for censoring are used in calculating dynamic performance metrics, we’ll take a look here at the 2x2 confusion matrices at a few evaluation time points. More details on performance metrics for censored data are in the aforementioned Dynamic Performance Metrics for Event Time Data article.

+

First, let’s turn the observed event time data and the predictions into their binary versions.

+
+
time_as_binary_event <- function(surv, eval_time) {
+  event_time <- .extract_surv_time(surv)
+  status <- .extract_surv_status(surv)
+  is_event_before_t <- event_time <= eval_time & status == 1
+  
+  # Three possible contributions to the statistic from Graf 1999
+  # Censoring time before eval_time, no contribution (Graf category 3)
+  binary_res <- rep(NA_character_, length(event_time))
+  
+  # A real event prior to eval_time (Graf category 1)
+  binary_res <- if_else(is_event_before_t, "event", binary_res)
+  
+  # Observed time greater than eval_time (Graf category 2)
+  binary_res <- if_else(event_time > eval_time, "non-event", binary_res)
+  factor(binary_res, levels = c("event", "non-event"))
+}
+
+binary_encoding <- 
+  dyn_val_pred %>% 
+  mutate(
+    obs_class = time_as_binary_event(event_time, .eval_time),
+    pred_class = if_else(.pred_survival >= 1 / 2, "non-event", "event"),
+    pred_class = factor(pred_class, levels = c("event", "non-event")),
+  )
+
+

Remember how observations falling into category 3 are removed from the analysis? This means we'll likely have fewer data points to evaluate as the evaluation time increases, so the variation in the metrics will grow as evaluation time goes on. For our simulated validation set:

+
+
dyn_val_pred %>% 
+  summarize(num_usable = sum(!is.na(.weight_censored)), .by = c(.eval_time)) %>% 
+  ggplot() + 
+  geom_step(aes(.eval_time, num_usable)) +
+  labs(x = "time", y = "number of usable observations") +
+  lims(y = c(0, nrow(sim_val))) +
+  theme_bw()
+

Keeping this in mind, let’s look at what happens with the data points we can use. Let’s start with an evaluation time of 1.00. To compute the confusion matrix for a classification problem, we would simply use:

+
binary_encoding %>% 
+  filter(.eval_time == 1.00) %>% 
+  conf_mat(truth = obs_class, estimate = pred_class)
+

For censored regression problems, we need to additionally use the censoring weights so we’ll include them via the case_weights argument:

+
+
binary_encoding %>%
+  filter(.eval_time == 1.00) %>%
+  conf_mat(truth = obs_class,
+           estimate = pred_class,
+           case_weights = .weight_censored)
+#>            Truth
+#> Prediction      event non-event
+#>   event       0.00000   0.00000
+#>   non-event  14.11046 482.54963
+
+

The values in the cells are the sums of the censoring weights. There are only 14 actual events (out of 492 usable observations) before this evaluation time, and no observations are predicted to be events this early, so the first row of the table is empty. Also note that the cell values are close to the actual counts: this early, the predicted censoring probabilities are very close to one, so their inverses are as well.
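One way to see that the cells really are sums of weights is to tabulate the usable observations by the two class columns with the weights supplied to wt; assuming conf_mat() simply sums the case weights per cell, this reproduces the table above:

binary_encoding %>%
  filter(.eval_time == 1.00, !is.na(obs_class)) %>%
  count(obs_class, pred_class, wt = .weight_censored, .drop = FALSE)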

+

This early, performance looks very good but that is mostly because there are few events.

+

Let’s shift to an evaluation time of 5.0.

+
+
binary_encoding %>%
+  filter(.eval_time == 5.00) %>%
+  conf_mat(truth = obs_class,
+           estimate = pred_class,
+           case_weights = .weight_censored)
+#>            Truth
+#> Prediction      event non-event
+#>   event     112.98854  54.36531
+#>   non-event  56.14133 252.41039
+
+

Now we have fewer total observations to consider (391 usable values instead of 492) and more events (154 this time). Performance is fairly good; the sensitivity is 66.8% and the specificity is 82.3%.
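For reference, those two values come straight from the weighted cell totals above:

# weighted sensitivity: correctly predicted events / all true events
112.98854 / (112.98854 + 56.14133)    # ~0.668
# weighted specificity: correctly predicted non-events / all true non-events
252.41039 / (252.41039 + 54.36531)    # ~0.823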

+

What happens when the evaluation time is 17?

+
+
binary_encoding %>%
+  filter(.eval_time == 17.00) %>%
+  conf_mat(truth = obs_class,
+           estimate = pred_class,
+           case_weights = .weight_censored)
+#>            Truth
+#> Prediction     event non-event
+#>   event     429.9920  123.4458
+#>   non-event   0.0000    0.0000
+
+

The data are overwhelmingly events. Also, the censoring weights are larger now since the predicted censoring probabilities that we invert are much smaller at this point. The mean censoring weight is 1.96.
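Assuming that mean is taken over the usable observations at this evaluation time, it can be checked directly from the weights:

binary_encoding %>%
  filter(.eval_time == 17.00, !is.na(.weight_censored)) %>%
  summarize(mean_weight = mean(.weight_censored))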

+

This concludes the illustration of how to account for censoring when using a confusion matrix for performance assessment. There’s more on dynamic performance metrics in the Dynamic Performance Metrics for Event Time Data article.

+
+
+

Summary

+

When accounting for censoring in dynamic performance metrics, the main points to remember are:

+
  • Event time data can be converted to a binary format.
  • Some data points cannot be used in the calculations.
  • To properly estimate statistical quantities, we weight the computations by the inverse of the probability of being censored.
  • tidymodels currently assumes non-informative censoring.
+
+

Session information

+
+
#> ─ Session info ─────────────────────────────────────────────────────
+#>  setting  value
+#>  version  R version 4.3.0 (2023-04-21)
+#>  os       macOS Ventura 13.4
+#>  system   aarch64, darwin20
+#>  ui       X11
+#>  language (EN)
+#>  collate  en_US.UTF-8
+#>  ctype    en_US.UTF-8
+#>  tz       America/Los_Angeles
+#>  date     2023-07-02
+#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
+#> 
+#> ─ Packages ─────────────────────────────────────────────────────────
+#>  package    * version    date (UTC) lib source
+#>  broom      * 1.0.4      2023-03-11 [1] CRAN (R 4.3.0)
+#>  censored   * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)
+#>  dials      * 1.2.0      2023-04-03 [1] CRAN (R 4.3.0)
+#>  dplyr      * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
+#>  ggplot2    * 3.4.2      2023-04-03 [1] CRAN (R 4.3.0)
+#>  infer      * 1.0.4      2022-12-02 [1] CRAN (R 4.3.0)
+#>  parsnip    * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)
+#>  prodlim    * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)
+#>  purrr      * 1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
+#>  recipes    * 1.0.6      2023-04-25 [1] CRAN (R 4.3.0)
+#>  rlang        1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
+#>  rsample    * 1.1.1      2022-12-07 [1] CRAN (R 4.3.0)
+#>  tibble     * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
+#>  tidymodels * 1.1.0      2023-05-01 [1] CRAN (R 4.3.0)
+#>  tune       * 1.1.1      2023-04-11 [1] CRAN (R 4.3.0)
+#>  workflows  * 1.1.3      2023-02-22 [1] CRAN (R 4.3.0)
+#>  yardstick  * 1.2.0      2023-04-21 [1] CRAN (R 4.3.0)
+#> 
+#>  [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library
+#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
+#> 
+#> ────────────────────────────────────────────────────────────────────
+
\ No newline at end of file
diff --git a/docs/learn/statistics/survival-metrics/figs/brier-scores-1.svg b/docs/learn/statistics/survival-metrics/figs/brier-scores-1.svg
new file mode 100644
index 00000000..1693b1e4
(SVG figure: Brier score versus time)
diff --git a/docs/learn/statistics/survival-metrics/figs/roc-5-1.svg b/docs/learn/statistics/survival-metrics/figs/roc-5-1.svg
new file mode 100644
index 00000000..c85ee4e0
(SVG figure: ROC curve, sensitivity versus 1 - specificity)
diff --git a/docs/learn/statistics/survival-metrics/figs/roc-scores-1.svg b/docs/learn/statistics/survival-metrics/figs/roc-scores-1.svg
new file mode 100644
index 00000000..06015ad7
(SVG figure: ROC AUC versus time)
diff --git a/docs/learn/statistics/survival-metrics/figs/surv-hist-05-1.svg b/docs/learn/statistics/survival-metrics/figs/surv-hist-05-1.svg
new file mode 100644
index 00000000..5aa4e8ab
(SVG figure: weighted histograms of survival probability for events and non-events)
diff --git a/docs/learn/statistics/survival-metrics/index.html b/docs/learn/statistics/survival-metrics/index.html
new file mode 100644
index 00000000..a8e5f78e
--- /dev/null
+++ b/docs/learn/statistics/survival-metrics/index.html
@@ -0,0 +1,788 @@
tidymodels - Dynamic Performance Metrics for Event Time Data

Dynamic Performance Metrics for Event Time Data

+
+
statistical analysis
+
survival analysis
+
+
+ +
+
+

Let’s discuss how to compute modern performance metrics for time-to-event models.

+
+
+

Introduction

+

To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You’ll need the development versions of censored and parsnip. To install these, use

+
#install.packages("pak")
+
+pak::pak(c("tidymodels/censored", "tidymodels/parsnip"))
+

One trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.

+

Many dynamic metrics are similar to those used for binary classification models. The basic idea is that, for a given time \(t\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \(t\)?” version. We can also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.

+

Censoring plays into the details of the conversion and is additionally captured in the form of weights. For details on both these aspects, see the Accounting for Censoring in Performance Metrics for Event Time Data article.

+

To start, let’s define the various types of times that will be mentioned:

+
  • Observed time: time recorded in the data
  • Event time: observed times for actual events
  • Evaluation time: the time, specified by the analyst, that the model is evaluated.
+
+

Example data

+

As an example, we’ll simulate some data with the prodlim package, using the methods of Bender et al (2005). A training and a validation set are simulated. We’ll also load the censored package so that we can fit a model to these time-to-event data:

+
+
library(tidymodels)
+library(censored)
+#> Loading required package: survival
+library(prodlim)
+
+set.seed(5882)
+sim_dat <- SimSurv(2000) %>%
+  mutate(event_time = Surv(time, event)) %>%
+  select(event_time, X1, X2)
+
+set.seed(2)
+split   <- initial_split(sim_dat)
+sim_tr  <- training(split)
+sim_val <- testing(split)
+
+## Resampling object
+sim_rs <- vfold_cv(sim_tr)
+
+

We’ll need a model to illustrate the code and concepts. Let’s fit a bagged survival tree model to the training set:

+
+
set.seed(17)
+bag_tree_fit <- 
+  bag_tree() %>% 
+  set_mode("censored regression") %>% 
+  set_engine("rpart") %>% 
+  fit(event_time ~ ., data = sim_tr)
+bag_tree_fit
+#> parsnip model object
+#> 
+#> 
+#> Bagging survival trees with 25 bootstrap replications 
+#> 
+#> Call: bagging.data.frame(formula = event_time ~ ., data = data)
+
+

Using this model, we’ll make predictions of different types.

+
+
+

Survival Probability Prediction

+

This censored regression model can make static predictions via the predicted event time using predict(object, type = "time"). It can also create dynamic predictions regarding the probability of survival for each data point at specific times. The syntax for this is

+
predict(object, new_data, type = "survival", eval_time = numeric())
+

where eval_time is a vector of time points at which we want the corresponding survivor function estimates. Alternatively, we can use the augment() function to get both types of prediction and automatically attach them to the data.

+

The largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21.

+
+
time_points <- seq(0, 21, by = 0.25)
+
+val_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)
+val_pred
+#> # A tibble: 500 × 5
+#>    .pred             .pred_time event_time    X1      X2
+#>    <list>                 <dbl>     <Surv> <dbl>   <dbl>
+#>  1 <tibble [85 × 5]>       6.66  4.831957      1 -0.630 
+#>  2 <tibble [85 × 5]>       6.66  6.110031      1 -0.606 
+#>  3 <tibble [85 × 5]>       7.47  6.597774+     1 -1.03  
+#>  4 <tibble [85 × 5]>       3.29  2.717484      1  0.811 
+#>  5 <tibble [85 × 5]>       5.10  4.727042+     1 -0.376 
+#>  6 <tibble [85 × 5]>       4.99  8.699061      0  1.18  
+#>  7 <tibble [85 × 5]>       7.23 10.818670      1 -0.851 
+#>  8 <tibble [85 × 5]>       6.46  6.886378      0  0.493 
+#>  9 <tibble [85 × 5]>       4.75  2.451893+     1  0.0207
+#> 10 <tibble [85 × 5]>      13.4   8.231911+     0 -1.52  
+#> # ℹ 490 more rows
+
+

The observed data are in the event_time column. The predicted survival probabilities are in the .pred column. This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column .pred_survival.

+
+
val_pred$.pred[[1]]
+#> # A tibble: 85 × 5
+#>    .eval_time .pred_survival .weight_time .pred_censored .weight_censored
+#>         <dbl>          <dbl>        <dbl>          <dbl>            <dbl>
+#>  1       0             1            0              1                 1   
+#>  2       0.25          1            0.250          0.999             1.00
+#>  3       0.5           0.999        0.500          0.996             1.00
+#>  4       0.75          0.992        0.750          0.993             1.01
+#>  5       1             0.988        1.00           0.991             1.01
+#>  6       1.25          0.980        1.25           0.987             1.01
+#>  7       1.5           0.972        1.50           0.981             1.02
+#>  8       1.75          0.959        1.75           0.971             1.03
+#>  9       2             0.938        2.00           0.966             1.04
+#> 10       2.25          0.925        2.25           0.959             1.04
+#> # ℹ 75 more rows
+
+

The yardstick package currently has two dynamic metrics. Each is described below.

+
+
+

Brier Score

+

The Brier score is a metric that can be used with both classification and event-time models. For classification models, it computes the squared error between the observed outcome (encoded as 0/1) and the corresponding predicted probability for the class.

+

A little math: suppose that the value \(y_{ik}\) is a 0/1 indicator for whether the observed outcome \(i\) corresponds to class \(k\), and \(\hat{p}_{ik}\) is the estimated class probability. The classification score is then:

+

\[
Brier_{class} = \frac{1}{N}\sum_{i=1}^N\sum_{k=1}^C (y_{ik} - \hat{p}_{ik})^2
\]

+

For survival models, we transform the event time data into a binary version \(y_{it}\): is there an event at evaluation time \(t\)?1 The survival function estimate \(\hat{p}_{it}\) is the probability that corresponds to non-events at time \(t\). For example, if there has not been an event at the current evaluation time, our best model should estimate the survival probability to be near one. For observations that are events, the probability estimate is just one minus the survivor estimate. To account for censoring, we also weight each observation with \(w_{it}\). The time-dependent Brier score is:

+

\[
Brier_{surv}(t) = \frac{1}{N}\sum_{i=1}^N w_{it}\left[\underbrace{I(y_{it} = 0)(y_{it} - \hat{p}_{it})^2}_\text{non-events} + \underbrace{I(y_{it} = 1)(y_{it} - (1 - \hat{p}_{it}))^2}_\text{events}\right]
\]

+

For this score, a perfect model has a score of zero, while an uninformative model would have a score of around 1/4.
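The 1/4 comes from a model that hedges completely: predicting a probability of 0.5 for every observation gives a squared error of 0.25 no matter what the outcome is.

(1 - 0.5)^2   # an event predicted at 0.5
(0 - 0.5)^2   # a non-event predicted at 0.5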

+

How do we compute this using the yardstick package? The function brier_survival() follows the same convention as the other metric functions. The main arguments are:

+
  • data: the data frame with the predictions (structured as the output produced by augment(), as shown above).
  • truth: the name of the column with the Surv object.
  • ...: the name of the column with the dynamic predictions. Within tidymodels, this column is always called .pred. In other words, .pred should be passed without an argument name.

Since the evaluation times and the case weights are within the .pred column, there is no need to specify these. Here are the results of our validation set:

+
+
brier_scores <-
+  val_pred %>% 
+  brier_survival(truth = event_time, .pred)
+brier_scores
+#> # A tibble: 85 × 4
+#>    .metric        .estimator .eval_time .estimate
+#>    <chr>          <chr>           <dbl>     <dbl>
+#>  1 brier_survival standard         0      0      
+#>  2 brier_survival standard         0.25   0      
+#>  3 brier_survival standard         0.5    0.00202
+#>  4 brier_survival standard         0.75   0.00796
+#>  5 brier_survival standard         1      0.0266 
+#>  6 brier_survival standard         1.25   0.0402 
+#>  7 brier_survival standard         1.5    0.0563 
+#>  8 brier_survival standard         1.75   0.0785 
+#>  9 brier_survival standard         2      0.0895 
+#> 10 brier_survival standard         2.25   0.0951 
+#> # ℹ 75 more rows
+
+

Over time:

+
+
brier_scores %>% 
+  ggplot(aes(.eval_time, .estimate)) + 
+  geom_hline(yintercept = 1 / 4, col = "red", lty = 3) +
+  geom_line() +
+  geom_point() + 
+  labs(x = "time", y = "Brier score")
+
+

There is also an integrated Brier score. This requires the evaluation times as inputs, but instead of returning each result, it computes the area under the curve shown above. The syntax is the same but the result has a single row:

+
+
val_pred %>% brier_survival_integrated(truth = event_time, .pred)
+#> # A tibble: 1 × 3
+#>   .metric                   .estimator .estimate
+#>   <chr>                     <chr>          <dbl>
+#> 1 brier_survival_integrated standard       0.113
+
+

Again, smaller values are better.

+
+
+

Receiver Operating Characteristic (ROC) Curves

+

When we not only turn the event time data into a binary representation but also the predicted probabilities, we are in well-charted classification metrics territory. Sensitivity and specificity are common quantities to compute; we do so here in their weighted versions to account for censoring:

+
  • Sensitivity: How well do we predict the events? This is analogous to the true positive rate.
  • Specificity: How well do we predict the non-events? One minus specificity is the false positive rate.

These depend on the threshold used to turn predicted probabilities into predicted events/non-events. Let’s take a look at the distribution of the survival probabilities for our example data at an evaluation time of 5.00. The distributions are separated by the observed class and weighted by the censoring weights. Details of both aspects are the same as for the Brier score and can be found in the Accounting for Censoring in Performance Metrics for Event Time Data article.

More probability values are to the right of the 50% cutoff for the true non-events. Conversely, true events tend to have smaller probabilities. Using this cutoff, the sensitivity would be 66.8% and the specificity would be 82.3%. There are other possible cutoffs for the survival probabilities. Maybe one of these would have better statistics.

+

ROC curves were designed as a general method that, given a collection of continuous predictions, determines an effective threshold such that values above the threshold indicate a specific event. For our purposes, the ROC curve will compute the sensitivity and specificity for all possible probability thresholds. It then plots the true positive rate versus the false positive rate. Generally, we use the area under the ROC curve to quantify it with a single number. Values near one indicate a perfect model, while values near 1/2 occur with non-informative models.

+

Blanche et al (2013) gives a good overview of ROC curves for survival analysis and their Section 4.3 is most relevant here.

+

For our example at evaluation time \(t = 5.00\), the ROC curve is:

+
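The curve itself is computed behind the scenes; one way to obtain it, assuming yardstick also provides roc_curve_survival() (the curve analog of the roc_auc_survival() function used below), is:

val_pred %>%
  roc_curve_survival(truth = event_time, .pred) %>%
  filter(.eval_time == 5.00) %>%
  ggplot(aes(1 - specificity, sensitivity)) +
  geom_abline(col = "red", lty = 3) +
  geom_path() +
  coord_equal()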

The area under this curve is 0.807.

+

Since this is a dynamic metric, we compute the AUC for each evaluation time. The syntax is very similar to the Brier code shown above:

+
+
roc_scores <-
+  val_pred %>% 
+  roc_auc_survival(truth = event_time, .pred)
+roc_scores
+#> # A tibble: 85 × 4
+#>    .metric          .estimator .eval_time .estimate
+#>    <chr>            <chr>           <dbl>     <dbl>
+#>  1 roc_auc_survival standard         0        0.5  
+#>  2 roc_auc_survival standard         0.25     0.5  
+#>  3 roc_auc_survival standard         0.5      0.869
+#>  4 roc_auc_survival standard         0.75     0.852
+#>  5 roc_auc_survival standard         1        0.734
+#>  6 roc_auc_survival standard         1.25     0.768
+#>  7 roc_auc_survival standard         1.5      0.792
+#>  8 roc_auc_survival standard         1.75     0.777
+#>  9 roc_auc_survival standard         2        0.771
+#> 10 roc_auc_survival standard         2.25     0.778
+#> # ℹ 75 more rows
+
+

Over time:

+
+
roc_scores %>% 
+  ggplot(aes(.eval_time, .estimate)) + 
+  geom_hline(yintercept = 1 / 2, col = "red", lty = 3) +
+  geom_line() +
+  geom_point() + 
+  labs(x = "time", y = "ROC AUC")
+

The initial variation is due to so few events at the early stages of analysis.

+

ROC curves measure the separation between the classes, while the Brier score focuses more on accurate and well-calibrated predictions. It should not be surprising that the two metrics' results differ over time.

+
+
+

Tuning these metrics

+

Many of the event time models available in tidymodels have tuning parameters. The tune_*() functions and fit_resamples() have an eval_time argument used to pass the evaluation times. The statistics are computed for these time points using out-of-sample data.

+

In some cases, such as iterative search or racing methods, the functions need a single value to optimize. If a dynamic metric is used to guide the optimization, the first evaluation time given by the user will be used.

+

For example, if a model for these data was being optimized and we wanted a time of 5.00 to guide the process, we would need to use that value of 5.00 as the first element of time_points, the vector given to the eval_time argument in our example above.
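As a sketch of what that looks like in practice, assuming the eval_time argument described above and the sim_rs resamples created earlier, the guiding time is simply placed first in the vector:

# put the evaluation time that should guide optimization first
guided_times <- c(5.00, setdiff(time_points, 5.00))

set.seed(39)
resampled_metrics <- 
  bag_tree() %>% 
  set_mode("censored regression") %>% 
  set_engine("rpart") %>% 
  fit_resamples(
    event_time ~ .,
    resamples = sim_rs,
    metrics   = metric_set(brier_survival, roc_auc_survival),
    eval_time = guided_times
  )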

+
+
+

Summary

+

tidymodels has two time-dependent metrics for characterizing the performance of event-time models:

+
  • The Brier score measures the distance between the observed class result and the predicted probabilities.
  • ROC curves try to measure the separation between the two classes based on the survival probabilities.
+
+

Session information

+
+
#> ─ Session info ─────────────────────────────────────────────────────
+#>  setting  value
+#>  version  R version 4.3.0 (2023-04-21)
+#>  os       macOS Ventura 13.4
+#>  system   aarch64, darwin20
+#>  ui       X11
+#>  language (EN)
+#>  collate  en_US.UTF-8
+#>  ctype    en_US.UTF-8
+#>  tz       America/Los_Angeles
+#>  date     2023-07-02
+#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
+#> 
+#> ─ Packages ─────────────────────────────────────────────────────────
+#>  package    * version    date (UTC) lib source
+#>  broom      * 1.0.4      2023-03-11 [1] CRAN (R 4.3.0)
+#>  censored   * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)
+#>  dials      * 1.2.0      2023-04-03 [1] CRAN (R 4.3.0)
+#>  dplyr      * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
+#>  ggplot2    * 3.4.2      2023-04-03 [1] CRAN (R 4.3.0)
+#>  infer      * 1.0.4      2022-12-02 [1] CRAN (R 4.3.0)
+#>  parsnip    * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)
+#>  prodlim    * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)
+#>  purrr      * 1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
+#>  recipes    * 1.0.6      2023-04-25 [1] CRAN (R 4.3.0)
+#>  rlang        1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
+#>  rsample    * 1.1.1      2022-12-07 [1] CRAN (R 4.3.0)
+#>  tibble     * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
+#>  tidymodels * 1.1.0      2023-05-01 [1] CRAN (R 4.3.0)
+#>  tune       * 1.1.1      2023-04-11 [1] CRAN (R 4.3.0)
+#>  workflows  * 1.1.3      2023-02-22 [1] CRAN (R 4.3.0)
+#>  yardstick  * 1.2.0      2023-04-21 [1] CRAN (R 4.3.0)
+#> 
+#>  [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library
+#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
+#> 
+#> ────────────────────────────────────────────────────────────────────
+

Footnotes

  1. Again, see the Accounting for Censoring in Performance Metrics for Event Time Data article for more on this.↩︎
+ + + + \ No newline at end of file diff --git a/docs/listings.json b/docs/listings.json index 1ae7b8c0..19774780 100644 --- a/docs/listings.json +++ b/docs/listings.json @@ -3,6 +3,7 @@ "listing": "/learn/index.html", "items": [ "/start/case-study/index.html", + "/learn/statistics/survival-metrics-details/index.html", "/learn/models/calibration/index.html", "/learn/statistics/bootstrap/index.html", "/start/models/index.html", @@ -12,6 +13,7 @@ "/learn/develop/recipes/index.html", "/learn/work/case-weights/index.html", "/learn/develop/metrics/index.html", + "/learn/statistics/survival-metrics/index.html", "/start/resampling/index.html", "/learn/develop/models/index.html", "/learn/develop/parameters/index.html", diff --git a/docs/search.json b/docs/search.json index 3dfb3a29..e3063dbd 100644 --- a/docs/search.json +++ b/docs/search.json @@ -300,6 +300,76 @@ "section": "Session information", "text": "Session information\n\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0 2023-04-12 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────" }, + { + "objectID": "learn/statistics/survival-metrics/index.html", + "href": "learn/statistics/survival-metrics/index.html", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "", + "text": "To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You’ll need the development versions of censored and parsnip. To install these, use\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.\nMany dynamic metrics are similar to those used for binary classification models. 
The basic idea is that, for a given time \\(t\\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \\(t\\)?” version. We can also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. For details on both these aspects, see the Accounting for Censoring in Performance Metrics for Event Time Data article.\nTo start, let’s define the various types of times that will be mentioned:\n\nObserved time: time recorded in the data\nEvent time: observed times for actual events\nEvaluation time: the time, specified by the analyst, that the model is evaluated." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#introduction", + "href": "learn/statistics/survival-metrics/index.html#introduction", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "", + "text": "To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You’ll need the development versions of censored and parsnip. To install these, use\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.\nMany dynamic metrics are similar to those used for binary classification models. The basic idea is that, for a given time \\(t\\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \\(t\\)?” version. We can also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. For details on both these aspects, see the Accounting for Censoring in Performance Metrics for Event Time Data article.\nTo start, let’s define the various types of times that will be mentioned:\n\nObserved time: time recorded in the data\nEvent time: observed times for actual events\nEvaluation time: the time, specified by the analyst, that the model is evaluated." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#example-data", + "href": "learn/statistics/survival-metrics/index.html#example-data", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Example data", + "text": "Example data\nAs an example, we’ll simulate some data with the prodlim package, using the methods of Bender et al (2005). A training and a validation set are simulated. 
We’ll also load the censored package so that we can fit a model to these time-to-event data:\n\nlibrary(tidymodels)\nlibrary(censored)\n#> Loading required package: survival\nlibrary(prodlim)\n\nset.seed(5882)\nsim_dat <- SimSurv(2000) %>%\n mutate(event_time = Surv(time, event)) %>%\n select(event_time, X1, X2)\n\nset.seed(2)\nsplit <- initial_split(sim_dat)\nsim_tr <- training(split)\nsim_val <- testing(split)\n\n## Resampling object\nsim_rs <- vfold_cv(sim_tr)\n\nWe’ll need a model to illustrate the code and concepts. Let’s fit a bagged survival tree model to the training set:\n\nset.seed(17)\nbag_tree_fit <- \n bag_tree() %>% \n set_mode(\"censored regression\") %>% \n set_engine(\"rpart\") %>% \n fit(event_time ~ ., data = sim_tr)\nbag_tree_fit\n#> parsnip model object\n#> \n#> \n#> Bagging survival trees with 25 bootstrap replications \n#> \n#> Call: bagging.data.frame(formula = event_time ~ ., data = data)\n\nUsing this model, we’ll make predictions of different types." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#survival-probability-prediction", + "href": "learn/statistics/survival-metrics/index.html#survival-probability-prediction", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Survival Probability Prediction", + "text": "Survival Probability Prediction\nThis censored regression model can make static predictions via the predicted event time using predict(object, type = \"time\"). It can also create dynamic predictions regarding the probability of survival for each data point at specific times. The syntax for this is\npredict(object, new_data, type = \"survival\", eval_time = numeric())\nwhere eval_time is a vector of time points at which we want the corresponding survivor function estimates. Alternatively, we can use the augment() function to get both types of prediction and automatically attach them to the data.\nThe largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21.\n\ntime_points <- seq(0, 21, by = 0.25)\n\nval_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)\nval_pred\n#> # A tibble: 500 × 5\n#> .pred .pred_time event_time X1 X2\n#> <list> <dbl> <Surv> <dbl> <dbl>\n#> 1 <tibble [85 × 5]> 6.66 4.831957 1 -0.630 \n#> 2 <tibble [85 × 5]> 6.66 6.110031 1 -0.606 \n#> 3 <tibble [85 × 5]> 7.47 6.597774+ 1 -1.03 \n#> 4 <tibble [85 × 5]> 3.29 2.717484 1 0.811 \n#> 5 <tibble [85 × 5]> 5.10 4.727042+ 1 -0.376 \n#> 6 <tibble [85 × 5]> 4.99 8.699061 0 1.18 \n#> 7 <tibble [85 × 5]> 7.23 10.818670 1 -0.851 \n#> 8 <tibble [85 × 5]> 6.46 6.886378 0 0.493 \n#> 9 <tibble [85 × 5]> 4.75 2.451893+ 1 0.0207\n#> 10 <tibble [85 × 5]> 13.4 8.231911+ 0 -1.52 \n#> # ℹ 490 more rows\n\nThe observed data are in the event_time column. The predicted survival probabilities are in the .pred column. 
This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column .pred_survival.\n\nval_pred$.pred[[1]]\n#> # A tibble: 85 × 5\n#> .eval_time .pred_survival .weight_time .pred_censored .weight_censored\n#> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0 1 0 1 1 \n#> 2 0.25 1 0.250 0.999 1.00\n#> 3 0.5 0.999 0.500 0.996 1.00\n#> 4 0.75 0.992 0.750 0.993 1.01\n#> 5 1 0.988 1.00 0.991 1.01\n#> 6 1.25 0.980 1.25 0.987 1.01\n#> 7 1.5 0.972 1.50 0.981 1.02\n#> 8 1.75 0.959 1.75 0.971 1.03\n#> 9 2 0.938 2.00 0.966 1.04\n#> 10 2.25 0.925 2.25 0.959 1.04\n#> # ℹ 75 more rows\n\nThe yardstick package currently has two dynamic metrics. Each is described below." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#brier-score", + "href": "learn/statistics/survival-metrics/index.html#brier-score", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Brier Score", + "text": "Brier Score\nThe Brier score is a metric that can be used with both classification and event-time models. For classification models, it computes the squared error between the observed outcome (encoded as 0/1) and the corresponding predicted probability for the class.\nA little math: suppose that the value \\(y_{ik}\\) is a 0/1 indicator for whether the observed outcome \\(i\\) corresponds to class \\(k\\), and \\(\\hat{p}_{ik}\\) is the estimated class probability. The classification score is then:\n\\[\nBrier_{class} = \\frac{1}{N}\\sum_{i=1}^N\\sum_{k=1}^C (y_{ik} - \\hat{p}_{ik})^2\n\\]\nFor survival models, we transform the event time data into a binary version \\(y_{it}\\): is there an event at evaluation time \\(t\\)1. The survival function estimate \\(\\hat{p}_{it}\\) is the probability that corresponds to non-events at time \\(t\\). For example, if there has not been an event at the current evaluation time, our best model should estimate the survival probability to be near one. For observations that are events, the probability estimate is just one minus the survivor estimate. To account for censoring, we also weight each observation with \\(w_{it}\\). The time-dependent Brier score is:\n\\[\nBrier_{surv}(t) = \\frac{1}{N}\\sum_{i=1}^N w_{it}\\left[\\underbrace{I(y_{it} = 0)(y_{it} - \\hat{p}_{it})^2}_\\text{non-events} + \\underbrace{I(y_{it} = 1)(y_{it} - (1 - \\hat{p}_{it}))^2}_\\text{events}\\right]\n\\]\nFor this score, a perfect model has a score of zero, while an uninformative model would have a score of around 1/4.\nHow do we compute this using the yardstick package? The function brier_survival() follows the same convention as the other metric functions. The main arguments are:\n\ndata: the data frame with the predictions (structured as the output produced by augment(), as shown above).\ntruth: the name of the column with the Surv object.\n...: the name of the column with the dynamic predictions. Within tidymodels, this column is always called .pred. In other words, .pred should be passed without an argument name.\n\nSince the evaluation times and the case weights are within the .pred column, there is no need to specify these. 
Here are the results of our validation set:\n\nbrier_scores <-\n val_pred %>% \n brier_survival(truth = event_time, .pred)\nbrier_scores\n#> # A tibble: 85 × 4\n#> .metric .estimator .eval_time .estimate\n#> <chr> <chr> <dbl> <dbl>\n#> 1 brier_survival standard 0 0 \n#> 2 brier_survival standard 0.25 0 \n#> 3 brier_survival standard 0.5 0.00202\n#> 4 brier_survival standard 0.75 0.00796\n#> 5 brier_survival standard 1 0.0266 \n#> 6 brier_survival standard 1.25 0.0402 \n#> 7 brier_survival standard 1.5 0.0563 \n#> 8 brier_survival standard 1.75 0.0785 \n#> 9 brier_survival standard 2 0.0895 \n#> 10 brier_survival standard 2.25 0.0951 \n#> # ℹ 75 more rows\n\nOver time:\n\nbrier_scores %>% \n ggplot(aes(.eval_time, .estimate)) + \n geom_hline(yintercept = 1 / 4, col = \"red\", lty = 3) +\n geom_line() +\n geom_point() + \n labs(x = \"time\", y = \"Brier score\")\n\n\n\n\n\n\n\n\nThere is also an integrated Brier score. This required evaluation times as inputs but instead of returning each result, it takes the area under the plot shown above. The syntax is the same but the result has a single row:\n\nval_pred %>% brier_survival_integrated(truth = event_time, .pred)\n#> # A tibble: 1 × 3\n#> .metric .estimator .estimate\n#> <chr> <chr> <dbl>\n#> 1 brier_survival_integrated standard 0.113\n\nAgain, smaller values are better." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#receiver-operating-characteristic-roc-curves", + "href": "learn/statistics/survival-metrics/index.html#receiver-operating-characteristic-roc-curves", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Receiver Operating Characteristic (ROC) Curves", + "text": "Receiver Operating Characteristic (ROC) Curves\nWhen we not only turn the event time data into a binary representation but also the predicted probabilities, we are in well-chartered classification metrics territory. Sensitivity and specificity are common quantities to compute, we do so here in their weighted version to account for censoring:\n\nSensitivity: How well do we predict the events? This is analogous to the true positive rate.\nSpecificity: How well do we predict the non-events? One minus specificity is the false positive rate.\n\nThese depend on the threshold used to turn predicted probabilities into predicted events/non-events. Let’s take a look at the distribution of the survival probabilities for our example data at an evaluation time of 5.00. The distributions are separated by the observed class and weighted by the censoring weights. Details of both aspects are the same as for the Brier score and can be found in the Accounting for Censoring in Performance Metrics for Event Time Data article.\n\n\n\n\n\n\n\n\n\nMore probability values are to the right of the 50% cutoff for the true non-events. Conversely, true events tend to have smaller probabilities. Using this cutoff, the sensitivity would be 66.8% and the specificity would be 82.3%. There are other possible cutoffs for the survival probabilities. Maybe one of these would have better statistics.\nROC curves were designed as a general method that, given a collection of continuous predictions, determines an effective threshold such that values above the threshold indicate a specific event. For our purposes, the ROC curve will compute the sensitivity and specificity for all possible probability thresholds. It then plots the true positive rate versus the false positive rate. Generally, we use the area under the ROC curve to quantify it with a single number. 
Values near one indicate a perfect model, while values near 1/2 occur with non-informative models.\nBlanche et al (2013) gives a good overview of ROC curves for survival analysis and their Section 4.3 is most relevant here.\nFor our example at evaluation time \\(t = 5.00\\), the ROC curve is:\n\n\n\n\n\n\n\n\n\nThe area under this curve is 0.807.\nSince this is a dynamic metric, we compute the AUC for each evaluation time. The syntax is very similar to the Brier code shown above:\n\nroc_scores <-\n val_pred %>% \n roc_auc_survival(truth = event_time, .pred)\nroc_scores\n#> # A tibble: 85 × 4\n#> .metric .estimator .eval_time .estimate\n#> <chr> <chr> <dbl> <dbl>\n#> 1 roc_auc_survival standard 0 0.5 \n#> 2 roc_auc_survival standard 0.25 0.5 \n#> 3 roc_auc_survival standard 0.5 0.869\n#> 4 roc_auc_survival standard 0.75 0.852\n#> 5 roc_auc_survival standard 1 0.734\n#> 6 roc_auc_survival standard 1.25 0.768\n#> 7 roc_auc_survival standard 1.5 0.792\n#> 8 roc_auc_survival standard 1.75 0.777\n#> 9 roc_auc_survival standard 2 0.771\n#> 10 roc_auc_survival standard 2.25 0.778\n#> # ℹ 75 more rows\n\nOver time:\n\nroc_scores %>% \n ggplot(aes(.eval_time, .estimate)) + \n geom_hline(yintercept = 1 / 2, col = \"red\", lty = 3) +\n geom_line() +\n geom_point() + \n labs(x = \"time\", y = \"ROC AUC\")\n\n\n\n\n\n\n\n\nThe initial variation is due to so few events at the early stages of analysis.\nThe ROC measures the separation between classes and the Brier score focuses more on accurate and well-calibrated predictions. It should not be surprising that each metric’s results over time differ." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#tuning-these-metrics", + "href": "learn/statistics/survival-metrics/index.html#tuning-these-metrics", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Tuning these metrics", + "text": "Tuning these metrics\nMany of the event time models available in tidymodels have tuning parameters. The tune_*() functions and fit_resamples() have an eval_time argument used to pass the evaluation times. The statistics are computed for these time points using out-of-sample data.\nIn some cases, such as iterative search or racing methods, the functions need a single value to optimize. If a dynamic metric is used to guide the optimization, the first evaluation time given by the user will be used.\nFor example, if a model for these data was being optimized, and we wanted a time of 5.00 to guide the process, we would need to use that value of 5.00 as the first element time_points, the vector given to the eval_time argument in our example above." + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#summary", + "href": "learn/statistics/survival-metrics/index.html#summary", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Summary", + "text": "Summary\ntidymodels has two time-dependent metrics for characterizing the performance of event-time models:\n\nThe Brier score measures the distance between the observed class result and the predicted probabilities.\nROC curves try to measure the separation between the two classes based on the survival probabilities." 
+ }, + { + "objectID": "learn/statistics/survival-metrics/index.html#session-info", + "href": "learn/statistics/survival-metrics/index.html#session-info", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Session information", + "text": "Session information\n\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> censored * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)\n#> prodlim * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────" + }, + { + "objectID": "learn/statistics/survival-metrics/index.html#footnotes", + "href": "learn/statistics/survival-metrics/index.html#footnotes", + "title": "Dynamic Performance Metrics for Event Time Data", + "section": "Footnotes", + "text": "Footnotes\n\n\nAgain, see the Accounting for Censoring in Performance Metrics for Event Time Data article for more on this.↩︎" + }, { "objectID": "learn/statistics/k-means/index.html", "href": "learn/statistics/k-means/index.html", @@ -508,7 +578,7 @@ "href": "learn/index.html", "title": "Learn", "section": "", - "text": "After you know what you need to get started with tidymodels, you can learn more and go further. 
Find articles here to help you solve specific problems using the tidymodels framework.\n\n\n\n\n\n\n\n\n\n\n\n\nA predictive modeling case study\n\n\n\nmodel fitting\n\n\ntuning\n\n\nparsnip\n\n\nrecipes\n\n\nrsample\n\n\nworkflows\n\n\ntune\n\n\n\nDevelop, from beginning to end, a predictive model using best practices.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAn introduction to calibration with tidymodels\n\n\n\nprobably\n\n\nyardstick\n\n\nclassification\n\n\ncalibration\n\n\n\nLearn how the probably package can improve classification and regression models.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBootstrap resampling and tidy regression models\n\n\n\nstatistical analysis\n\n\nbootstrapping\n\n\ntidying results\n\n\nconfidence intervals\n\n\n\nApply bootstrap resampling to estimate uncertainty in model parameters.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBuild a model\n\n\n\nmodel fitting\n\n\nparsnip\n\n\nbroom\n\n\n\nGet started by learning how to specify and train a model using tidymodels.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nClassification models using a neural network\n\n\n\nmodel fitting\n\n\ntorch\n\n\nneural networks\n\n\n\nTrain a classification model and evaluate its performance.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCorrelation and regression fundamentals with tidy data principles\n\n\n\nstatistical analysis\n\n\ncorrelation\n\n\ntidying results\n\n\n\nAnalyze the results of correlation tests and simple regression models for many data sets at once.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate your own broom tidier methods\n\n\n\ndeveloper tools\n\n\n\nWrite tidy(), glance(), and augment() methods for new model objects.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate your own recipe step function\n\n\n\ndeveloper tools\n\n\n\nWrite a new recipe step for data preprocessing.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreating case weights based on time\n\n\n\nmodel fitting\n\n\ncase weights\n\n\ntime series\n\n\n\nCreate models that use coefficients, extract them from fitted models, and visualize them.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCustom performance metrics\n\n\n\ndeveloper tools\n\n\n\nCreate a new performance metric and integrate it with yardstick functions.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEvaluate your model with resampling\n\n\n\nresampling\n\n\nrsample\n\n\nparsnip\n\n\ntune\n\n\nworkflows\n\n\nyardstick\n\n\n\nMeasure model performance by generating different versions of the training data through resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow to build a parsnip model\n\n\n\ndeveloper tools\n\n\n\nCreate a parsnip model function from an existing model implementation.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow to create a tuning parameter function\n\n\n\ndeveloper tools\n\n\n\nBuild functions to use in tuning both quantitative and qualitative parameters.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHypothesis testing using resampling and tidy data\n\n\n\nstatistical analysis\n\n\nhypothesis testing\n\n\nbootstrapping\n\n\n\nPerform common hypothesis tests for statistical inference using flexible functions.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIterative Bayesian optimization of a classification model\n\n\n\nmodel tuning\n\n\nBayesian optimization\n\n\nSVMs\n\n\n\nIdentify the best hyperparameters for a model using Bayesian optimization of iterative search.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nK-means clustering with tidy data principles\n\n\n\nstatistical analysis\n\n\nclustering\n\n\ntidying results\n\n\n\nSummarize clustering characteristics and estimate the best number of clusters for a data set.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel tuning via 
grid search\n\n\n\nmodel tuning\n\n\nSVMs\n\n\n\nChoose hyperparameters for a model by training on a grid of many possible parameter values.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModeling time series with tidy resampling\n\n\n\nmodel fitting\n\n\ntime series\n\n\n\nCalculate performance estimates for time series forecasts using resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMultivariate analysis using partial least squares\n\n\n\npre-processing\n\n\nmultivariate analysis\n\n\npartial least squares\n\n\n\nBuild and fit a predictive model with more than one outcome.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNested resampling\n\n\n\nnested resampling\n\n\nSVMs\n\n\n\nEstimate the best hyperparameters for a model using nested resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPreprocess your data with recipes\n\n\n\npre-processing\n\n\nrecipes\n\n\nparsnip\n\n\nworkflows\n\n\nyardstick\n\n\nbroom\n\n\n\nPrepare data for modeling with modular preprocessing steps.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRegression models two ways\n\n\n\nmodel fitting\n\n\nrandom forests\n\n\nlinear regression\n\n\n\nCreate and train different kinds of regression models with different computational engines.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nStatistical analysis of contingency tables\n\n\n\nstatistical analysis\n\n\nanalysis of tables\n\n\nhypothesis testing\n\n\n\nUse tests of independence and goodness of fit to analyze tables of counts.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubsampling for class imbalances\n\n\n\nmodel fitting\n\n\npre-processing\n\n\nclass imbalances\n\n\ndiscriminant analysis\n\n\n\nImprove model performance in imbalanced data sets through undersampling or oversampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTune model parameters\n\n\n\ntuning\n\n\nrsample\n\n\nparsnip\n\n\ntune\n\n\ndials\n\n\nworkflows\n\n\nyardstick\n\n\n\nEstimate the best values for hyperparameters that cannot be learned directly during model training.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTuning text models\n\n\n\nmodel tuning\n\n\ntext analysis\n\n\nlogistic regression\n\n\nBayesian optimization\n\n\nextracting results\n\n\n\nPrepare text data for predictive modeling and tune with both grid and iterative search.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWorking with model coefficients\n\n\n\nmodel fitting\n\n\ntidying results\n\n\nlinear regression\n\n\nmodel tuning\n\n\n\nCreate models that use coefficients, extract them from fitted models, and visualize them.\n\n\n\n\n\n\n\n\n\n\n\n\nNo matching items" + "text": "After you know what you need to get started with tidymodels, you can learn more and go further. 
Find articles here to help you solve specific problems using the tidymodels framework.\n\n\n\n\n\n\n\n\n\n\n\n\nA predictive modeling case study\n\n\n\nmodel fitting\n\n\ntuning\n\n\nparsnip\n\n\nrecipes\n\n\nrsample\n\n\nworkflows\n\n\ntune\n\n\n\nDevelop, from beginning to end, a predictive model using best practices.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAccounting for Censoring in Performance Metrics for Event Time Data\n\n\n\nstatistical analysis\n\n\nsurvival analysis\n\n\n\nLearn how tidymodels uses causal inference tools to measure performance of survival models.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAn introduction to calibration with tidymodels\n\n\n\nprobably\n\n\nyardstick\n\n\nclassification\n\n\ncalibration\n\n\n\nLearn how the probably package can improve classification and regression models.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBootstrap resampling and tidy regression models\n\n\n\nstatistical analysis\n\n\nbootstraping\n\n\ntidying results\n\n\nconfidence intervals\n\n\n\nApply bootstrap resampling to estimate uncertainty in model parameters.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBuild a model\n\n\n\nmodel fitting\n\n\nparsnip\n\n\nbroom\n\n\n\nGet started by learning how to specify and train a model using tidymodels.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nClassification models using a neural network\n\n\n\nmodel fitting\n\n\ntorch\n\n\nneural networks\n\n\n\nTrain a classification model and evaluate its performance.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCorrelation and regression fundamentals with tidy data principles\n\n\n\nstatistical analysis\n\n\ncorrelation\n\n\ntidying results\n\n\n\nAnalyze the results of correlation tests and simple regression models for many data sets at once.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate your own broom tidier methods\n\n\n\ndeveloper tools\n\n\n\nWrite tidy(), glance(), and augment() methods for new model objects.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate your own recipe step function\n\n\n\ndeveloper tools\n\n\n\nWrite a new recipe step for data preprocessing.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreating case weights based on time\n\n\n\nmodel fitting\n\n\ncase weights\n\n\ntime series\n\n\n\nCreate models that use coefficients, extract them from fitted models, and visualize them.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCustom performance metrics\n\n\n\ndeveloper tools\n\n\n\nCreate a new performance metric and integrate it with yardstick functions.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDynamic Performance Metrics for Event Time Data\n\n\n\nstatistical analysis\n\n\nsurvival analysis\n\n\n\nLet’s discuss how to compute modern performance metrics for time-to-event models.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEvaluate your model with resampling\n\n\n\nresampling\n\n\nrsample\n\n\nparsnip\n\n\ntune\n\n\nworkflows\n\n\nyardstick\n\n\n\nMeasure model performance by generating different versions of the training data through resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow to build a parsnip model\n\n\n\ndeveloper tools\n\n\n\nCreate a parsnip model function from an existing model implementation.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow to create a tuning parameter function\n\n\n\ndeveloper tools\n\n\n\nBuild functions to use in tuning both quantitative and qualitative parameters.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHypothesis testing using resampling and tidy data\n\n\n\nstatistical analysis\n\n\nhypothesis testing\n\n\nbootstraping\n\n\n\nPerform common hypothesis tests for statistical inference using flexible functions.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIterative Bayesian optimization of a 
classification model\n\n\n\nmodel tuning\n\n\nBayesian optimization\n\n\nSVMs\n\n\n\nIdentify the best hyperparameters for a model using Bayesian optimization of iterative search.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nK-means clustering with tidy data principles\n\n\n\nstatistical analysis\n\n\nclustering\n\n\ntidying results\n\n\n\nSummarize clustering characteristics and estimate the best number of clusters for a data set.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel tuning via grid search\n\n\n\nmodel tuning\n\n\nSVMs\n\n\n\nChoose hyperparameters for a model by training on a grid of many possible parameter values.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModeling time series with tidy resampling\n\n\n\nmodel fitting\n\n\ntime series\n\n\n\nCalculate performance estimates for time series forecasts using resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMultivariate analysis using partial least squares\n\n\n\npre-processing\n\n\nmultivariate analysis\n\n\npartial least squares\n\n\n\nBuild and fit a predictive model with more than one outcome.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNested resampling\n\n\n\nnested resampling\n\n\nSVMs\n\n\n\nEstimate the best hyperparameters for a model using nested resampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPreprocess your data with recipes\n\n\n\npre-processing\n\n\nrecipes\n\n\nparsnip\n\n\nworkflows\n\n\nyardstick\n\n\nbroom\n\n\n\nPrepare data for modeling with modular preprocessing steps.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRegression models two ways\n\n\n\nmodel fitting\n\n\nrandom forests\n\n\nlinear regression\n\n\n\nCreate and train different kinds of regression models with different computational engines.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nStatistical analysis of contingency tables\n\n\n\nstatistical analysis\n\n\nanalysis of tables\n\n\nhypothesis testing\n\n\n\nUse tests of independence and goodness of fit to analyze tables of counts.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubsampling for class imbalances\n\n\n\nmodel fitting\n\n\npre-processing\n\n\nclass imbalances\n\n\ndiscriminant analysis\n\n\n\nImprove model performance in imbalanced data sets through undersampling or oversampling.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTune model parameters\n\n\n\ntuning\n\n\nrsample\n\n\nparsnip\n\n\ntune\n\n\ndials\n\n\nworkflows\n\n\nyardstick\n\n\n\nEstimate the best values for hyperparameters that cannot be learned directly during model training.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTuning text models\n\n\n\nmodel tuning\n\n\ntext analysis\n\n\nlogistic regression\n\n\nBayesian optimization\n\n\nextracting results\n\n\n\nPrepare text data for predictive modeling and tune with both grid and iterative search.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWorking with model coefficients\n\n\n\nmodel fitting\n\n\ntidying results\n\n\nlinear regression\n\n\nmodel tuning\n\n\n\nCreate models that use coefficients, extract them from fitted models, and visualize them.\n\n\n\n\n\n\n\n\n\n\n\n\nNo matching items" }, { "objectID": "learn/develop/parameters/index.html", @@ -1175,6 +1245,62 @@ "section": "Session information", "text": "Session information\n\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages 
─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0 2023-04-12 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────" }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html", + "href": "learn/statistics/survival-metrics-details/index.html", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "", + "text": "To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. You’ll need the development versions of censored and parsnip. To install these, use\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.\nMany dynamic metrics are similar to those used for binary classification models. Examples include the Brier score and ROC curves (see the Dynamic Performance Metrics for Event Time Data article for details). The basic idea is that, for a given time \\(t\\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \\(t\\)?” version. We also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. This article covers both those aspects in detail to complement the main article on performance metrics for event time data.\nTo start, let’s define the various types of times that will be mentioned:\n\nObserved time: time recorded in the data\nEvent time: observed times for actual events\nEvaluation time: the time, specified by the analyst, that the model is evaluated." + }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#introduction", + "href": "learn/statistics/survival-metrics-details/index.html#introduction", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "", + "text": "To use code in this article, you will need to install the following packages: censored, prodlim, and tidymodels. 
You’ll need the development versions of censored and parsnip. To install these, use\n#install.packages(\"pak\")\n\npak::pak(c(\"tidymodels/censored\", \"tidymodels/parsnip\"))\nOne trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics.\nMany dynamic metrics are similar to those used for binary classification models. Examples include the Brier score and ROC curves (see the Dynamic Performance Metrics for Event Time Data article for details). The basic idea is that, for a given time \\(t\\) for model evaluation, we try to encode the observed event time data into a binary “has there been an event at time \\(t\\)?” version. We also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring.\nCensoring plays into the details of the conversion and is additionally captured in the form of weights. This article covers both those aspects in detail to complement the main article on performance metrics for event time data.\nTo start, let’s define the various types of times that will be mentioned:\n\nObserved time: time recorded in the data\nEvent time: observed times for actual events\nEvaluation time: the time, specified by the analyst, that the model is evaluated." + }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#example-data", + "href": "learn/statistics/survival-metrics-details/index.html#example-data", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Example data", + "text": "Example data\nAs an example, we’ll simulate some data with the prodlim package, using the methods of Bender et al (2005). A training and a validation set are simulated. We’ll also load the censored package so that we can fit a model to these time-to-event data:\n\nlibrary(tidymodels)\nlibrary(censored)\n#> Loading required package: survival\nlibrary(prodlim)\n\nset.seed(5882)\nsim_dat <- SimSurv(2000) %>%\n mutate(event_time = Surv(time, event)) %>%\n select(event_time, X1, X2)\n\nset.seed(2)\nsplit <- initial_split(sim_dat)\nsim_tr <- training(split)\nsim_val <- testing(split)\n\nWe’ll need a model to illustrate the code and concepts. Let’s fit a bagged survival tree model to the training set:\n\nset.seed(17)\nbag_tree_fit <- \n bag_tree() %>% \n set_mode(\"censored regression\") %>% \n set_engine(\"rpart\") %>% \n fit(event_time ~ ., data = sim_tr)\nbag_tree_fit\n#> parsnip model object\n#> \n#> \n#> Bagging survival trees with 25 bootstrap replications \n#> \n#> Call: bagging.data.frame(formula = event_time ~ ., data = data)\n\nUsing this model, we can make predictions of different types and augment() provides us with a version of the data augmented with the various predictions. Here we are interested in the predicted probability of survival at different evaluation time points. 
The largest event time in the training set is 21.083 so we will use a set of evaluation times between zero and 21.\n\ntime_points <- seq(0, 21, by = 0.25)\n\nval_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points)\nval_pred\n#> # A tibble: 500 × 5\n#> .pred .pred_time event_time X1 X2\n#> <list> <dbl> <Surv> <dbl> <dbl>\n#> 1 <tibble [85 × 5]> 6.66 4.831957 1 -0.630 \n#> 2 <tibble [85 × 5]> 6.66 6.110031 1 -0.606 \n#> 3 <tibble [85 × 5]> 7.47 6.597774+ 1 -1.03 \n#> 4 <tibble [85 × 5]> 3.29 2.717484 1 0.811 \n#> 5 <tibble [85 × 5]> 5.10 4.727042+ 1 -0.376 \n#> 6 <tibble [85 × 5]> 4.99 8.699061 0 1.18 \n#> 7 <tibble [85 × 5]> 7.23 10.818670 1 -0.851 \n#> 8 <tibble [85 × 5]> 6.46 6.886378 0 0.493 \n#> 9 <tibble [85 × 5]> 4.75 2.451893+ 1 0.0207\n#> 10 <tibble [85 × 5]> 13.4 8.231911+ 0 -1.52 \n#> # ℹ 490 more rows\n\nThe observed data are in the event_time column. The predicted survival probabilities are in the .pred column. This is a list column with a data frame for each observation, containing the predictions at the 85 evaluation time points in the (nested) column .pred_survival.\n\nval_pred$.pred[[1]]\n#> # A tibble: 85 × 5\n#> .eval_time .pred_survival .weight_time .pred_censored .weight_censored\n#> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0 1 0 1 1 \n#> 2 0.25 1 0.250 0.999 1.00\n#> 3 0.5 0.999 0.500 0.996 1.00\n#> 4 0.75 0.992 0.750 0.993 1.01\n#> 5 1 0.988 1.00 0.991 1.01\n#> 6 1.25 0.980 1.25 0.987 1.01\n#> 7 1.5 0.972 1.50 0.981 1.02\n#> 8 1.75 0.959 1.75 0.971 1.03\n#> 9 2 0.938 2.00 0.966 1.04\n#> 10 2.25 0.925 2.25 0.959 1.04\n#> # ℹ 75 more rows\n\nFirst, let’s dive into how to convert the observed event time in event_time to a binary version. Then we will discuss the remaining columns as part of generating the required weights for the dynamic performance metrics." + }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#converting-censored-data-to-binary-data", + "href": "learn/statistics/survival-metrics-details/index.html#converting-censored-data-to-binary-data", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Converting censored data to binary data", + "text": "Converting censored data to binary data\nTo assess model performance at evaluation time \\(t\\), we turn the observed event time data into a binary “was there an event at time \\(t\\)?” version. For this, we follow the process described by Graf et al (1999) where observations are categorized into three groups, at evaluation time \\(t\\).\n\nCategory 1 - Events: Evaluation time is greater than or equal to the event time (“it has already happened”).\nCategory 2 - Non-events: Evaluation time is less than the observed time, censored or not (“nothing has happened yet”).\nCategory 3 - Ambiguous outcomes: Evaluation time is greater than or equal to the observed censored time (“we don’t know if anything might have happened by now”).\n\nWe can use binary versions of the observations in the first two categories to compute binary performance metrics, but the observations in the third category are not used directly in these calculations. (They do influence the calculation of the weights, see next section.) So our usable sample size changes with the evaluation time." 
+ }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#censoring-weights", + "href": "learn/statistics/survival-metrics-details/index.html#censoring-weights", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Censoring weights", + "text": "Censoring weights\nUnfortunately, this categorization scheme alone is not sufficient to compute metrics. Graf et al took a page from the causal inference literature and added a propensity-type score based on the likelihood that each data point would be censored (regardless of the observed event status). This is not the probability that the original time-to-event data point is censored but rather the probability that at evaluation time, we have not observed an event (or a censoring) yet, i.e., that the data point falls into category 2.\nHow do we compute this probability? The standard approach is to compute a “reverse Kaplan-Meier” (RKM) curve. Ordinarily, the Kaplan-Meier (KM) curve models the probability of survival. For the reverse Kaplan-Meier curve, the meaning of the status indicator is flipped, i.e., the event of interest changes from “death” to “censoring”. This should give us a reasonably reliable non-parametric model for estimating the probability of being censored at a given time.\nEvery time a censored regression model is created using tidymodels, the RKM is estimated on the same data used to fit the model and attached to the parsnip object.\nFor our simulated data, here is what the RKM curve looks like:\n\n\n\n\n\n\n\n\n\nThe red rug on the bottom shows the training point event times and the blue values at the top represent the times for the censored training set observations. As (evaluation) time increases, we pass more and more observed time points, and the probability of being censored, i.e., the probability of an observation falling into category 2, decreases.\nThe weights used in the calculation of the dynamic performance metrics are the inverse of these probabilities, hence the name “inverse probability of censoring weights” (IPCW). These weights should theoretically balance the exposure/effect/influence that the definitive observations have on performance calculations.\n\nThe finer details\nFirst, when do we evaluate the probability of censoring? There are different approaches used in the literature, and we follow what Graf et al suggest (as it seems most appropriate):\n\nIf the evaluation time is less than the observed time (like in category 2), the evaluation time is used to predict the probability of censoring.\nIf the evaluation time is greater than or equal to the event time (like in category 1), the event time is used to predict the probability of censoring.\nIf the evaluation time is greater than or equal to the observed censoring time, the observation falls into category 3 and is not used, i.e., no weight is needed.\n\nWe call this time at which to predict the probability of censoring the weight time. 
Here’s an example using the first data point in the validation set:\n\ndyn_val_pred <- \n val_pred %>% \n select(.pred, event_time) %>% \n add_rowindex() %>% \n unnest(.pred) \n\ndyn_val_pred %>% \n filter(.row == 1 & .eval_time %in% c(1, 2, 4, 5, 10)) %>% \n select(event_time, .eval_time, .weight_time, .pred_censored, .weight_censored)\n#> # A tibble: 5 × 5\n#> event_time .eval_time .weight_time .pred_censored .weight_censored\n#> <Surv> <dbl> <dbl> <dbl> <dbl>\n#> 1 4.831957 1 1.00 0.991 1.01\n#> 2 4.831957 2 2.00 0.966 1.04\n#> 3 4.831957 4 4.00 0.848 1.18\n#> 4 4.831957 5 4.83 0.786 1.27\n#> 5 4.831957 10 4.83 0.786 1.27\n\nThis observation was an event, observed at time 4.832 The column .weight_time captures at which time the probability of censoring was calculated. Up until that event time, the probability of being censored is computed at the evaluation time. After that, it is based on the event time.\nWe add a slight modification to the weight time: If our evaluation time is today, we don’t have today’s data yet. In tidymodels, we calculate the probability of censoring just before the requested weight time. We are basically subtracting a small numerical value from the weight time used in the RKM model. You’ll only really see a difference if there is a bulk of censored observations at the original evaluation time.\nFinally, we use a simple RKM curve (i.e., no covariates or strata). This implies that there is non-informative censoring. Other applications of IPCW try to mitigate the effects of informative censoring. In the future, we may allow the censoring model to include covariates (as well as models beyond the RKM)." + }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#illustration-confusion-matrix", + "href": "learn/statistics/survival-metrics-details/index.html#illustration-confusion-matrix", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Illustration: Confusion matrix", + "text": "Illustration: Confusion matrix\nTo illustrate how these two tools for accounting for censoring are used in calculating dynamic performance metrics, we’ll take a look here at the 2x2 confusion matrices at a few evaluation time points. More details on performance metrics for censored data are in the aforementioned Dynamic Performance Metrics for Event Time Data article.\nFirst, let’s turn the observed event time data and the predictions into their binary versions.\n\ntime_as_binary_event <- function(surv, eval_time) {\n event_time <- .extract_surv_time(surv)\n status <- .extract_surv_status(surv)\n is_event_before_t <- event_time <= eval_time & status == 1\n \n # Three possible contributions to the statistic from Graf 1999\n # Censoring time before eval_time, no contribution (Graf category 3)\n binary_res <- rep(NA_character_, length(event_time))\n \n # A real event prior to eval_time (Graf category 1)\n binary_res <- if_else(is_event_before_t, \"event\", binary_res)\n \n # Observed time greater than eval_time (Graf category 2)\n binary_res <- if_else(event_time > eval_time, \"non-event\", binary_res)\n factor(binary_res, levels = c(\"event\", \"non-event\"))\n}\n\nbinary_encoding <- \n dyn_val_pred %>% \n mutate(\n obs_class = time_as_binary_event(event_time, .eval_time),\n pred_class = if_else(.pred_survival >= 1 / 2, \"non-event\", \"event\"),\n pred_class = factor(pred_class, levels = c(\"event\", \"non-event\")),\n )\n\nRemember how observations falling into category 3 are removed from the analysis? 
This means we’ll likely have fewer data points to evaluate as the evaluation time increases. This implies that the variation in the metrics will be considerable as evaluation time goes on. For our simulated validation set:\n\ndyn_val_pred %>% \n summarize(num_usable = sum(!is.na(.weight_censored)), .by = c(.eval_time)) %>% \n ggplot() + \n geom_step(aes(.eval_time, num_usable)) +\n labs(x = \"time\", y = \"number of usable observations\") +\n lims(y = c(0, nrow(sim_val))) +\n theme_bw()\n\n\n\n\n\n\n\n\nKeeping this in mind, let’s look at what happens with the data points we can use. Let’s start with an evaluation time of 1.00. To compute the confusion matrix for a classification problem, we would simply use:\nbinary_encoding %>% \n filter(.eval_time == 1.00) %>% \n conf_mat(truth = obs_class, estimate = pred_class)\nFor censored regression problems, we need to additionally use the censoring weights so we’ll include them via the case_weights argument:\n\nbinary_encoding %>%\n filter(.eval_time == 1.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 0.00000 0.00000\n#> non-event 14.11046 482.54963\n\nThe values in the cells are the sum of the censoring weights. There are 14 actual events (out of 492 usable observations) before this evaluation time, so there are empty cells. Also note that the cell values are close to the actual counts. This early, the predicted censoring probabilities are very close to one so their inverse values are also close to one.\nThis early, performance looks very good but that is mostly because there are few events.\nLet’s shift to an evaluation time of 5.0.\n\nbinary_encoding %>%\n filter(.eval_time == 5.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 112.98854 54.36531\n#> non-event 56.14133 252.41039\n\nNow we have fewer total observations to consider (391 instead of 492 usable values) and more events (154 this time). Performance is fairly good; the sensitivity is 66.8% and the specificity is 82.3%.\nWhat happens when the evaluation time is 17?\n\nbinary_encoding %>%\n filter(.eval_time == 17.00) %>%\n conf_mat(truth = obs_class,\n estimate = pred_class,\n case_weights = .weight_censored)\n#> Truth\n#> Prediction event non-event\n#> event 429.9920 123.4458\n#> non-event 0.0000 0.0000\n\nThe data are overwhelmingly events. Also, the censoring weights are larger now since the probability of being censored is very low. The mean censoring weight is 1.96.\nThis concludes the illustration of how to account for censoring when using a confusion matrix for performance assessment. There’s more on dynamic performance metrics in the Dynamic Performance Metrics for Event Time Data article." + }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#summary", + "href": "learn/statistics/survival-metrics-details/index.html#summary", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Summary", + "text": "Summary\nWhen accounting for censoring in dynamic performance metrics, the main points to remember are:\n\nEvent time data can be converted to a binary format.\nSome data points cannot be used in the calculations.\nTo properly estimate statistical quantities, we weight the computations by the inverse of the probability of being censored.\ntidymodels currently assumes non-informative censoring." 
+ }, + { + "objectID": "learn/statistics/survival-metrics-details/index.html#session-info", + "href": "learn/statistics/survival-metrics-details/index.html#session-info", + "title": "Accounting for Censoring in Performance Metrics for Event Time Data", + "section": "Session information", + "text": "Session information\n\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.3.0 (2023-04-21)\n#> os macOS Ventura 13.4\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2023-07-02\n#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.4 2023-03-11 [1] CRAN (R 4.3.0)\n#> censored * 0.2.0.9000 2023-07-02 [1] Github (tidymodels/censored@f9eccb6)\n#> dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)\n#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)\n#> ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.3.0)\n#> parsnip * 1.1.0.9003 2023-07-02 [1] Github (tidymodels/parsnip@e6278b0)\n#> prodlim * 2023.03.31 2023-04-02 [1] CRAN (R 4.3.0)\n#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)\n#> recipes * 1.0.6 2023-04-25 [1] CRAN (R 4.3.0)\n#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.3.0)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n#> tidymodels * 1.1.0 2023-05-01 [1] CRAN (R 4.3.0)\n#> tune * 1.1.1 2023-04-11 [1] CRAN (R 4.3.0)\n#> workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.0)\n#> yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.0)\n#> \n#> [1] /Users/emilhvitfeldt/Library/R/arm64/4.3/library\n#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────" + }, { "objectID": "learn/statistics/tidy-analysis/index.html", "href": "learn/statistics/tidy-analysis/index.html", diff --git a/installs.R b/installs.R index 5ef62c8f..02caea12 100644 --- a/installs.R +++ b/installs.R @@ -6,6 +6,7 @@ packages <- c( "blogdown", "broom.mixed", "brulee", + "censored", "devtools", "dials", "discrim", @@ -34,6 +35,7 @@ packages <- c( "plsmod", "poissonreg", "probably", + "prodlim", "randomForest", "ranger", "rlang", diff --git a/learn/statistics/survival-metrics-details/figs/RKM-1.svg b/learn/statistics/survival-metrics-details/figs/RKM-1.svg new file mode 100644 index 00000000..5552accd --- /dev/null +++ b/learn/statistics/survival-metrics-details/figs/RKM-1.svg @@ -0,0 +1,1578 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ [SVG markup omitted: reverse Kaplan-Meier step curve; x axis "time" (0 to 20), y axis "probability of being censored" (0.00 to 1.00)] diff --git a/learn/statistics/survival-metrics-details/figs/plot-graf-categories-1.svg b/learn/statistics/survival-metrics-details/figs/plot-graf-categories-1.svg new file mode 100644 index 00000000..9af78453 --- /dev/null +++ b/learn/statistics/survival-metrics-details/figs/plot-graf-categories-1.svg @@ -0,0 +1,212 @@ + [SVG markup omitted: panels at evaluation times 1, 3, and 5; x axis "Time" (0 to 6), y axis "Sample"; legend "Status" with entries "Observation: censored", "Observation: event", "Evaluation: non-event", "Evaluation: event"] diff --git a/learn/statistics/survival-metrics-details/figs/usable-data-1.svg b/learn/statistics/survival-metrics-details/figs/usable-data-1.svg new file mode 100644 index 00000000..e0e78504 --- /dev/null +++ b/learn/statistics/survival-metrics-details/figs/usable-data-1.svg @@ -0,0 +1,82 @@
+200 +300 +400 +500 + + + + + + + + + + + +0 +5 +10 +15 +20 +time +number of usable observations + + diff --git a/learn/statistics/survival-metrics-details/index.qmd b/learn/statistics/survival-metrics-details/index.qmd new file mode 100644 index 00000000..376c2e10 --- /dev/null +++ b/learn/statistics/survival-metrics-details/index.qmd @@ -0,0 +1,406 @@ +--- +title: "Accounting for Censoring in Performance Metrics for Event Time Data" +categories: + - statistical analysis + - survival analysis +type: learn-subsection +weight: 9 +description: | + Learn how tidymodels uses causal inference tools to measure performance of + survival models. +toc: true +toc-depth: 2 +include-after-body: ../../../resources.html +--- + +```{r} +#| label: "setup" +#| include: false +#| message: false +#| warning: false +source(here::here("common.R")) +``` + +```{r} +#| label: "load" +#| include: false +library(tidymodels) +library(sessioninfo) +pkgs <- c("tidymodels", "censored", "prodlim") +theme_set(theme_bw() + theme(legend.position = "top")) +``` + +## Introduction + +`r article_req_pkgs(pkgs)` You'll need the development versions of censored and parsnip. To install these, use + +```r +#install.packages("pak") + +pak::pak(c("tidymodels/censored", "tidymodels/parsnip")) +``` + +One trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics. + +Many dynamic metrics are similar to those used for binary classification models. Examples include the Brier score and ROC curves (see the [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article for details). The basic idea is that, for a given time $t$ for model evaluation, we try to encode the observed event time data into a binary "has there been an event at time $t$?" version. We also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring. + +Censoring plays into the details of the conversion and is additionally captured in the form of weights. This article covers both those aspects in detail to complement the [main article](../survival-metrics/) on performance metrics for event time data. + +To start, let's define the various types of times that will be mentioned: + +- Observed time: time recorded in the data +- Event time: observed times for actual events +- Evaluation time: the time, specified by the analyst, that the model is evaluated. + +## Example data + +As an example, we'll simulate some data with the prodlim package, using the methods of [Bender _et al_ (2005)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Generating+survival+times+to+simulate+Cox+proportional+hazards+models.%22&btnG=). A training and a validation set are simulated. 
We'll also load the censored package so that we can fit a model to these time-to-event data: + +```{r} +#| label: data +library(tidymodels) +library(censored) +library(prodlim) + +set.seed(5882) +sim_dat <- SimSurv(2000) %>% + mutate(event_time = Surv(time, event)) %>% + select(event_time, X1, X2) + +set.seed(2) +split <- initial_split(sim_dat) +sim_tr <- training(split) +sim_val <- testing(split) +``` + +We'll need a model to illustrate the code and concepts. Let's fit a bagged survival tree model to the training set: + +```{r} +#| label: bag-tree-fit +set.seed(17) +bag_tree_fit <- + bag_tree() %>% + set_mode("censored regression") %>% + set_engine("rpart") %>% + fit(event_time ~ ., data = sim_tr) +bag_tree_fit +``` + +Using this model, we can make predictions of different types and `augment()` provides us with a version of the data augmented with the various predictions. Here we are interested in the predicted probability of survival at different evaluation time points. The largest event time in the training set is `r round(max(sim_tr$event_time[,1]), 3)` so we will use a set of evaluation times between zero and 21. + +```{r} +#| label: val-pred +time_points <- seq(0, 21, by = 0.25) + +val_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points) +val_pred +``` + +The observed data are in the `event_time` column. The predicted survival probabilities are in the `.pred` column. This is a list column with a data frame for each observation, containing the predictions at the `r length(time_points)` evaluation time points in the (nested) column `.pred_survival`. + +```{r} +#| label: val-pred-dynamic +val_pred$.pred[[1]] +``` + +First, let's dive into how to convert the observed event time in `event_time` to a binary version. Then we will discuss the remaining columns as part of generating the required weights for the dynamic performance metrics. + + +## Converting censored data to binary data + +To assess model performance at evaluation time $t$, we turn the observed event time data into a binary “was there an event at time $t$?” version. For this, we follow the process described by [Graf _et al_ (1999)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Assessment+and+Comparison+of+Prognostic+Classification+Schemes+for+Survival+Data.%22&btnG=) where observations are categorized into three groups, at evaluation time $t$. + + - **Category 1 - Events**: Evaluation time is greater than or equal to the event time ("it has already happened"). + - **Category 2 - Non-events**: Evaluation time is less than the observed time, censored or not ("nothing has happened yet"). + - **Category 3 - Ambiguous outcomes**: Evaluation time is greater than or equal to the observed censored time ("we don't know if anything might have happened by now"). + +We can use binary versions of the observations in the first two categories to compute binary performance metrics, but the observations in the third category are not used directly in these calculations. (They do influence the calculation of the weights, see next section.) So our usable sample size changes with the evaluation time. 
+ +```{r} +#| label: plot-graf-categories +#| echo: false +#| warning: false +#| fig.width: 9 +obs_time <- c(4, 2) +obs_status <- c("censored", "event") + +df1 <- tibble::tibble( + obs_id = 1:2, + obs_time = obs_time, + obs_status = obs_status, + eval_time = 1, + eval_status = c("censored", "censored") +) +df2 <- tibble::tibble( + obs_id = 1:2, + obs_time = obs_time, + obs_status = obs_status, + eval_time = 3, + eval_status = c("censored", "event") +) +df3 <- tibble::tibble( + obs_id = 1:2, + obs_time = obs_time, + obs_status = obs_status, + eval_time = 5, + eval_status = c(NA, "event") +) +df <- bind_rows(df1, df2, df3) + +pch_dot_empty <- 1 +pch_dot_solid <- 19 +pch_triangle_empty <- 2 +pch_triangle_solid <- 17 + +df %>% + dplyr::mutate( + obs_status = dplyr::if_else(obs_status == "censored", pch_dot_empty, pch_dot_solid), + eval_status = dplyr::if_else(eval_status == "censored", pch_triangle_empty, pch_triangle_solid) + ) %>% + ggplot() + + geom_point(aes(obs_time, obs_id, shape = obs_status, size = I(5))) + + geom_segment(aes(x = rep(0, 6), y = obs_id, xend = obs_time, yend = obs_id)) + + geom_vline(aes(xintercept = eval_time, col = I("red"), linetype = I("dashed"), linewidth = I(0.8))) + + geom_point(aes(eval_time, obs_id, shape = eval_status, col = I("red"), size = I(5))) + + scale_shape_identity("Status", + labels = c("Observation: censored", "Observation: event", + "Evaluation: non-event", "Evaluation: event"), + breaks = c(1, 19, 2, 17), + guide = "legend") + + scale_x_continuous(limits = c(0, 7)) + + scale_y_continuous(limits = c(0.5, 2.5)) + + labs(x = "Time", y = "Sample") + + theme_bw() + + theme(axis.text.y = element_blank(), legend.position = "top") + + facet_grid(~ eval_time) +``` + +## Censoring weights + +Unfortunately, this categorization scheme alone is not sufficient to compute metrics. Graf _et al_ took a page from the causal inference literature and added a propensity-type score based on the likelihood that each data point would be censored (regardless of the observed event status). This is not the probability that the original time-to-event data point is censored but rather the probability that at evaluation time, we have not observed an event (or a censoring) yet, i.e., that the data point falls into category 2. + +How do we compute this probability? The standard approach is to compute a "reverse Kaplan-Meier" (RKM) curve. Ordinarily, the Kaplan-Meier (KM) curve models the probability of survival. For the reverse Kaplan-Meier curve, the meaning of the status indicator is flipped, i.e., the event of interest changes from "death" to "censoring". This should give us a reasonably reliable non-parametric model for estimating the probability of being censored at a given time. + +Every time a censored regression model is created using tidymodels, the RKM is estimated on the same data used to fit the model and attached to the parsnip object. 
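+
+As a quick aside, the reverse Kaplan-Meier idea can be sketched outside of tidymodels with the survival package by flipping the status indicator so that censoring becomes the "event." This is only an illustration of the concept (the `Surv` column indexing and the example time point below are our own choices); tidymodels computes and stores this estimate for you.
+
+```r
+library(survival)
+
+# Flip the status so that "censored" is treated as the event of interest
+rkm_fit <- survfit(Surv(event_time[, 1], 1 - event_time[, 2]) ~ 1, data = sim_tr)
+
+# Estimated probability of not yet being censored at, say, evaluation time 5
+summary(rkm_fit, times = 5)$surv
+```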
+ +For our simulated data, here is what the RKM curve looks like: + +```{r} +#| label: RKM +#| echo: false +dyn_val_pred <- + val_pred %>% + select(.pred, event_time) %>% + add_rowindex() %>% + unnest(.pred) + +dyn_val_pred %>% + select(.weight_time, .pred_censored) %>% + distinct() %>% + ggplot() + + geom_step(aes(.weight_time, .pred_censored)) + + ylim(0:1) + + geom_rug( + data = sim_tr %>% filter(event_time[,2] == 1), + aes(x = event_time[,1]), + alpha = 1 / 2, col = "red" + ) + + geom_rug( + data = sim_tr %>% filter(event_time[,2] == 0), + aes(x = event_time[,1]), + alpha = 1 / 2, col = "blue", sides = "t" + ) + + labs(x = "time", y = "probability of being censored") + + theme_bw() +``` + +The red rug on the bottom shows the training point event times and the blue values at the top represent the times for the censored training set observations. As (evaluation) time increases, we pass more and more observed time points, and the probability of being censored, i.e., the probability of an observation falling into category 2, decreases. + +The weights used in the calculation of the dynamic performance metrics are the inverse of these probabilities, hence the name "inverse probability of censoring weights" (IPCW). These weights should theoretically balance the exposure/effect/influence that the definitive observations have on performance calculations. + +### The finer details + +First, when do we evaluate the probability of censoring? There are different approaches used in the literature, and we follow what Graf _et al_ suggest (as it seems most appropriate): + +- If the evaluation time is less than the observed time (like in category 2), the evaluation time is used to predict the probability of censoring. +- If the evaluation time is greater than or equal to the event time (like in category 1), the event time is used to predict the probability of censoring. +- If the evaluation time is greater than or equal to the observed censoring time, the observation falls into category 3 and is not used, i.e., no weight is needed. + +We call this time at which to predict the probability of censoring the _weight time_. Here's an example using the first data point in the validation set: + +```{r} +#| label: eval-time-censored +dyn_val_pred <- + val_pred %>% + select(.pred, event_time) %>% + add_rowindex() %>% + unnest(.pred) + +dyn_val_pred %>% + filter(.row == 1 & .eval_time %in% c(1, 2, 4, 5, 10)) %>% + select(event_time, .eval_time, .weight_time, .pred_censored, .weight_censored) +``` + +```{r} +#| label: example-time-hide +#| include: false +time_1 <- parsnip::.extract_surv_time(sim_val$event_time[1]) +``` + +This observation was an event, observed at time `r round(time_1, 3)`. The column `.weight_time` captures at which time the probability of censoring was calculated. Up until that event time, the probability of being censored is computed at the evaluation time. After that, it is based on the event time. + +We add a slight modification to the weight time: If our evaluation time is today, we don't have today's data yet. In tidymodels, we calculate the probability of censoring just before the requested weight time. We are basically subtracting a small numerical value from the weight time used in the RKM model. You'll only really see a difference if there is a bulk of censored observations at the original evaluation time. + +Finally, we use a simple RKM curve (i.e., no covariates or strata). This implies that there is non-informative censoring. 
Other applications of IPCW try to mitigate the effects of informative censoring. In the future, we may allow the censoring model to include covariates (as well as models beyond the RKM). + +## Illustration: Confusion matrix + +To illustrate how these two tools for accounting for censoring are used in calculating dynamic performance metrics, we'll take a look here at the 2x2 confusion matrices at a few evaluation time points. More details on performance metrics for censored data are in the aforementioned [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article. + +First, let's turn the observed event time data and the predictions into their binary versions. + +```{r} +#| label: binary-encoding +time_as_binary_event <- function(surv, eval_time) { + event_time <- .extract_surv_time(surv) + status <- .extract_surv_status(surv) + is_event_before_t <- event_time <= eval_time & status == 1 + + # Three possible contributions to the statistic from Graf 1999 + # Censoring time before eval_time, no contribution (Graf category 3) + binary_res <- rep(NA_character_, length(event_time)) + + # A real event prior to eval_time (Graf category 1) + binary_res <- if_else(is_event_before_t, "event", binary_res) + + # Observed time greater than eval_time (Graf category 2) + binary_res <- if_else(event_time > eval_time, "non-event", binary_res) + factor(binary_res, levels = c("event", "non-event")) +} + +binary_encoding <- + dyn_val_pred %>% + mutate( + obs_class = time_as_binary_event(event_time, .eval_time), + pred_class = if_else(.pred_survival >= 1 / 2, "non-event", "event"), + pred_class = factor(pred_class, levels = c("event", "non-event")), + ) +``` + +Remember how observations falling into category 3 are removed from the analysis? This means we'll likely have fewer data points to evaluate as the evaluation time increases. This implies that the variation in the metrics will be considerable as evaluation time goes on. For our simulated validation set: + +```{r} +#| label: usable-data +dyn_val_pred %>% + summarize(num_usable = sum(!is.na(.weight_censored)), .by = c(.eval_time)) %>% + ggplot() + + geom_step(aes(.eval_time, num_usable)) + + labs(x = "time", y = "number of usable observations") + + lims(y = c(0, nrow(sim_val))) + + theme_bw() +``` + +Keeping this in mind, let's look at what happens with the data points we can use. Let's start with an evaluation time of 1.00. To compute the confusion matrix for a classification problem, we would simply use: + +```r +binary_encoding %>% + filter(.eval_time == 1.00) %>% + conf_mat(truth = obs_class, estimate = pred_class) +``` + +For censored regression problems, we need to additionally use the censoring weights so we'll include them via the `case_weights` argument: + +```{r} +#| label: conf-mat-01 +binary_encoding %>% + filter(.eval_time == 1.00) %>% + conf_mat(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) +``` + + +```{r} +#| label: conf-mat-01-hide +#| include: false +dat_01 <- binary_encoding %>% filter(.eval_time == 1.00 & !is.na(.weight_censored)) +events_01 <- nrow(dat_01 %>% filter(event_time[,1] <= 1)) +``` + +The values in the cells are the sum of the censoring weights. There are `r events_01` actual events (out of `r nrow(dat_01)` usable observations) before this evaluation time, so there are empty cells. Also note that the cell values are close to the actual counts. This early, the predicted censoring probabilities are very close to one so their inverse values are also close to one. 
+ +This early, performance looks very good but that is mostly because there are few events. + +Let's shift to an evaluation time of 5.0. + +```{r} +#| label: conf-mat-05 +binary_encoding %>% + filter(.eval_time == 5.00) %>% + conf_mat(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) +``` + +```{r} +#| label: conf-mat-05-hide +#| include: false +dat_05 <- binary_encoding %>% filter(.eval_time == 5.00 & !is.na(.weight_censored)) +events_05 <- nrow(dat_05 %>% filter(event_time[,1] <= 5)) +cls_set <- metric_set(accuracy, sens, spec) +stats_05 <- + binary_encoding %>% + filter(.eval_time == 5.00) %>% + cls_set(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) +``` + +Now we have fewer total observations to consider (`r nrow(dat_05)` instead of `r nrow(dat_01)` usable values) and more events (`r events_05` this time). Performance is fairly good; the sensitivity is `r round(stats_05$.estimate[2] * 100, 1)`% and the specificity is `r round(stats_05$.estimate[3] * 100, 1)`%. + +What happens when the evaluation time is 17? + +```{r} +#| label: conf-mat-17 +binary_encoding %>% + filter(.eval_time == 17.00) %>% + conf_mat(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) +``` + +```{r} +#| label: conf-mat-17-hide +#| include: false +dat_17 <- binary_encoding %>% filter(.eval_time == 17.00 & !is.na(.weight_censored)) +events_17 <- nrow(dat_17 %>% filter(event_time[,1] <= 17)) +stats_17 <- + binary_encoding %>% + filter(.eval_time == 17.00) %>% + cls_set(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) +``` + +The data are overwhelmingly events. Also, the censoring weights are larger now since the probability of being censored is very low. The mean censoring weight is `r round(mean(dat_17$.weight_censored), 2)`. + +This concludes the illustration of how to account for censoring when using a confusion matrix for performance assessment. There's more on dynamic performance metrics in the [Dynamic Performance Metrics for Event Time Data](../survival-metrics/) article. + +## Summary + +When accounting for censoring in dynamic performance metrics, the main points to remember are: + +* Event time data can be converted to a binary format. +* Some data points cannot be used in the calculations. +* To properly estimate statistical quantities, we weight the computations by the inverse of the probability of being censored. +* tidymodels currently assumes non-informative censoring. 
+ + +## Session information {#session-info} + +```{r} +#| label: "si" +#| echo: false +small_session(pkgs) +``` + diff --git a/learn/statistics/survival-metrics/figs/brier-scores-1.svg b/learn/statistics/survival-metrics/figs/brier-scores-1.svg new file mode 100644 index 00000000..1693b1e4 --- /dev/null +++ b/learn/statistics/survival-metrics/figs/brier-scores-1.svg @@ -0,0 +1,168 @@ + [SVG markup omitted: Brier score over time; x axis "time" (0 to 20), y axis "Brier score" (0.00 to 0.25)] diff --git a/learn/statistics/survival-metrics/figs/roc-5-1.svg b/learn/statistics/survival-metrics/figs/roc-5-1.svg new file mode 100644 index 00000000..c85ee4e0 --- /dev/null +++ b/learn/statistics/survival-metrics/figs/roc-5-1.svg @@ -0,0 +1,88 @@ + [SVG markup omitted: ROC curve; x axis "1 - specificity", y axis "sensitivity", both 0.00 to 1.00] diff --git a/learn/statistics/survival-metrics/figs/roc-scores-1.svg b/learn/statistics/survival-metrics/figs/roc-scores-1.svg new file mode 100644 index 00000000..06015ad7 --- /dev/null +++ b/learn/statistics/survival-metrics/figs/roc-scores-1.svg @@ -0,0 +1,165 @@ + [SVG markup omitted: ROC AUC over time; x axis "time" (0 to 20), y axis "ROC AUC" (0.5 to 0.9)] diff --git a/learn/statistics/survival-metrics/figs/surv-hist-05-1.svg b/learn/statistics/survival-metrics/figs/surv-hist-05-1.svg new file mode 100644 index 00000000..5aa4e8ab --- /dev/null +++ b/learn/statistics/survival-metrics/figs/surv-hist-05-1.svg @@ -0,0 +1,190 @@ + [SVG markup omitted: weighted histograms of survival probability, faceted by "non-event" and "event"; x axis "probability of survival" (0.00 to 1.00), y axis "sum of weights" (0 to 75)] diff --git a/learn/statistics/survival-metrics/index.qmd b/learn/statistics/survival-metrics/index.qmd new file mode 100644 index 00000000..804ecc31 --- /dev/null +++ b/learn/statistics/survival-metrics/index.qmd @@ -0,0 +1,333 @@ +--- +title: "Dynamic Performance Metrics for Event Time Data" +categories: + - statistical analysis + - survival analysis +type: learn-subsection +weight: 9 +description: | + Let's discuss how to compute modern performance metrics for time-to-event models. +toc: true +toc-depth: 2 +include-after-body: ../../../resources.html +--- + +```{r} +#| label: "setup" +#| include: false +#| message: false +#| warning: false +source(here::here("common.R")) +``` + +```{r} +#| label: "load" +#| include: false +library(tidymodels) +library(sessioninfo) +pkgs <- c("tidymodels", "censored", "prodlim") +theme_set(theme_bw() + theme(legend.position = "top")) +``` + +## Introduction + +`r article_req_pkgs(pkgs)` You'll need the development versions of censored and parsnip. 
To install these, use + +```r +#install.packages("pak") + +pak::pak(c("tidymodels/censored", "tidymodels/parsnip")) +``` + +One trend in modern survival analysis is to compute time-dependent measures of performance. These are primarily driven by an increased focus on predictions for the probability of survival at a given time (as opposed to the predictions of event times or linear predictors). Since these are conditional on the time of evaluation, we call them dynamic performance metrics. + +Many dynamic metrics are similar to those used for binary classification models. The basic idea is that, for a given time $t$ for model evaluation, we try to encode the observed event time data into a binary "has there been an event at time $t$?" version. We can also convert the predicted survival probabilities into predicted events/non-events based on a threshold (default is 0.50). The survival versions of these metrics need those binary versions of observed truth and predictions as well as a way to account for censoring. + +Censoring plays into the details of the conversion and is additionally captured in the form of weights. For details on both these aspects, see the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article. + +To start, let's define the various types of times that will be mentioned: + +- Observed time: time recorded in the data +- Event time: observed times for actual events +- Evaluation time: the time, specified by the analyst, that the model is evaluated. + +## Example data + +As an example, we'll simulate some data with the prodlim package, using the methods of [Bender _et al_ (2005)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Generating+survival+times+to+simulate+Cox+proportional+hazards+models.%22&btnG=). A training and a validation set are simulated. We'll also load the censored package so that we can fit a model to these time-to-event data: + +```{r} +#| label: data +library(tidymodels) +library(censored) +library(prodlim) + +set.seed(5882) +sim_dat <- SimSurv(2000) %>% + mutate(event_time = Surv(time, event)) %>% + select(event_time, X1, X2) + +set.seed(2) +split <- initial_split(sim_dat) +sim_tr <- training(split) +sim_val <- testing(split) + +## Resampling object +sim_rs <- vfold_cv(sim_tr) +``` + +We'll need a model to illustrate the code and concepts. Let's fit a bagged survival tree model to the training set: + +```{r} +#| label: bag-tree-fit +set.seed(17) +bag_tree_fit <- + bag_tree() %>% + set_mode("censored regression") %>% + set_engine("rpart") %>% + fit(event_time ~ ., data = sim_tr) +bag_tree_fit +``` + +Using this model, we'll make predictions of different types. + +## Survival Probability Prediction + +This censored regression model can make static predictions via the predicted event time using `predict(object, type = "time")`. It can also create dynamic predictions regarding the probability of survival for each data point at specific times. The syntax for this is + +```r +predict(object, new_data, type = "survival", eval_time = numeric()) +``` + +where `eval_time` is a vector of time points at which we want the corresponding survivor function estimates. Alternatively, we can use the `augment()` function to get both types of prediction and automatically attach them to the data. + +The largest event time in the training set is `r round(max(sim_tr$event_time[,1]), 3)` so we will use a set of evaluation times between zero and 21. 
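+
+For reference, the direct `predict()` interface described above would look something like the following. This is just an unevaluated sketch with a few example times; the `augment()` call below is what we actually use.
+
+```r
+# Predicted survival probabilities at three example evaluation times
+predict(bag_tree_fit, new_data = sim_val, type = "survival", eval_time = c(1, 5, 10))
+```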
+ +```{r} +#| label: val-pred +time_points <- seq(0, 21, by = 0.25) + +val_pred <- augment(bag_tree_fit, sim_val, eval_time = time_points) +val_pred +``` + +The observed data are in the `event_time` column. The predicted survival probabilities are in the `.pred` column. This is a list column with a data frame for each observation, containing the predictions at the `r length(time_points)` evaluation time points in the (nested) column `.pred_survival`. + +```{r} +#| label: val-pred-dynamic +val_pred$.pred[[1]] +``` + +The yardstick package currently has two dynamic metrics. Each is described below. + +## Brier Score + +The Brier score is a metric that can be used with both classification and event-time models. For classification models, it computes the squared error between the observed outcome (encoded as 0/1) and the corresponding predicted probability for the class. + +A little math: suppose that the value $y_{ik}$ is a 0/1 indicator for whether the observed outcome $i$ corresponds to class $k$, and $\hat{p}_{ik}$ is the estimated class probability. The classification score is then: + +$$ +Brier_{class} = \frac{1}{N}\sum_{i=1}^N\sum_{k=1}^C (y_{ik} - \hat{p}_{ik})^2 +$$ + +For survival models, we transform the event time data into a binary version $y_{it}$: is there an event at evaluation time $t$^[Again, see the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article for more on this.]. The survival function estimate $\hat{p}_{it}$ is the probability that corresponds to non-events at time $t$. For example, if there has not been an event at the current evaluation time, our best model should estimate the survival probability to be near one. For observations that are events, the probability estimate is just one minus the survivor estimate. To account for censoring, we also weight each observation with $w_{it}$. The [time-dependent Brier score](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Assessment+and+Comparison+of+Prognostic+Classification+Schemes+for+Survival+Data.%22&btnG=) is: + +$$ +Brier_{surv}(t) = \frac{1}{N}\sum_{i=1}^N w_{it}\left[\underbrace{I(y_{it} = 0)(y_{it} - \hat{p}_{it})^2}_\text{non-events} + \underbrace{I(y_{it} = 1)(y_{it} - (1 - \hat{p}_{it}))^2}_\text{events}\right] +$$ + +For this score, a perfect model has a score of zero, while an uninformative model would have a score of around 1/4. + +How do we compute this using the yardstick package? The function [`brier_survival()`](https://yardstick.tidymodels.org/reference/brier_survival.html) follows the same convention as the other metric functions. The main arguments are: + +- `data`: the data frame with the predictions (structured as the output produced by `augment()`, as shown above). +- `truth`: the name of the column with the `Surv` object. +- `...`: the name of the column with the dynamic predictions. Within tidymodels, this column is always called `.pred`. In other words, `.pred` should be passed without an argument name. + +Since the evaluation times and the case weights are within the `.pred` column, there is no need to specify these. 
Here are the results of our validation set: + +```{r} +#| label: val-pred-brier +brier_scores <- + val_pred %>% + brier_survival(truth = event_time, .pred) +brier_scores +``` + +Over time: + +```{r} +#| label: brier-scores +brier_scores %>% + ggplot(aes(.eval_time, .estimate)) + + geom_hline(yintercept = 1 / 4, col = "red", lty = 3) + + geom_line() + + geom_point() + + labs(x = "time", y = "Brier score") +``` + +There is also an _integrated_ Brier score. It also requires evaluation times as inputs but, instead of returning a result for each time, it computes the area under the curve in the plot shown above. The syntax is the same but the result has a single row: + +```{r} +#| label: val-pred-brier-int +val_pred %>% brier_survival_integrated(truth = event_time, .pred) +``` + +Again, smaller values are better. + +## Receiver Operating Characteristic (ROC) Curves + +When we turn not only the event time data but also the predicted probabilities into a binary representation, we are in well-charted classification metrics territory. Sensitivity and specificity are common quantities to compute; here we use their weighted versions to account for censoring: + +- Sensitivity: How well do we predict the events? This is analogous to the true positive rate. +- Specificity: How well do we predict the non-events? One minus specificity is the false positive rate. + +These depend on the threshold used to turn predicted probabilities into predicted events/non-events. Let's take a look at the distribution of the survival probabilities for our example data at an evaluation time of 5.00. The distributions are separated by the observed class and weighted by the censoring weights. Details of both aspects are the same as for the Brier score and can be found in the [Accounting for Censoring in Performance Metrics for Event Time Data](../survival-metrics-details) article.
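+
+To give a sense of what goes into the plot below, here is a minimal sketch of extracting the survival probabilities and censoring weights at that evaluation time from the nested predictions (the object name `probs_at_5` is just for illustration; turning the observed times into event/non-event classes is described in the linked article):
+
+```r
+probs_at_5 <-
+  val_pred %>%
+  select(event_time, .pred) %>%
+  unnest(.pred) %>%
+  filter(.eval_time == 5.00)
+```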
+ +```{r} +#| label: data-at-5 +#| include: false +time_as_binary_event <- function(surv, eval_time) { + event_time <- .extract_surv_time(surv) + status <- .extract_surv_status(surv) + is_event_before_t <- event_time <= eval_time & status == 1 + + # Three possible contributions to the statistic from Graf 1999 + # Censoring time before eval_time, no contribution (Graf category 3) + binary_res <- rep(NA_character_, length(event_time)) + + # A real event prior to eval_time (Graf category 1) + binary_res <- if_else(is_event_before_t, "event", binary_res) + + # Observed time greater than eval_time (Graf category 2) + binary_res <- if_else(event_time > eval_time, "non-event", binary_res) + factor(binary_res, levels = c("event", "non-event")) +} + +# Unnest the list columns and convert the event time data to binary format +binary_encoding <- + val_pred %>% + select(.pred, event_time) %>% + add_rowindex() %>% + unnest(.pred) %>% + mutate( + obs_class = time_as_binary_event(event_time, .eval_time), + pred_class = if_else(.pred_survival >= 1 / 2, "non-event", "event"), + pred_class = factor(pred_class, levels = c("event", "non-event")) + ) + +data_at_5 <- + binary_encoding %>% + filter(.eval_time == 5.00 & !is.na(.weight_censored)) %>% + select(.eval_time, .pred_survival, .weight_censored, obs_class, pred_class, event_time) + +``` + +```{r} +#| label: surv-hist-05 +#| echo: false +#| warning: false +#| out-width: 70% +#| fig-width: 7 +#| fig-height: 7 +data_at_5 %>% + ggplot(aes(x = .pred_survival, weight = .weight_censored)) + + geom_vline(xintercept = 1 / 2, col = "blue", lty = 2) + + geom_histogram(col = "white", bins = 30) + + facet_wrap(~obs_class, ncol = 1) + + lims(x = 0:1) + + labs(x = "probability of survival", y = "sum of weights") + + theme_bw() +``` + + +```{r} +#| label: conf-mat-05-hide +#| include: false +cls_set <- metric_set(accuracy, sens, spec) +stats_05 <- + data_at_5 %>% + mutate( + pred_class = if_else(.pred_survival >= 1 / 2, "non-event", "event"), + pred_class = factor(pred_class, levels = c("event", "non-event")) + ) %>% + cls_set(truth = obs_class, + estimate = pred_class, + case_weights = .weight_censored) + +pred_05 <- augment(bag_tree_fit, sim_val, eval_time = 5) + +curve_05 <- pred_05 %>% roc_curve_survival(truth = event_time, .pred) +auc_05 <- pred_05 %>% roc_auc_survival(truth = event_time, .pred) +``` + +More probability values are to the right of the 50% cutoff for the true non-events. Conversely, true events tend to have smaller probabilities. Using this cutoff, the sensitivity would be `r round(stats_05$.estimate[2] * 100, 1)`% and the specificity would be `r round(stats_05$.estimate[3] * 100, 1)`%. There are other possible cutoffs for the survival probabilities. Maybe one of these would have better statistics. + +ROC curves were designed as a general method that, given a collection of continuous predictions, determines an effective threshold such that values above the threshold indicate a specific event. For our purposes, the ROC curve will compute the sensitivity and specificity for _all possible_ probability thresholds. It then plots the true positive rate versus the false positive rate. Generally, we use the area under the ROC curve to quantify it with a single number. Values near one indicate a perfect model, while values near 1/2 occur with non-informative models. 
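+
+In yardstick, the curve for a single evaluation time can be computed with `roc_curve_survival()` and the area under it with `roc_auc_survival()`. A minimal sketch (the object name `pred_at_5` is only for illustration):
+
+```r
+pred_at_5 <- augment(bag_tree_fit, sim_val, eval_time = 5.00)
+
+pred_at_5 %>% roc_curve_survival(truth = event_time, .pred)
+pred_at_5 %>% roc_auc_survival(truth = event_time, .pred)
+```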
+ +[Blanche _et al_ (2013)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=%22Review+and+comparison+of+ROC+curve+estimators+for+a+time-dependent+outcome+with+marker-dependent+censoring%22&btnG=) give a good overview of ROC curves for survival analysis; their Section 4.3 is most relevant here. + +For our example at evaluation time $t = 5.00$, the ROC curve is: + +```{r} +#| label: roc-5 +#| echo: false +curve_05 %>% + ggplot(aes(1 - specificity, sensitivity)) + + geom_abline(col = "red", lty = 3) + + geom_step(direction = "vh") + + coord_equal() +``` + +The area under this curve is `r round(auc_05$.estimate[1], 3)`. + +Since this is a dynamic metric, we compute the AUC for each evaluation time. The syntax is very similar to the Brier code shown above: + +```{r} +#| label: val-pred-roc +roc_scores <- + val_pred %>% + roc_auc_survival(truth = event_time, .pred) +roc_scores +``` + +Over time: + +```{r} +#| label: roc-scores +roc_scores %>% + ggplot(aes(.eval_time, .estimate)) + + geom_hline(yintercept = 1 / 2, col = "red", lty = 3) + + geom_line() + + geom_point() + + labs(x = "time", y = "ROC AUC") +``` + +The initial variation is due to the small number of events in the early stages of the analysis. + +The ROC AUC measures the separation between classes, while the Brier score focuses more on accurate and well-calibrated predictions. It should not be surprising that the two metrics' results differ over time. + +## Tuning these metrics + +Many of the event time models available in tidymodels have tuning parameters. The `tune_*()` functions and `fit_resamples()` have an `eval_time` argument used to pass the evaluation times. The statistics are computed for these time points using out-of-sample data. + +In some cases, such as [iterative search](https://www.tmwr.org/iterative-search.html) or [racing methods](https://www.tmwr.org/grid-search.html#racing), the functions need a single value to optimize. If a dynamic metric is used to guide the optimization, _the first evaluation time given by the user_ will be used. + +For example, if a model for these data was being optimized, and we wanted a time of 5.00 to guide the process, we would need to use that value of 5.00 as the first element of `time_points`, the vector given to the `eval_time` argument in our example above. + + +## Summary + +tidymodels has two time-dependent metrics for characterizing the performance of event-time models: + +* The Brier score measures the distance between the observed class result and the predicted probabilities. +* ROC curves try to measure the separation between the two classes based on the survival probabilities. + + +## Session information {#session-info} + +```{r} +#| label: "si" +#| echo: false +small_session(pkgs) +``` +