How to create a descriptive summary table (‘table 1’) using R

Intro

“Table 1”, i.e. a table describing the sample characteristics of an empirical study or clinical trial, is an obligatory part of scientific publications. Since I started using {R} some ten years ago, I have come across several packages and functions that aim to create such a table. In this blog post, I give an overview of two of these packages and show how to use them.

Data

The data.frame I use is named ‘trial’ and is part of the {gtsummary} package. It contains the following variables:

head(gtsummary::trial)
## # A tibble: 6 × 8
##   trt      age marker stage grade response death ttdeath
##   <chr>  <dbl>  <dbl> <fct> <fct>    <int> <int>   <dbl>
## 1 Drug A    23  0.16  T1    II           0     0    24  
## 2 Drug B     9  1.11  T2    I            1     0    24  
## 3 Drug A    31  0.277 T1    II           0     0    24  
## 4 Drug A    NA  2.07  T3    III          1     1    17.6
## 5 Drug A    51  2.77  T4    III          1     1    16.4
## 6 Drug B    39  0.613 T4    I            0     1    15.6

As the output shows, the data.frame contains a treatment variable (“trt”) with two categories (“Drug A” vs. “Drug B”) and several categorical and numerical variables.
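Before building the table, it can help to check the group sizes and the number of missing values per variable, because these figures should reappear in the finished table. The following is a minimal base R sketch; it assumes only that the {gtsummary} package is installed, so that the ‘trial’ data are available. The object names n_per_arm and n_missing are my own.

```r
# Load the example data shipped with {gtsummary}
trial <- gtsummary::trial

# Number of patients per treatment arm
n_per_arm <- table(trial$trt, useNA = "ifany")
n_per_arm

# Number of missing values per variable
n_missing <- colSums(is.na(trial))
n_missing
```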

{finalfit}

According to the package description, the {finalfit} package has the following purposes:

Generate regression results tables and plots in final format for publication. Explore models and export directly to PDF and ‘Word’ using ‘RMarkdown’.

However, with summary_factorlist(), the package also includes a function to create a table with summary statistics.

library(finalfit)
library(dplyr)
tab.ff <- gtsummary::trial %>%
  # convert the 0/1 indicators to factors so that they are summarised as categories
  mutate(across(
    c(response, death),
    ~ factor(.x, levels = c(1, 0), labels = c("yes", "no"))
  )) %>%
  summary_factorlist(
    dependent = "trt", # name of grouping / treatment variable
    explanatory = c("age", "marker", "stage", "response", "death", "ttdeath"),
    total_col = TRUE, # add column with statistics for the whole sample
    add_row_total = TRUE, # add column with number of valid cases
    include_row_missing_col = FALSE,
    na_include = TRUE # make variables' missing data explicit
  )

tab.ff
##                   label     Total N    levels      Drug A      Drug B       Total
##                Age, yrs  189 (94.5) Mean (SD) 47.0 (14.7) 47.4 (14.0) 47.2 (14.3)
##     Marker Level, ng/mL  190 (95.0) Mean (SD)   1.0 (0.9)   0.8 (0.8)   0.9 (0.9)
##                 T Stage 200 (100.0)        T1   28 (28.6)   25 (24.5)   53 (26.5)
##                                            T2   25 (25.5)   29 (28.4)   54 (27.0)
##                                            T3   22 (22.4)   21 (20.6)   43 (21.5)
##                                            T4   23 (23.5)   27 (26.5)   50 (25.0)
##                response  193 (96.5)       yes   28 (28.6)   33 (32.4)   61 (30.5)
##                                            no   67 (68.4)   65 (63.7)  132 (66.0)
##                                     (Missing)     3 (3.1)     4 (3.9)     7 (3.5)
##                   death 200 (100.0)       yes   52 (53.1)   60 (58.8)  112 (56.0)
##                                            no   46 (46.9)   42 (41.2)   88 (44.0)
##  Months to Death/Censor 200 (100.0) Mean (SD)  20.2 (5.0)  19.0 (5.5)  19.6 (5.3)
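To make clear what the cells of such a table contain, the two kinds of summaries can be reproduced by hand. The following base R sketch is only an illustration and not the code {finalfit} uses internally; the helper names mean_sd() and n_pct() are my own, and the percentages refer to all rows, which is an assumption chosen to match the table above, where missing values are counted as a category.

```r
# Format a numeric vector as "mean (SD)" with one decimal place,
# ignoring missing values
mean_sd <- function(x) {
  sprintf("%.1f (%.1f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
}

# Format the levels of a factor as "n (%)"; the denominator is the
# number of all rows, including rows with missing values
n_pct <- function(x) {
  counts <- table(x, useNA = "no")
  pct <- 100 * as.numeric(counts) / length(x)
  setNames(sprintf("%d (%.1f)", as.integer(counts), pct), names(counts))
}

age <- c(23, 9, 31, NA, 51, 39) # first six ages of the 'trial' data
mean_sd(age)
## [1] "30.6 (15.9)"

stage <- factor(c("T1", "T2", "T1", "T3", "T4", "T4")) # first six stages
n_pct(stage)
```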

Rather than printing a table, the summary_factorlist() function returns a data.frame, which must be processed further and piped to a printing function. The following example shows how these steps may be done using the {labelled} and {kableExtra} packages.

library(finalfit)
library(labelled)
library(dplyr)
# exclude group_rows() to avoid a name clash with {dplyr}
library(kableExtra, exclude = "group_rows")
gtsummary::trial %>%
  # convert the 0/1 indicators to factors so that they are summarised as categories
  mutate(across(
    c(response, death),
    ~ factor(.x, levels = c(1, 0), labels = c("yes", "no"))
  )) %>%
  # Add variable labels
  labelled::set_variable_labels(
    age = "Age [yrs]",
    marker = "Marker Level [ng/mL]",
    stage = "T Stage",
    grade = "Grade",
    response = "Tumor Response",
    death = "Patient Died",
    ttdeath = "Months to Death/Censor"
  ) %>%
  summary_factorlist(
    dependent = "trt", # name of grouping / treatment variable
    explanatory = c("age", "marker", "stage", "response", "death", "ttdeath"),
    total_col = TRUE, # add column with statistics for the whole sample
    add_row_total = TRUE, # add column with number of valid cases
    include_row_missing_col = FALSE,
    na_include = TRUE # make variables' missing data explicit
  ) %>%
  # print the resulting data.frame as a formatted table
  kbl(
    caption = "Baseline characteristics",
    booktabs = TRUE,
    col.names = c(
      " ", "Total N", " ",
      "Drug A", "Drug B", "Total"
    ),
    align = "lrlrrr"
  ) %>%
  kable_classic(full_width = FALSE)
Baseline characteristics

                         Total N                   Drug A        Drug B        Total
Age [yrs]                189 (94.5)    Mean (SD)   47.0 (14.7)   47.4 (14.0)   47.2 (14.3)
Marker Level [ng/mL]     190 (95.0)    Mean (SD)   1.0 (0.9)     0.8 (0.8)     0.9 (0.9)
T Stage                  200 (100.0)   T1          28 (28.6)     25 (24.5)     53 (26.5)
                                       T2          25 (25.5)     29 (28.4)     54 (27.0)
                                       T3          22 (22.4)     21 (20.6)     43 (21.5)
                                       T4          23 (23.5)     27 (26.5)     50 (25.0)
Tumor Response           193 (96.5)    yes         28 (28.6)     33 (32.4)     61 (30.5)
                                       no          67 (68.4)     65 (63.7)     132 (66.0)
                                       (Missing)   3 (3.1)       4 (3.9)       7 (3.5)
Patient Died             200 (100.0)   yes         52 (53.1)     60 (58.8)     112 (56.0)
                                       no          46 (46.9)     42 (41.2)     88 (44.0)
Months to Death/Censor   200 (100.0)   Mean (SD)   20.2 (5.0)    19.0 (5.5)    19.6 (5.3)
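In the first {finalfit} table, ‘response’ and ‘death’ appear under their raw variable names. My reading is that factor() returns a new vector without the label attribute, which is why the labels are set after the conversion in the example above. The following sketch illustrates this with a toy data.frame (hypothetical values, not the ‘trial’ data), using var_label() and set_variable_labels() from {labelled}.

```r
library(labelled)

# Toy data (hypothetical values, not the 'trial' data)
df <- data.frame(response = c(0, 1, NA, 1))
df <- set_variable_labels(df, response = "Tumor Response")
var_label(df$response) # the label is present

# factor() returns a new vector that no longer carries the label
df$response <- factor(df$response, levels = c(1, 0), labels = c("yes", "no"))
label_lost <- is.null(var_label(df$response))
label_lost

# Re-apply the label after the conversion
df <- set_variable_labels(df, response = "Tumor Response")
var_label(df$response)
```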

{gtsummary}

The description of the {gtsummary} package gives the following information:

The package creates presentation-ready tables summarizing data sets, regression models, and more. The code to create the tables is concise and highly customizable. Data frames can be summarized with any function, e.g. mean(), median(), even user-written functions. Regression models are summarized and include the reference rows for categorical variables. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically identified and the tables are pre-filled with appropriate column headers.

For creating a table with summary statistics, the tbl_summary() function is required. In addition, the {dplyr} package should be loaded.

library(gtsummary)
library(dplyr)

The creation of the table works best using the pipe operator:

tbl.gts <- trial %>%
  # categorical variables must be factors or character strings
  mutate(across(
    c(response, death),
    ~ factor(.x, levels = c(1, 0), labels = c("yes", "no"))
  )) %>%
  # apply the tbl_summary() function
  tbl_summary(
    by = trt, # Treatment variable
    label = list(
      age ~ "Age [yrs]",
      marker ~ "Marker Level [ng/mL]",
      stage ~ "T Stage",
      grade ~ "Grade",
      response ~ "Tumor Response",
      death ~ "Patient Died",
      ttdeath ~ "Months to Death/Censor"
    ),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    # may be also ...
    # statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
    digits = all_continuous() ~ 1,
    missing_text = "(Missing)",
    include = everything() # select variables to be included into the table
  ) %>%
  add_overall() %>% # add column with statistics for the whole sample
  add_n() # add column with number of valid cases
tbl.gts
Characteristic          N     Overall, N = 200   Drug A, N = 98¹   Drug B, N = 102¹
Age [yrs]               189   47.2 (14.3)        47.0 (14.7)       47.4 (14.0)
  (Missing)                   11                 7                 4
Marker Level [ng/mL]    190   0.9 (0.9)          1.0 (0.9)         0.8 (0.8)
  (Missing)                   10                 6                 4
T Stage                 200
  T1                          53 (26%)           28 (29%)          25 (25%)
  T2                          54 (27%)           25 (26%)          29 (28%)
  T3                          43 (22%)           22 (22%)          21 (21%)
  T4                          50 (25%)           23 (23%)          27 (26%)
Grade                   200
  I                           68 (34%)           35 (36%)          33 (32%)
  II                          68 (34%)           32 (33%)          36 (35%)
  III                         64 (32%)           31 (32%)          33 (32%)
Tumor Response          193   61 (32%)           28 (29%)          33 (34%)
  (Missing)                   7                  3                 4
Patient Died            200   112 (56%)          52 (53%)          60 (59%)
Months to Death/Censor  200   19.6 (5.3)         20.2 (5.0)        19.0 (5.5)

¹ Statistics presented: mean (SD); n (%)
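The alternative mentioned in the code comment, median and quartiles instead of mean and standard deviation, may be sketched as follows. The selection of variables and the object name tbl.median are my own choices for illustration; the sketch assumes that {gtsummary} and {dplyr} are installed.

```r
library(gtsummary)
library(dplyr)

tbl.median <- trial %>%
  tbl_summary(
    by = trt, # Treatment variable
    include = c(age, marker, ttdeath), # continuous variables only
    # median with first and third quartile instead of mean (SD)
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
    missing_text = "(Missing)"
  )
tbl.median
```

This summary is usually the better choice for skewed variables, whereas mean (SD) suits roughly symmetric ones.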

Since the {gtsummary} package contains functions that convert a {gtsummary} object to the object types required by other popular table-specific R packages, e.g. as_kable_extra(), as_flex_table() etc., {gtsummary} tables can easily be rendered as .docx, .pdf or .html. The following example shows how to print a {gtsummary} object using the {kableExtra} package.

library(kableExtra)
tbl.gts %>%
  as_kable_extra(
    caption = "Baseline characteristics",
    booktabs = TRUE,
    align = "lcccc"
  ) %>%
  kable_classic(full_width = FALSE)
Baseline characteristics

Characteristic          N     Overall, N = 200   Drug A, N = 98¹   Drug B, N = 102¹
Age [yrs]               189   47.2 (14.3)        47.0 (14.7)       47.4 (14.0)
  (Missing)                   11                 7                 4
Marker Level [ng/mL]    190   0.9 (0.9)          1.0 (0.9)         0.8 (0.8)
  (Missing)                   10                 6                 4
T Stage                 200
  T1                          53 (26%)           28 (29%)          25 (25%)
  T2                          54 (27%)           25 (26%)          29 (28%)
  T3                          43 (22%)           22 (22%)          21 (21%)
  T4                          50 (25%)           23 (23%)          27 (26%)
Grade                   200
  I                           68 (34%)           35 (36%)          33 (32%)
  II                          68 (34%)           32 (33%)          36 (35%)
  III                         64 (32%)           31 (32%)          33 (32%)
Tumor Response          193   61 (32%)           28 (29%)          33 (34%)
  (Missing)                   7                  3                 4
Patient Died            200   112 (56%)          52 (53%)          60 (59%)
Months to Death/Censor  200   19.6 (5.3)         20.2 (5.0)        19.0 (5.5)

¹ Statistics presented: mean (SD); n (%)
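For a Word document, the route via {flextable} might look like the following sketch. It assumes that the {flextable} package is installed; the reduced table tbl.small and the file name are hypothetical and only serve to keep the example short.

```r
library(gtsummary)

# A reduced table, just to keep the example short
tbl.small <- tbl_summary(trial, by = trt, include = c(age, stage))

# Convert to a {flextable} object and write it to a .docx file
ft <- as_flex_table(tbl.small)
out_file <- file.path(tempdir(), "table1.docx") # hypothetical file name
flextable::save_as_docx(ft, path = out_file)
```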

Other packages

Other packages I have used in the past are {qwraps2} and {tableone}. However, the {gtsummary} package seems to offer the most convenient way to create a “table one”.
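For comparison, a comparable table with {tableone} may be sketched like this. The choice of variables is mine, and the sketch assumes that {tableone} is installed; test = FALSE suppresses the group comparison tests, which the tables above do not contain either.

```r
library(tableone)

# {tableone} expects a plain data.frame
trial_df <- as.data.frame(gtsummary::trial)

tab.t1 <- CreateTableOne(
  vars = c("age", "marker", "stage", "grade"), # variables to summarise
  strata = "trt", # name of grouping / treatment variable
  data = trial_df,
  test = FALSE # no p-values
)
print(tab.t1, showAllLevels = TRUE)
```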

Author: norbert

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).
