Introduction to summarytools

Dominic Comtois

2022-05-19

  • 1. Overview
  • 2. Frequency Tables: freq()
  • 3. Cross-Tabulations: ctable()
  • 4. Descriptive Statistics: descr()
  • 5. Data Frame Summaries: dfSummary()
  • 6. Grouped Statistics: stby()
  • 7. Grouped Statistics: group_by()
  • 8. Tidy Tables: tb()
  • 9. Directing Output to Files
  • 10. Package Options
  • 11. Format Attributes
  • 12. Fine-Tuning Looks: CSS
  • 13. Shiny Apps
  • 14. Graphs in R Markdown
  • 16. Vignette Setup
  • 17. Conclusion

1. Overview

summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:

| Function | Description |
|---|---|
| freq() | Frequency Tables featuring counts, proportions, cumulative statistics as well as missing data reporting |
| ctable() | Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions |
| descr() | Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion |
| dfSummary() | Data Frame Summaries featuring type-specific information for all variables: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance |

1.1 Motivation

The package was developed with the following objectives in mind:

  • Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical software suites such as SAS, SPSS, and Stata
  • Offer flexibility in terms of output format & content
  • Integrate well with commonly used software & tools for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for standalone, simple report generation using any R interface

1.2 Directing Output

Results can be

  • Displayed in the R console as plain text
  • Rendered as html and shown in a Web browser or in RStudio’s Viewer Pane
  • Written / appended to plain text, markdown, or html files

When creating R Markdown documents, make sure to

  • Use chunk option results="asis"
  • Use the function argument plain.ascii=FALSE
  • Set the style parameter to “rmarkdown”, or “grid” for dfSummary()
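The three settings above can be sketched as follows. This is a minimal illustration that assumes the code sits in a chunk declared with results="asis"; the use of iris is only for demonstration.

```r
library(summarytools)

# Assumed chunk header: {r, results="asis"}
# freq(), ctable() and descr() take style = "rmarkdown";
# dfSummary() would take style = "grid" instead.
ft <- freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")
ft

ds <- descr(iris$Sepal.Length, plain.ascii = FALSE, style = "rmarkdown")
ds
```

Setting these two arguments globally with st_options() avoids repeating them in every call.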

1.3 Other Characteristics

  • Weights-enabled: freq(), ctable() and descr() support sampling weights
  • Multilingual:
    • Built-in translations exist for French, Portuguese, Spanish, Russian, and Turkish. Users can easily add custom translations or modify existing ones as needed
  • Flexible and extensible:
    • The built-in features used to support alternate languages provide a way to modify a great number of terms used in outputs (headings and tables)
    • Pipe operators from magrittr (%>%, %$%) and pipeR (%>>%) are fully supported; the native |> introduced in R 4.1 is supported as well
    • Default values for a good number of function parameters can be modified using st_options() to minimize redundancy in function calls
    • By-group processing is easily achieved using the package’s stby() function, which is a slightly modified version of base::by(), but dplyr::group_by() is also supported
    • Pander options can be used to customize or enhance plain text and markdown tables
    • Base R’s format() parameters are also supported; this can be used to set thousands separator or modify the decimal separator, among several other possibilities (see help("format"))
    • Bootstrap CSS is used by default with html output, and user-defined classes can be added at will
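Two of the characteristics listed above, sampling weights and by-group processing, might look like this in practice. It is a short sketch using the tobacco data frame shipped with the package; samp.wgts is its sampling-weights column, and the choice of variables is arbitrary.

```r
library(summarytools)

# Weighted frequency table: weights are passed as a numeric vector
wf <- freq(tobacco$smoker, weights = tobacco$samp.wgts)
wf

# By-group statistics with stby(), called the same way as base::by()
bmi_by_gender <- with(tobacco,
                      stby(data = BMI, INDICES = gender,
                           FUN = descr, stats = "common"))
bmi_by_gender
```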

2. Frequency Tables: freq()

The freq() function generates frequency tables with counts, proportions, as well as missing data information. Side note: the very idea for creating this package stemmed from the absence of such a function in base R.

freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")

Frequencies

iris$Species
Type: Factor

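| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |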

In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally for this document using st_options(), they are redundant and will be omitted from here on (section 16 contains a detailed description of this vignette’s configuration).

2.1 Missing Data

One of summarytools’ main purposes is to help clean and prepare data for further analysis. But in some circumstances, we don’t need (or already have) information about missing data. Using report.nas = FALSE makes the output table smaller by one row and two columns:

freq(iris$Species, report.nas = FALSE, headings = FALSE)

| | Freq | % | % Cum. |
|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 |
| Total | 150 | 100.00 | 100.00 |

The headings = FALSE parameter suppresses the heading section.

2.2 Simplest Expression

By “switching off” all optional elements, a much simpler table will be produced:

freq(iris$Species, report.nas = FALSE, totals = FALSE, cumul = FALSE, headings = FALSE)

| | Freq | % |
|---|---|---|
| setosa | 50 | 33.33 |
| versicolor | 50 | 33.33 |
| virginica | 50 | 33.33 |

While the output is much simplified, the syntax is not; I blame it on Tesler’s law of conservation of complexity! Thankfully, st_options() is there to accommodate everyone’s preferences (see section on package options).

2.3 Multiple Frequency Tables At Once

To generate frequency tables for all variables in a data frame, we could (and in the earliest versions, needed to) use lapply(). However, this is not required since freq() accepts data frames as the main argument:

freq(tobacco)

To avoid cluttering the results, numerical columns having more than 25 distinct values are ignored. This threshold of 25 can be changed by using st_options(); for example, to change it to 10, we’d use st_options(freq.ignore.threshold = 10).

The tobacco data frame contains simulated data and is included in the package. Another simulated data frame is included: exams. Both have French versions (tabagisme, examens).

2.4 Subsetting (Filtering) Frequency Tables

The rows parameter allows subsetting frequency tables; we can use this parameter in different ways:

  • To filter rows by their order of appearance, we use a numerical vector; rows = 1:10 will show the frequencies for the first 10 values only. To account for the frequencies of unshown values, the “(Other)” row is automatically added
  • To filter rows by name, we can use either
    • a character vector specifying all the row names we wish to keep
    • a single character string, which will be used as a regular expression (see ?regex for more information on this topic)
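The three filtering styles described above can be sketched as follows. The row names used here ("Cancer", "Heart") are taken from the tobacco$disease output shown in this section; the regular expression is just an illustrative choice.

```r
library(summarytools)

# By position: the first three values only
f_pos <- freq(tobacco$disease, rows = 1:3)

# By name: a character vector listing the row names to keep
f_names <- freq(tobacco$disease, rows = c("Cancer", "Heart"))

# By pattern: a single string, used as a regular expression
# (here, row names starting with "H")
f_regex <- freq(tobacco$disease, rows = "^H")

f_pos
f_names
f_regex
```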

Showing The Most Common Values

By combining the order and rows parameters, we can easily filter the results to show, for example, the 5 most common values in a factor:

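freq(tobacco$disease, order = "freq", rows = 1:5, headings = FALSE)

| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| Hypertension | 36 | 16.22 | 16.22 | 3.60 | 3.60 |
| Cancer | 34 | 15.32 | 31.53 | 3.40 | 7.00 |
| Cholesterol | 21 | 9.46 | 40.99 | 2.10 | 9.10 |
| Heart | 20 | 9.01 | 50.00 | 2.00 | 11.10 |
| Pulmonary | 20 | 9.01 | 59.01 | 2.00 | 13.10 |
| (Other) | 91 | 40.99 | 100.00 | 9.10 | 22.20 |
| <NA> | 778 | | | 77.80 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |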

Instead of "freq", we can use "-freq" to reverse the ordering and get results ranked from lowest to highest in frequency.

Notice the “(Other)” row, which is automatically generated.
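As a brief sketch of the reversed ordering mentioned above, the same call with "-freq" lists the least common values first (results not shown):

```r
library(summarytools)

# Lowest frequencies first; rows = 1:5 then keeps the five rarest values
rare <- freq(tobacco$disease, order = "-freq", rows = 1:5, headings = FALSE)
rare
```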

2.5 Collapsible Sections

When generating html results, use the collapse = TRUE argument with print() or view() / stview() to get collapsible sections; clicking on the variable name in the heading section will collapse / reveal the frequency table (results not shown).

view(freq(tobacco), collapse = TRUE)

3. Cross-Tabulations: ctable()

ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.

Using the tobacco simulated data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.

ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r") # Show row proportions

Cross-Tabulation, Row Proportions

smoker * diseased
Data Frame: tobacco

| | diseased | Yes | No | Total |
|---|---|---|---|---|
| smoker | | | | |
| Yes | | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
| No | | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
| Total | | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |

As can be seen, since markdown does not fully support multiline table headings, pander does what it can to display this particular type of table. To get better results, the “render” method is recommended and will be used in the next examples.

3.1 Row, Column, or Total Proportions

Row proportions are shown by default. To display column or total proportions, use prop = "c" or prop = "t", respectively. To omit proportions altogether, use prop = "n".
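The three alternatives to the default can be sketched side by side, using the same smoker and diseased variables as above (results not shown):

```r
library(summarytools)

# Column proportions
ct_col <- with(tobacco, ctable(x = smoker, y = diseased, prop = "c"))

# Total proportions
ct_tot <- with(tobacco, ctable(x = smoker, y = diseased, prop = "t"))

# Counts only, no proportions
ct_none <- with(tobacco, ctable(x = smoker, y = diseased, prop = "n"))

ct_col
ct_tot
ct_none
```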

3.2 Minimal Cross-Tabulations

By “switching off” all optional features, we get a simple “2 x 2” table:

with(tobacco, print(ctable(x = smoker, y = diseased, prop = 'n', totals = FALSE, headings = FALSE), method = "render"))

| | diseased | |
|---|---|---|
| smoker | Yes | No |
| Yes | 125 | 173 |
| No | 99 | 603 |

3.3 Chi-Square (𝛘²), Odds Ratio and Risk Ratio

To display the chi-square statistic, set chisq = TRUE. For 2 x 2 tables, use OR and RR to show odds ratio and risk ratio (also called relative risk), respectively. Those can be set to TRUE, in which case 95% confidence intervals are shown; to use different confidence levels, use for example OR = .90.

Using pipes generally makes it easier to generate ctable() results.

library(magrittr)
tobacco %$%  # Acts like with(tobacco, ...)
  ctable(x = smoker, y = diseased,
         chisq = TRUE,
         OR = TRUE,
         RR = TRUE,
         headings = FALSE) %>%
  print(method = "render")

| | diseased | | |
|---|---|---|---|
| smoker | Yes | No | Total |
| Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
| No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
| Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |

Χ² = 91.7088   df = 1   p = .0000
O.R. (95% C.I.) = 4.40 (3.22 - 6.02)
R.R. (95% C.I.) = 2.97 (2.37 - 3.73)
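For a confidence level other than 95%, the numeric form of OR and RR described above can be used. A minimal sketch with 90% intervals (results not shown):

```r
library(summarytools)

# Passing a number instead of TRUE sets the confidence level
ct90 <- with(tobacco,
             ctable(x = smoker, y = diseased,
                    OR = .90, RR = .90,
                    headings = FALSE))
ct90
```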

4. Descriptive Statistics: descr()

descr() generates descriptive / univariate statistics, i.e. common central tendency statistics and measures of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.

descr(iris)
Non-numerical variable(s) ignored: Species

Descriptive Statistics

iris
N: 150

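| | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
|---|---|---|---|---|
| Mean | 3.76 | 1.20 | 5.84 | 3.06 |
| Std.Dev | 1.77 | 0.76 | 0.83 | 0.44 |
| Min | 1.00 | 0.10 | 4.30 | 2.00 |
| Q1 | 1.60 | 0.30 | 5.10 | 2.80 |
| Median | 4.35 | 1.30 | 5.80 | 3.00 |
| Q3 | 5.10 | 1.80 | 6.40 | 3.30 |
| Max | 6.90 | 2.50 | 7.90 | 4.40 |
| MAD | 1.85 | 1.04 | 1.04 | 0.44 |
| IQR | 3.50 | 1.50 | 1.30 | 0.50 |
| CV | 0.47 | 0.64 | 0.14 | 0.14 |
| Skewness | -0.27 | -0.10 | 0.31 | 0.31 |
| SE.Skewness | 0.20 | 0.20 | 0.20 | 0.20 |
| Kurtosis | -1.42 | -1.36 | -0.61 | 0.14 |
| N.Valid | 150.00 | 150.00 | 150.00 | 150.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 |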

To turn off the variable-type messages, use silent = TRUE. It is possible to set that option globally, which we will do here, so the message won’t be displayed in the remainder of this vignette.

st_options(descr.silent = TRUE)

4.1 Transposing and Selecting Statistics

Results can be transposed by using transpose = TRUE, and statistics can be selected using the stats argument:

descr(iris, stats = c("mean", "sd"), transpose = TRUE, headings = FALSE)

| | Mean | Std.Dev |
|---|---|---|
| Petal.Length | 3.76 | 1.77 |
| Petal.Width | 1.20 | 0.76 |
| Sepal.Length | 5.84 | 0.83 |
| Sepal.Width | 3.06 | 0.44 |

See ?descr for a list of all available statistics. Special values “all”, “fivenum”, and “common” are also valid. The default value is “all”, and it can be modified using st_options():

st_options(descr.stats = "common")

5. Data Frame Summaries: dfSummary()

dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.

To see the results in RStudio’s Viewer (or in the default Web browser if working in another IDE or from a terminal window), use the view() function, or its twin stview() in case of name conflicts:

view(dfSummary(iris))

Be careful to use view() and not View(). If you use the latter, results will be shown in the data viewer.

Also, be mindful of the order in which the packages are loaded. Some packages redefine view() to point to View(); loading summarytools after these packages will ensure its own view() works properly. Otherwise, stview() is always there as a foolproof alternative.

5.1 Using dfSummary() in R Markdown Documents

When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.

dfSummary(tobacco, plain.ascii = FALSE, style = "grid", graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")

The tmp.img.dir parameter is mandatory when generating dfSummaries in R Markdown documents, except for html rendering. The explanation for this can be found further below.

Some users reported repeated X11 warnings; those can be avoided by setting the warning chunk option to FALSE: {r chunk_name, results="asis", warning=FALSE}.

5.2 Optional Statistics

This feature has been requested several times since the package was released. Introduced in version 1.0.0, it provides control over which statistics to show in the Stats/Values column. Namely, the third row, which displays IQR (CV), can be modified to show any available statistics in R. An additional “slot” (unused by default) is also made available. To use this feature, define dfSummary.custom.1 and/or dfSummary.custom.2 using st_options() in the following way, encapsulating the code in an expression():

st_options(
  dfSummary.custom.1 =
    expression(
      paste(
        "Q1 - Q3 :",
        round(
          quantile(column_data, probs = .25, type = 2,
                   names = FALSE, na.rm = TRUE), digits = 1
        ), " - ",
        round(
          quantile(column_data, probs = .75, type = 2,
                   names = FALSE, na.rm = TRUE), digits = 1
        )
      )
    )
)

print(
  dfSummary(iris,
            varnumbers = FALSE,
            na.col = FALSE,
            style = "multiline",
            plain.ascii = FALSE,
            headings = FALSE,
            graph.magnif = .8),
  method = "render"
)

| Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid |
|---|---|---|---|---|
| Sepal.Length [numeric] | Mean (sd) : 5.8 (0.8); min ≤ med ≤ max: 4.3 ≤ 5.8 ≤ 7.9; Q1 - Q3 : 5.1 - 6.4 | 35 distinct values | [histogram] | 150 (100.0%) |
| Sepal.Width [numeric] | Mean (sd) : 3.1 (0.4); min ≤ med ≤ max: 2 ≤ 3 ≤ 4.4; Q1 - Q3 : 2.8 - 3.3 | 23 distinct values | [histogram] | 150 (100.0%) |
| Petal.Length [numeric] | Mean (sd) : 3.8 (1.8); min ≤ med ≤ max: 1 ≤ 4.3 ≤ 6.9; Q1 - Q3 : 1.6 - 5.1 | 43 distinct values | [histogram] | 150 (100.0%) |
| Petal.Width [numeric] | Mean (sd) : 1.2 (0.8); min ≤ med ≤ max: 0.1 ≤ 1.3 ≤ 2.5; Q1 - Q3 : 0.3 - 1.8 | 22 distinct values | [histogram] | 150 (100.0%) |
| Species [factor] | 1. setosa; 2. versicolor; 3. virginica | 50 (33.3%); 50 (33.3%); 50 (33.3%) | [bar chart] | 150 (100.0%) |

If we had used dfSummary.custom.2 instead of dfSummary.custom.1, a fourth row would have been added under the default IQR (CV) row.

Note that instead of round(), it is possible to use the internal format_number(), which ensures the number is formatted according to all specified arguments (rounding digits, decimal mark and thousands mark, etc.). The internal variable round.digits, which contains the value of st_options("round.digits"), can also be used. This is how the default IQR (CV) is defined. Here we set the first custom stat back to its default value and then display its definition (formatR::tidy_source() is used to format / indent the expression):

library(formatR)
st_options(dfSummary.custom.1 = "default")
formatR::tidy_source(
  text = deparse(st_options("dfSummary.custom.1")),
  indent = 2,
  args.newline = TRUE
)

expression(
  paste(
    paste0(
      trs("iqr"), " (", trs("cv"), ") : "
    ),
    format_number(
      IQR(column_data, na.rm = TRUE), round.digits
    ),
    " (", format_number(
      sd(column_data, na.rm = TRUE)/mean(column_data, na.rm = TRUE), round.digits
    ),
    ")", collapse = "", sep = ""
  )
)

Don’t forget to specify na.rm = TRUE for all functions that use this parameter (most base R functions do).

5.3 Other Notable Features

The dfSummary() function also

  • Reports the number of duplicate records in the heading section
  • Detects UPC/EAN codes (barcode numbers) and doesn’t calculate irrelevant statistics for them
  • Detects email addresses and reports counts of valid, invalid and duplicate addresses; note that the proportions of valid and invalid sum up to 100%; the duplicates proportion is calculated independently, which is why in the bar chart (html version), the bar for this category is shown with a different color
  • Allows the display of “windowed” results by using the max.tbl.height parameter; this is especially convenient if the analyzed data frame has numerous variables; see vignette("rmarkdown", package = "summarytools") for more details
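The “windowed” display from the last item could be requested as in the sketch below. The height of 300 pixels is an arbitrary value chosen for illustration, and the call assumes html rendering (results not shown).

```r
library(summarytools)

dfs <- dfSummary(tobacco, varnumbers = FALSE, valid.col = FALSE)

# max.tbl.height is given to print() along with method = "render";
# the table is then confined to a window of the requested height
print(dfs, method = "render", max.tbl.height = 300)
```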

5.4 Excluding Columns

Although most columns can be excluded using the function’s parameters, it is also possible to delete them with the following syntax (results not shown):

dfs <- dfSummary(iris)
dfs$Variable <- NULL # This deletes the "Variable" column

6. Grouped Statistics: stby()

To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():

(iris_stats_by_species <- stby(data = iris, INDICES = iris$Species, FUN = descr, stats = "common", transpose = TRUE))

Descriptive Statistics

iris
Group: Species = setosa
N: 50

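| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 | 50.00 | 100.00 |
| Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 | 50.00 | 100.00 |
| Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 | 50.00 | 100.00 |
| Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 | 50.00 | 100.00 |

Group: Species = versicolor
N: 50

| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 | 50.00 | 100.00 |
| Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 | 50.00 | 100.00 |
| Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 | 50.00 | 100.00 |
| Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 | 50.00 | 100.00 |

Group: Species = virginica
N: 50

| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 | 50.00 | 100.00 |
| Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 | 50.00 | 100.00 |
| Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 | 50.00 | 100.00 |
| Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 | 50.00 | 100.00 |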

6.1 Special Case of descr() with stby()

When used to produce split-group statistics for a single variable, stby() assembles everything into a single table instead of displaying a series of one-column tables.

with(tobacco, stby(data = BMI, INDICES = age.gr, FUN = descr, stats = c("mean", "sd", "min", "med", "max")))

Descriptive Statistics

BMI by age.gr
Data Frame: tobacco
N: 258

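| | 18-34 | 35-50 | 51-70 | 71 + |
|---|---|---|---|---|
| Mean | 23.84 | 25.11 | 26.91 | 27.45 |
| Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 |
| Min | 8.83 | 10.35 | 9.01 | 16.36 |
| Median | 24.04 | 25.11 | 26.77 | 27.52 |
| Max | 34.84 | 39.44 | 39.21 | 38.37 |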

6.2 Using stby() with ctable()

The syntax is a little trickier for this combination, so here is an example (results not shown):

stby(data = list(x = tobacco$smoker, y = tobacco$diseased),
     INDICES = tobacco$gender,
     FUN = ctable)

# or equivalently
with(tobacco,
     stby(data = list(x = smoker, y = diseased),
          INDICES = gender,
          FUN = ctable))

7. Grouped Statistics: group_by()

To create grouped statistics with freq(), descr() or dfSummary(), it is possible to use dplyr’s group_by() as an alternative to stby(). Syntactic differences aside, one key distinction is that group_by() considers NA values on the grouping variable(s) as a valid category, albeit with a warning suggesting the use of forcats::fct_explicit_na to make NA’s explicit in factors. Following this advice, we get:

library(dplyr)
tobacco$gender %<>% forcats::fct_explicit_na()
tobacco %>%
  group_by(gender) %>%
  descr(stats = "fivenum")
Warning: package 'dplyr' was built under R version 4.1.3

Descriptive Statistics

tobacco
Group: gender = F
N: 489

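| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 9.01 | 18.00 | 0.00 | 0.86 |
| Q1 | 22.98 | 34.00 | 0.00 | 0.86 |
| Median | 25.87 | 50.00 | 0.00 | 1.04 |
| Q3 | 29.48 | 66.00 | 10.50 | 1.05 |
| Max | 39.44 | 80.00 | 40.00 | 1.06 |

Group: gender = M
N: 489

| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 8.83 | 18.00 | 0.00 | 0.86 |
| Q1 | 22.52 | 34.00 | 0.00 | 0.86 |
| Median | 25.14 | 49.50 | 0.00 | 1.04 |
| Q3 | 27.96 | 66.00 | 11.00 | 1.05 |
| Max | 36.76 | 80.00 | 40.00 | 1.06 |

Group: gender = (Missing)
N: 22

| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 20.24 | 19.00 | 0.00 | 0.86 |
| Q1 | 24.97 | 36.00 | 0.00 | 1.04 |
| Median | 27.16 | 55.50 | 0.00 | 1.05 |
| Q3 | 30.23 | 64.00 | 10.00 | 1.05 |
| Max | 32.43 | 80.00 | 28.00 | 1.06 |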

8. Tidy Tables: tb()

When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:

library(magrittr)
iris %>%
  descr(stats = "common") %>%
  tb()

# A tibble: 4 x 8
  variable      mean    sd   min   med   max n.valid pct.valid
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Petal.Length  3.76 1.77    1    4.35   6.9     150       100
2 Petal.Width   1.20 0.762   0.1  1.3    2.5     150       100
3 Sepal.Length  5.84 0.828   4.3  5.8    7.9     150       100
4 Sepal.Width   3.06 0.436   2    3      4.4     150       100

iris$Species %>%
  freq(cumul = FALSE, report.nas = FALSE) %>%
  tb()

# A tibble: 3 x 3
  Species     freq   pct
  <fct>      <dbl> <dbl>
1 setosa        50  33.3
2 versicolor    50  33.3
3 virginica     50  33.3

By definition, no total rows are part of tidy tables, and the row names are converted to a regular column.

When displaying tibbles using rmarkdown, the knitr chunk option results should be set to ‘markup’ instead of ‘asis’.

8.1 Tidy Split-Group Statistics

Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.

grouped_descr <- stby(data = exams, INDICES = exams$gender, FUN = descr, stats = "common")
grouped_descr %>% tb()

# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 3 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 4 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
 5 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
 6 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
 7 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
 8 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
 9 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
10 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
11 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

The order parameter controls row ordering:

grouped_descr %>% tb(order = 2)

# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
 3 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
 5 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 6 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
 7 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
 8 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
 9 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
10 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
11 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

Setting order = 3 changes the order of the sort variables exactly as with order = 2, but it also reorders the columns:

grouped_descr %>% tb(order = 3)

# A tibble: 12 x 9
   variable  gender  mean    sd   min   med   max n.valid pct.valid
   <chr>     <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 economics Girl    72.5  7.79  62.3  70.2  89.6      14      93.3
 2 economics Boy     75.2  9.40  60.5  71.7  94.2      15     100
 3 english   Girl    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 english   Boy     77.8  5.94  69.6  77.6  90.2      15     100
 5 french    Girl    71.1 12.4   44.8  68.4  93.7      14      93.3
 6 french    Boy     76.6  8.63  63.2  74.8  94.7      15     100
 7 geography Girl    67.3  8.26  50.4  67.3  78.9      15     100
 8 geography Boy     73   12.4   47.2  71.2  96.3      14      93.3
 9 history   Girl    71.2  9.17  53.9  72.9  86.4      15     100
10 history   Boy     74.4 11.2   54.4  72.6  93.5      15     100
11 math      Girl    73.8  9.03  55.6  74.8  86.3      14      93.3
12 math      Boy     73.3  9.68  60.5  72.2  93.2      14      93.3

For more details, see ?tb.

8.2 A Bridge to Other Packages

summarytools objects are not always compatible with packages focused on table formatting, such as formattable or kableExtra. However, tb() can be used as a “bridge”, an intermediary step turning freq() and descr() objects into simple tables that any package can work with. Here is an example using kableExtra:

library(kableExtra)
library(magrittr)
stby(data = iris,
     INDICES = iris$Species,
     FUN = descr,
     stats = "fivenum") %>%
  tb(order = 3) %>%
  kable(format = "html", digits = 2) %>%
  collapse_rows(columns = 1, valign = "top")

| variable | Species | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
| Petal.Length | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
| Petal.Length | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
| Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
| Petal.Width | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
| Petal.Width | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
| Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
| Sepal.Length | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
| Sepal.Length | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
| Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
| Sepal.Width | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
| Sepal.Width | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |

9. Directing Output to Files

Using the file argument with print() or view() / stview(), we can write outputs to a file, be it html, Rmd, md, or just plain text (txt). The file extension is used by the package to determine the type of content to write out.

view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")

A Note on PDF documents

There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output).

Another option is to create an Rmd document using PDF as the output format. See vignette("rmarkdown", package = "summarytools") for the details on how to proceed.

9.1 Appending Output Files

The append argument allows adding content to existing files generated by summarytools. This is useful if we wish to include several statistical tables in a single file. It is a quick alternative to creating an Rmd document.
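A minimal sketch of appending, assuming a markdown file in the session’s temporary directory; the file name is hypothetical and chosen only for illustration.

```r
library(summarytools)

# Hypothetical output file; the .md extension selects markdown content
out_file <- file.path(tempdir(), "tobacco_tables.md")

# The first call creates the file, the second one adds to it
print(freq(tobacco$gender), file = out_file)
print(freq(tobacco$smoker), file = out_file, append = TRUE)
```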


10. Package Options

The following options can be set globally with st_options():

10.1 General Options

| Option name | Default | Note |
|---|---|---|
| style (1) | “simple” | Set to “rmarkdown” in .Rmd documents |
| plain.ascii | TRUE | Set to FALSE in .Rmd documents |
| round.digits (2) | 2 | Number of decimals to show |
| headings | TRUE | Formerly “omit.headings” |
| footnote | “default” | Customize or set to NA to omit |
| display.labels | TRUE | Show variable / data frame labels in headings |
| bootstrap.css (3) | TRUE | Include Bootstrap 4 CSS in html output files |
| custom.css | NA | Path to your own CSS file |
| escape.pipe | FALSE | Useful for some Pandoc conversions |
| char.split (4) | 12 | Threshold for line-wrapping in column headings |
| subtitle.emphasis | TRUE | Controls headings formatting |
| lang | “en” | Language (always 2-letter, lowercase) |

(1) Does not apply to dfSummary(), which has its own style option (see next table)
(2) Does not apply to ctable(), which has its own round.digits option (see next table)
(3) Set to FALSE in Shiny apps
(4) Affects only html outputs for descr() and ctable()

10.2 Function-Specific Options

| Option name | Default | Note |
|---|---|---|
| freq.cumul | TRUE | Display cumulative proportions in freq() |
| freq.totals | TRUE | Display totals row in freq() |
| freq.report.nas | TRUE | Display <NA> row and “valid” columns |
| freq.ignore.threshold (1) | 25 | Used to determine which vars to ignore |
| freq.silent | FALSE | Hide console messages |
| ctable.prop | “r” | Display row proportions by default |
| ctable.totals | TRUE | Show marginal totals |
| ctable.round.digits | 1 | Number of decimals to show in ctable() |
| descr.stats | “all” | “fivenum”, “common” or vector of stats |
| descr.transpose | FALSE | Display stats in columns instead of rows |
| descr.silent | FALSE | Hide console messages |
| dfSummary.style | “multiline” | Can be set to “grid” as an alternative |
| dfSummary.varnumbers | TRUE | Show variable numbers in 1st col. |
| dfSummary.labels.col | TRUE | Show variable labels when present |
| dfSummary.graph.col | TRUE | Show graphs |
| dfSummary.valid.col | TRUE | Include the Valid column in the output |
| dfSummary.na.col | TRUE | Include the Missing column in the output |
| dfSummary.graph.magnif | 1 | Zoom factor for bar plots and histograms |
| dfSummary.silent | FALSE | Hide console messages |
| tmp.img.dir (2) | NA | Directory to store temporary images |
| use.x11 (3) | TRUE | Allow creation of Base64-encoded graphs |

(1) See section 2.3 for details
(2) Applies to dfSummary() only
(3) Set to FALSE in text-only environments

Examples

st_options()                      # Display all global options values
st_options('round.digits')        # Display the value of a specific option
st_options(style = 'rmarkdown',   # Set the value of one or several options
           footnote = NA)         # Turn off the footnote for all html output

11. Format Attributes

When a summarytools object is created, its formatting attributes are stored within it. However, we can override most of them when using print() or view().

11.1 Overriding Function-Specific Arguments

The following table indicates what arguments can be used with print() or view() to override formatting attributes. Base R’s format() function arguments can also be used (although they are not listed here).

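| Argument | freq | ctable | descr | dfSummary |
|---|---|---|---|---|
| style | x | x | x | x |
| round.digits | x | x | x | |
| plain.ascii | x | x | x | x |
| justify | x | x | x | x |
| headings | x | x | x | x |
| display.labels | x | x | x | x |
| varnumbers | | | | x |
| labels.col | | | | x |
| graph.col | | | | x |
| valid.col | | | | x |
| na.col | | | | x |
| col.widths | | | | x |
| totals | x | x | | |
| report.nas | x | | | |
| display.type | x | | | |
| missing | x | | | |
| split.tables (1) | x | x | x | x |
| caption (1) | x | x | x | x |

(1) pander options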

11.2 Overriding Heading Contents

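To change the information shown in the heading section, use the following arguments with print() or view():

| Argument | freq | ctable | descr | dfSummary |
|---|---|---|---|---|
| Data.frame | x | x | x | x |
| Data.frame.label | x | x | x | x |
| Variable | x | x | x | |
| Variable.label | x | x | x | |
| Group | x | x | x | x |
| date | x | x | x | x |
| Weights | x | | x | |
| Data.type | x | | | |
| Row.variable | | x | | |
| Col.variable | | x | | |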
Example

In the following example, we will create and display a freq() object, and then display it again, this time overriding three of its formatting attributes, as well as one of its heading attributes.

(age_stats <- freq(tobacco$age.gr)) 

Frequencies

tobacco$age.gr
Type: Factor

| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 | 25.80 | 25.80 |
| 35-50 | 241 | 24.72 | 51.18 | 24.10 | 49.90 |
| 51-70 | 317 | 32.51 | 83.69 | 31.70 | 81.60 |
| 71 + | 159 | 16.31 | 100.00 | 15.90 | 97.50 |
| <NA> | 25 | | | 2.50 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |

print(age_stats, report.nas = FALSE, totals = FALSE, display.type = FALSE, Variable.label = "Age Group")

Frequencies

tobacco$age.gr
Label: Age Group

| | Freq | % | % Cum. |
|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 |
| 35-50 | 241 | 24.72 | 51.18 |
| 51-70 | 317 | 32.51 | 83.69 |
| 71 + | 159 | 16.31 | 100.00 |

11.3 Order of Priority for Parameters / Options

  1. print() or view() parameters have precedence (overriding feature)
  2. freq() / ctable() / descr() / dfSummary() parameters come second
  3. Global options set with st_options() come third and act as default

The logic for the evaluation of the various parameter values can be summarized as follows:

  • If an argument is explicitly supplied in the function call, it will have precedence over any stored value for the parameter (stored values are the ones that are written to the object’s attributes when using a core function, as well as the ones stored in summarytools’ global options list).

  • If both a core function and the print or view function are called at once and have conflicting parameter values, print/view has precedence (they always win the argument!).

  • If the parameter values cannot be found in the function calls, the stored defaults (modified with st_options() or left as they are when loading the package) will be applied.
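The three levels of priority can be sketched with the totals setting of freq(). The variable and values below are chosen only for illustration.

```r
library(summarytools)

# 3. Global default (lowest priority): totals row shown
st_options(freq.totals = TRUE)

# 2. The core function's argument overrides the global option;
#    totals = FALSE is stored in the object's attributes
age_freq <- freq(tobacco$age.gr, totals = FALSE)
age_freq

# 1. print() has the final say: the totals row is displayed again
print(age_freq, totals = TRUE)
```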


12. Fine-Tuning Looks: CSS

When creating html reports, both Bootstrap’s CSS and summarytools.css are included by default. For greater control on the looks of html content, it is also possible to add class definitions in a custom CSS file.

Example

We need to use a very small font size for a simple html report containing a dfSummary(). For this, we create a .css file (with the name of our choosing) which contains the following class definition:

.tiny-text {
  font-size: 8px;
}

Then we use print()’s custom.css argument to specify the location of our newly created CSS file (results not shown):

print(dfSummary(tobacco),
      custom.css = 'path/to/custom.css',
      table.classes = 'tiny-text',
      file = "tiny-tobacco-dfSummary.html")

13. Shiny Apps

To successfully include summarytools functions in Shiny apps:

  • use html rendering
  • set bootstrap.css = FALSE to avoid interacting with the app’s layout
  • set headings = FALSE in case problems arise
  • adjust graph sizes with the graph.magnif parameter or with the dfSummary.graph.magnif global option
  • if dfSummary() tables are too wide, omit a column or two (valid.col and varnumbers, for instance)
  • if the results are still unsatisfactory, set column widths manually with the col.widths parameter
  • if col.widths or graph.magnif do not seem to work, try using them as parameters for print() rather than dfSummary()

Example (results not shown)

print(dfSummary(somedata, varnumbers = FALSE, valid.col = FALSE, graph.magnif = 0.8), method = 'render', headings = FALSE, bootstrap.css = FALSE)
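
As a rough sketch of how such a call could be wired into an app, the rendered summary can be returned from renderUI(). This assumes the shiny package is installed; the output id summary_table is a hypothetical name, and the bundled tobacco data stands in for somedata.

```r
library(shiny)
library(summarytools)

# Placeholder in the page where the html table will be inserted.
ui <- fluidPage(
  htmlOutput("summary_table")  # hypothetical output id
)

server <- function(input, output) {
  output$summary_table <- renderUI({
    # method = "render" makes print() return html content
    # instead of writing to the console.
    print(dfSummary(tobacco, varnumbers = FALSE, valid.col = FALSE,
                    graph.magnif = 0.8),
          method = "render",
          headings = FALSE,
          bootstrap.css = FALSE)
  })
}

# Build the app object; start it with runApp(app) in an interactive session.
app <- shinyApp(ui, server)
```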

14. Graphs in R Markdown

When using dfSummary() in an Rmd document using markdown styling (as opposed to html rendering), three elements are needed in order to display the png graphs properly:

1 - plain.ascii must be set to FALSE
2 - style must be set to “grid”
3 - tmp.img.dir must be defined and be at most 5 characters wide

Note that as of version 0.9.9, setting tmp.img.dir is no longer required when using method = "render" and can be left as NA. It is only necessary to define it when a transitory markdown table must be created, as shown below. Note how narrow the Graph column is; this is actually required, since the width of the rendered column is determined by the number of characters in the cell, rather than the width of the image itself:

+---------------+--------+----------------------+---------+
| Variable      | stats  | Graph                | Valid   |
+===============+========+======================+=========+
| age\          | ...    | ![](/tmp/ds0001.png) | 978\    |
| [numeric]     | ...    |                      | (97.8%) |
+---------------+--------+----------------------+---------+

CRAN policies are really strict when it comes to writing content in the user directories, or anywhere outside R’s temporary zone (for good reasons). So users need to set this temporary location themselves, therefore consenting to having content written outside R’s predefined temporary zone.

On Mac OS and Linux, using “/tmp” makes a lot of sense: it’s a short path, and the directory is purged automatically. On Windows, there is no such convenient directory, so we need to pick one, be it absolute (“/tmp”) or relative (“img”, or simply “.”).
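
The three requirements listed in this section could be combined as in the sketch below. It assumes a Mac OS or Linux system where “/tmp” is writable, uses the bundled tobacco data, and is meant to be placed in a chunk with results = 'asis'; the graph.magnif and valid.col values are illustrative choices only.

```r
library(summarytools)

# Markdown-styled summary with png graphs written to a short temporary path.
dfs <- dfSummary(tobacco,
                 plain.ascii  = FALSE,    # requirement 1
                 style        = "grid",   # requirement 2
                 tmp.img.dir  = "/tmp",   # requirement 3 (at most 5 characters)
                 graph.magnif = 0.75,     # illustrative: smaller graphs
                 valid.col    = FALSE)    # illustrative: narrower table

dfs
```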


15. Languages & Term Customization

Thanks to the R community’s efforts, the following languages can be used, in addition to English (default):

  • French (fr)
  • Portuguese (pt)
  • Russian (ru)
  • Spanish (es)
  • Turkish (tr)

To switch languages, simply use

st_options(lang = "fr")

All output from the core functions will now use that language:

freq(iris$Species)

Tableau de fréquences

iris$Species
Type: Facteur

             Fréq.   % Valide   % Valide cum.   % Total   % Total cum.
setosa          50      33.33           33.33     33.33          33.33
versicolor      50      33.33           66.67     33.33          66.67
virginica       50      33.33          100.00     33.33         100.00
<NA>             0                                 0.00         100.00
Total          150     100.00          100.00    100.00         100.00

15.1 Non-UTF-8 Locales

On most Windows systems, it is necessary to change the LC_CTYPE element of the locale settings if the character set is not included in the system’s default locale. For instance, in order to get good results with the Russian language in a “latin1” environment, use the following settings:

Sys.setlocale("LC_CTYPE", "russian")
st_options(lang = 'ru')

To go back to default settings…

Sys.setlocale("LC_CTYPE", "")
st_options(lang = "en")

15.2 Defining and Using Custom Terms

Using the function use_custom_lang(), it is possible to add your own set of translations or personalized terms. To achieve this, get the csv template, customize one, many or all of the roughly 70 terms, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom language settings will not persist across R sessions. This means that you should always have this csv file handy for future use.

15.3 Defining Only Specific Keywords

The define_keywords() function makes it easy to change just one or a few terms. For instance, you might prefer using “N” or “Count” rather than “Freq” in the title row of freq() tables. Or you might want to generate a document which uses the tables’ titles as section headings.

For this, call define_keywords() and feed it the term(s) you wish to modify (which can themselves be stored in predefined variables). Here, the terms we need to change are title.freq and freq:

section_title <- "**Species of Iris**"
define_keywords(title.freq = section_title, freq = "N")
freq(iris$Species)

Species of Iris

iris$Species
Type: Facteur

                 N   % Valide   % Valide cum.   % Total   % Total cum.
setosa          50      33.33           33.33     33.33          33.33
versicolor      50      33.33           66.67     33.33          66.67
virginica       50      33.33          100.00     33.33         100.00
<NA>             0                                 0.00         100.00
Total          150     100.00          100.00    100.00         100.00

Calling define_keywords() without any arguments will bring up, on systems that support graphical devices (the vast majority, that is), a window from which we can edit all the terms we want.


After closing the edit window, a dialogue box gives the option to save the newly created custom language to a csv file (even though we changed just a few keywords, the package considers the terms as a whole). We can later reload into memory the custom language file by calling use_custom_lang("path-to-custom-language-file.csv").

See ?define_keywords for a list of all customizable terms in the package.

To revert all changes, we can simply use st_options(lang = "en").

15.4 Power-Tweaking Headings

It is possible to further customize the headings by adding arguments to the print() function. Here, we use an empty string to override the value of Variable; this causes the second line of the heading to disappear altogether.

define_keywords(title.freq = "Types and Counts, Iris Flowers")
print(
  freq(iris$Species, display.type = FALSE), # Variable type won't be displayed...
  Variable = ""                             # and neither will the variable name
)

Types and Counts, Iris Flowers

                 N   % Valide   % Valide cum.   % Total   % Total cum.
setosa          50      33.33           33.33     33.33          33.33
versicolor      50      33.33           66.67     33.33          66.67
virginica       50      33.33          100.00     33.33         100.00
<NA>             0                                 0.00         100.00
Total          150     100.00          100.00    100.00         100.00

16. Vignette Setup

Knowing how this vignette is configured can help you get started with using summarytools in R Markdown documents.

16.1 The YAML Section

The output element is the one that matters:

---
output:
  rmarkdown::html_vignette:
    css:
    - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", package = "rmarkdown")
---

16.2 The Setup Chunk

```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(results = 'asis',     # Can also be set at chunk level
               comment = NA,
               prompt  = FALSE,
               cache   = FALSE)

library(summarytools)
st_options(plain.ascii = FALSE,        # Always use in Rmd documents
           style       = "rmarkdown",  # Always use in Rmd documents
           subtitle.emphasis = FALSE)  # Improves layout w/ some themes
```

16.3 Including summarytools’ CSS

The needed CSS is automatically added to html files created using print() or view() with the file argument. But in R Markdown documents, this needs to be done explicitly in a setup chunk just after the YAML header (or following a first setup chunk specifying knitr and summarytools options):

```{r, echo=FALSE}
st_css(main = TRUE, global = TRUE)
```

17. Conclusion

The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.

Stay Up to Date

Check out the GitHub project’s page; from there you can see the latest updates and also submit feature requests.

For a preview of what’s coming in the next release, have a look at the development branch.

