Dominic Comtois
2022-05-19
- 1. Overview
- 2. Frequency Tables: freq()
- 3. Cross-Tabulations: ctable()
- 4. Descriptive Statistics: descr()
- 5. Data Frame Summaries: dfSummary()
- 6. Grouped Statistics: stby()
- 7. Grouped Statistics: group_by()
- 8. Tidy Tables: tb()
- 9. Directing Output to Files
- 10. Package Options
- 11. Format Attributes
- 12. Fine-Tuning Looks: CSS
- 13. Shiny Apps
- 14. Graphs in R Markdown
- 16. Vignette Setup
- 17. Conclusion
summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:
Function | Description |
---|---|
freq() | Frequency Tables featuring counts, proportions, cumulative statistics as well as missing data reporting |
ctable() | Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions |
descr() | Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion |
dfSummary() | Data Frame Summaries featuring type-specific information for all variables: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance |
1.1 Motivation
The package was developed with the following objectives in mind:
- Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical software suites such as SAS, SPSS, and Stata
- Offer flexibility in terms of output format & content
- Integrate well with commonly used software & tools for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for standalone, simple report generation using any R interface
1.2 Directing Output
Results can be
- Displayed in the R console as plain text
- Rendered as html and shown in a Web browser or in RStudio's Viewer Pane
- Written / appended to plain text, markdown, or html files
When creating R Markdown documents, make sure to
- Use the chunk option results="asis"
- Use the function argument plain.ascii=FALSE
- Set the style parameter to "rmarkdown", or "grid" for dfSummary()
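As a minimal sketch of the three points above, assuming a standard R Markdown document with summarytools installed, the two function arguments can be set once with st_options() so that later calls can omit them. The chunk header is shown only as a comment, since it is not R code.

```r
# Chunk header in the .Rmd file (not R code): {r, results="asis"}
library(summarytools)

# Set the two R Markdown-friendly defaults once for the whole document
st_options(plain.ascii = FALSE, style = "rmarkdown")

# Later calls no longer need plain.ascii / style
freq(iris$Species)
```

For dfSummary(), the style would still be passed explicitly (for instance style = "grid"), since it has its own style option.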
1.3 Other Characteristics
- Weights-enabled: freq(), ctable() and descr() support sampling weights
- Multilingual:
  - Built-in translations exist for French, Portuguese, Spanish, Russian, and Turkish. Users can easily add custom translations or modify existing ones as needed
- Flexible and extensible:
  - The built-in features used to support alternate languages provide a way to modify a great number of terms used in outputs (headings and tables)
  - Pipe operators from magrittr (%>%, %$%) and pipeR (%>>%) are fully supported; the native |> introduced in R 4.1 is supported as well
  - Default values for a good number of function parameters can be modified using st_options() to minimize redundancy in function calls
  - By-group processing is easily achieved using the package's stby() function, which is a slightly modified version of base::by(), but dplyr::group_by() is also supported
  - Pander options can be used to customize or enhance plain text and markdown tables
  - Base R's format() parameters are also supported; this can be used to set the thousands separator or modify the decimal separator, among several other possibilities (see help("format"))
  - Bootstrap CSS is used by default with html output, and user-defined classes can be added at will
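To illustrate the weights support mentioned in the list above, here is a short sketch using the tobacco data frame bundled with the package and its samp.wgts column. The object names wf and wd are only illustrative, and this is a demonstration of the weights argument rather than a recommended analysis.

```r
library(summarytools)

# Weighted frequency table: each row counts for its sampling weight
wf <- freq(tobacco$smoker, weights = tobacco$samp.wgts)
wf

# Weighted descriptive statistics for a numeric variable
wd <- descr(tobacco$BMI, weights = tobacco$samp.wgts)
wd
```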
2. Frequency Tables: freq()
The freq() function generates frequency tables with counts, proportions, as well as missing data information. Side note: the very idea for creating this package stemmed from the absence of such a function in base R.
freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")
Frequencies
iris$Species
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | | | 0.00 | 100.00 |
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally for this document using st_options(), they are redundant and will be omitted from here on (section 16 contains a detailed description of this vignette's configuration).
2.1 Missing Data
One of summarytools' main purposes is to help clean and prepare data for further analysis. But in some circumstances, we don't need (or already have) information about missing data. Using report.nas = FALSE makes the output table smaller by one row and two columns:
freq(iris$Species, report.nas = FALSE, headings = FALSE)
| | Freq | % | % Cum. |
---|---|---|---|
setosa | 50 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 |
Total | 150 | 100.00 | 100.00 |
The headings = FALSE parameter suppresses the heading section.
2.2 Simplest Expression
By "switching off" all optional elements, a much simpler table will be produced:
freq(iris$Species, report.nas = FALSE, totals = FALSE, cumul = FALSE, headings = FALSE)
| | Freq | % |
---|---|---|
setosa | 50 | 33.33 |
versicolor | 50 | 33.33 |
virginica | 50 | 33.33 |
While the output is much simplified, the syntax is not; I blame it on Tesler's law of conservation of complexity! Thankfully, st_options() is there to accommodate everyone's preferences (see section on package options).
2.3 Multiple Frequency Tables At Once
To generate frequency tables for all variables in a data frame, we could (and in the earliest versions, needed to) use lapply(). However, this is not required since freq() accepts data frames as the main argument:
freq(tobacco)
To avoid cluttering the results, numerical columns having more than 25 distinct values are ignored. This threshold of 25 can be changed by using st_options(); for example, to change it to 10, we'd use st_options(freq.ignore.threshold = 10).
The tobacco data frame contains simulated data and is included in the package. Another simulated data frame is included: exams. Both have French versions (tabagisme, examens).
2.4 Subsetting (Filtering) Frequency Tables
The rows parameter allows subsetting frequency tables; we can use this parameter in different ways:
- To filter rows by their order of appearance, we use a numerical vector; rows = 1:10 will show the frequencies for the first 10 values only. To account for the frequencies of unshown values, the "(Other)" row is automatically added
- To filter rows by name, we can use either
  - a character vector specifying all the row names we wish to keep
  - a single character string, which will be used as a regular expression (see ?regex for more information on this topic)
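The two name-based forms of the rows parameter described above can be sketched as follows, using the disease column of the bundled tobacco data frame. The category names come from that data set, and the object names by_name and by_regex are only illustrative.

```r
library(summarytools)

# Keep two categories by exact name (character vector of length > 1)
by_name <- freq(tobacco$disease, rows = c("Cancer", "Hypertension"))
by_name

# Keep categories matching a regular expression (single character string);
# "^C" matches names starting with an uppercase C
by_regex <- freq(tobacco$disease, rows = "^C")
by_regex
```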
Showing The Most Common Values
By combining the order and rows parameters, we can easily filter the results to show, for example, the 5 most common values in a factor:
freq(tobacco$disease, order = "freq", rows = 1:5, headings = FALSE)
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
Hypertension | 36 | 16.22 | 16.22 | 3.60 | 3.60 |
Cancer | 34 | 15.32 | 31.53 | 3.40 | 7.00 |
Cholesterol | 21 | 9.46 | 40.99 | 2.10 | 9.10 |
Heart | 20 | 9.01 | 50.00 | 2.00 | 11.10 |
Pulmonary | 20 | 9.01 | 59.01 | 2.00 | 13.10 |
(Other) | 91 | 40.99 | 100.00 | 9.10 | 22.20 |
<NA> | 778 | | | 77.80 | 100.00 |
Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
Instead of "freq", we can use "-freq" to reverse the ordering and get results ranked from lowest to highest in frequency.
Notice the "(Other)" row, which is automatically generated.
2.5 Collapsible Sections
When generating html results, use the collapse = TRUE argument with print() or view() / stview() to get collapsible sections; clicking on the variable name in the heading section will collapse / reveal the frequency table (results not shown).
view(freq(tobacco), collapse = TRUE)
3. Cross-Tabulations: ctable()
ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.
Using the tobacco simulated data frame, we'll cross-tabulate the two categorical variables smoker and diseased.
ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r") # Show row proportions
Cross-Tabulation, Row Proportions
smoker * diseased
Data Frame: tobacco
diseased | Yes | No | Total | |
smoker | ||||
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) | |
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) | |
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
As can be seen, since markdown does not fully support multiline table headings, pander does what it can to display this particular type of table. To get better results, the "render" method is recommended and will be used in the next examples.
3.1 Row, Column, or Total Proportions
Row proportions are shown by default. To display column or total proportions, use prop = "c" or prop = "t", respectively. To omit proportions altogether, use prop = "n".
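As a brief illustration of the prop values listed above, the following sketch reuses the smoker and diseased variables from the tobacco data frame. The object names ct_col and ct_none are only illustrative.

```r
library(summarytools)

# Column proportions: percentages are computed within each column
ct_col <- with(tobacco, ctable(x = smoker, y = diseased, prop = "c"))
ct_col

# Counts only, no proportions
ct_none <- with(tobacco, ctable(x = smoker, y = diseased, prop = "n"))
ct_none
```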
3.2 Minimal Cross-Tabulations
By "switching off" all optional features, we get a simple "2 x 2" table:
with(tobacco, print(ctable(x = smoker, y = diseased, prop = 'n', totals = FALSE, headings = FALSE), method = "render"))
diseased | ||
---|---|---|
smoker | Yes | No |
Yes | 125 | 173 |
No | 99 | 603 |
3.3 Chi-Square (χ²), Odds Ratio and Risk Ratio
To display the chi-square statistic, set chisq = TRUE. For 2 x 2 tables, use OR and RR to show odds ratio and risk ratio (also called relative risk), respectively. Those can be set to TRUE, in which case 95% confidence intervals are shown; to use different confidence levels, use for example OR = .90.
Using pipes generally makes it easier to generate ctable() results.
library(magrittr)
tobacco %$%  # Acts like with(tobacco, ...)
  ctable(x = smoker, y = diseased, chisq = TRUE, OR = TRUE, RR = TRUE, headings = FALSE) %>%
  print(method = "render")
| | diseased | | |
---|---|---|---|
smoker | Yes | No | Total |
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
χ² = 91.7088   df = 1   p = .0000
O.R. (95% C.I.) = 4.40 (3.22 - 6.02)
R.R. (95% C.I.) = 2.97 (2.37 - 3.73)
4. Descriptive Statistics: descr()
descr() generates descriptive / univariate statistics, i.e. common central tendency statistics and measures of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.
descr(iris)
Non-numerical variable(s) ignored: Species
Descriptive Statistics
iris
N: 150
| | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
---|---|---|---|---|
Mean | 3.76 | 1.20 | 5.84 | 3.06 |
Std.Dev | 1.77 | 0.76 | 0.83 | 0.44 |
Min | 1.00 | 0.10 | 4.30 | 2.00 |
Q1 | 1.60 | 0.30 | 5.10 | 2.80 |
Median | 4.35 | 1.30 | 5.80 | 3.00 |
Q3 | 5.10 | 1.80 | 6.40 | 3.30 |
Max | 6.90 | 2.50 | 7.90 | 4.40 |
MAD | 1.85 | 1.04 | 1.04 | 0.44 |
IQR | 3.50 | 1.50 | 1.30 | 0.50 |
CV | 0.47 | 0.64 | 0.14 | 0.14 |
Skewness | -0.27 | -0.10 | 0.31 | 0.31 |
SE.Skewness | 0.20 | 0.20 | 0.20 | 0.20 |
Kurtosis | -1.42 | -1.36 | -0.61 | 0.14 |
N.Valid | 150.00 | 150.00 | 150.00 | 150.00 |
Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 |
To turn off the variable-type messages, use silent = TRUE. It is possible to set that option globally, which we will do here, so the message won't be displayed in the remainder of this vignette.
st_options(descr.silent = TRUE)
4.1 Transposing and Selecting Statistics
Results can be transposed by using transpose = TRUE, and statistics can be selected using the stats argument:
descr(iris, stats = c("mean", "sd"), transpose = TRUE, headings = FALSE)
| | Mean | Std.Dev |
---|---|---|
Petal.Length | 3.76 | 1.77 |
Petal.Width | 1.20 | 0.76 |
Sepal.Length | 5.84 | 0.83 |
Sepal.Width | 3.06 | 0.44 |
See ?descr for a list of all available statistics. Special values "all", "fivenum", and "common" are also valid. The default value is "all", and it can be modified using st_options():
st_options(descr.stats = "common")
5. Data Frame Summaries: dfSummary()
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
To see the results in RStudio's Viewer (or in the default Web browser if working in another IDE or from a terminal window), use the view() function, or its twin stview() in case of name conflicts:
view(dfSummary(iris))
Be careful: some packages redefine view() to point to View(); loading summarytools after these packages will ensure its own view() works properly. Otherwise, stview() is always there as a foolproof alternative.
5.1 Using dfSummary() in R Markdown Documents
When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.
dfSummary(tobacco, plain.ascii = FALSE, style = "grid", graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")
The tmp.img.dir parameter is mandatory when generating dfSummaries in R Markdown documents, except for html rendering. The explanation for this can be found further below.
Some users reported repeated X11 warnings; those can be avoided by setting the warning chunk option to FALSE: {r chunk_name, results="asis", warning=FALSE}.
5.2 Optional Statistics
This feature has been requested several times since the package was released. Introduced in version 1.0.0, it provides control over which statistics to show in the Stats/Values column. Namely, the third row, which displays IQR (CV), can be modified to show any available statistics in R. An additional "slot" (unused by default) is also made available. To use this feature, define dfSummary.custom.1 and/or dfSummary.custom.2 using st_options() in the following way, encapsulating the code in an expression():
st_options(
  dfSummary.custom.1 =
    expression(
      paste(
        "Q1 - Q3 :",
        round(
          quantile(column_data, probs = .25, type = 2,
                   names = FALSE, na.rm = TRUE), digits = 1
        ), " - ",
        round(
          quantile(column_data, probs = .75, type = 2,
                   names = FALSE, na.rm = TRUE), digits = 1
        )
      )
    )
)

print(
  dfSummary(iris, varnumbers = FALSE, na.col = FALSE, style = "multiline",
            plain.ascii = FALSE, headings = FALSE, graph.magnif = .8),
  method = "render"
)
Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid |
---|---|---|---|---|
Sepal.Length [numeric] | | 35 distinct values | | 150 (100.0%) |
Sepal.Width [numeric] | | 23 distinct values | | 150 (100.0%) |
Petal.Length [numeric] | | 43 distinct values | | 150 (100.0%) |
Petal.Width [numeric] | | 22 distinct values | | 150 (100.0%) |
Species [factor] | | | | 150 (100.0%) |
If we had used dfSummary.custom.2 instead of dfSummary.custom.1, a fourth row would have been added under the default IQR (CV) row.
Note that instead of round(), it is possible to use the internal format_number(), which ensures the number is formatted according to all specified arguments (rounding digits, decimal mark and thousands mark, etc.). The internal variable round.digits, which contains the value of st_options("round.digits"), can also be used. This is how the default IQR (CV) is defined; here we set the first custom stat back to its default value and then display its definition (formatR::tidy_source() is used to format / indent the expression):
library(formatR)
st_options(dfSummary.custom.1 = "default")
formatR::tidy_source(
  text = deparse(st_options("dfSummary.custom.1")),
  indent = 2,
  args.newline = TRUE
)
expression(
  paste(
    paste0(
      trs("iqr"), " (", trs("cv"), ") : "
    ),
    format_number(
      IQR(column_data, na.rm = TRUE), round.digits
    ),
    " (",
    format_number(
      sd(column_data, na.rm = TRUE)/mean(column_data, na.rm = TRUE), round.digits
    ),
    ")", collapse = "", sep = ""
  )
)
Don't forget to specify na.rm = TRUE for all functions that use this parameter (most base R functions do).
5.3 Other Notable Features
The dfSummary() function also
- Reports the number of duplicate records in the heading section
- Detects UPC/EAN codes (barcode numbers) and doesn't calculate irrelevant statistics for them
- Detects email addresses and reports counts of valid, invalid and duplicate addresses; note that the proportions of valid and invalid sum up to 100%; the duplicates proportion is calculated independently, which is why in the bar chart (html version), the bar for this category is shown with a different color
- Allows the display of "windowed" results by using the max.tbl.height parameter; this is especially convenient if the analyzed data frame has numerous variables; see vignette("rmarkdown", package = "summarytools") for more details
5.4 Excluding Columns
Although most columns can be excluded using the function's parameters, it is also possible to delete them with the following syntax (results not shown):
dfs <- dfSummary(iris)
dfs$Variable <- NULL # This deletes the "Variable" column
6. Grouped Statistics: stby()
To produce optimal results, summarytools has its own version of the base by() function. It's called stby(), and we use it exactly as we would by():
(iris_stats_by_species <- stby(data = iris, INDICES = iris$Species, FUN = descr, stats = "common", transpose = TRUE))
Descriptive Statistics
iris
Group: Species = setosa
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 | 50.00 | 100.00 |
Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 | 50.00 | 100.00 |
Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 | 50.00 | 100.00 |
Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 | 50.00 | 100.00 |
Group: Species = versicolor
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 | 50.00 | 100.00 |
Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 | 50.00 | 100.00 |
Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 | 50.00 | 100.00 |
Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 | 50.00 | 100.00 |
Group: Species = virginica
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 | 50.00 | 100.00 |
Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 | 50.00 | 100.00 |
Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 | 50.00 | 100.00 |
Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 | 50.00 | 100.00 |
6.1 Special Case of descr() with stby()
When used to produce split-group statistics for a single variable, stby() assembles everything into a single table instead of displaying a series of one-column tables.
with(tobacco, stby(data = BMI, INDICES = age.gr, FUN = descr, stats = c("mean", "sd", "min", "med", "max")))
Descriptive Statistics
BMI by age.gr
Data Frame: tobacco
N: 258
| | 18-34 | 35-50 | 51-70 | 71 + |
---|---|---|---|---|
Mean | 23.84 | 25.11 | 26.91 | 27.45 |
Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 |
Min | 8.83 | 10.35 | 9.01 | 16.36 |
Median | 24.04 | 25.11 | 26.77 | 27.52 |
Max | 34.84 | 39.44 | 39.21 | 38.37 |
6.2 Using stby() with ctable()
The syntax is a little trickier for this combination, so here is an example (results not shown):
stby(data = list(x = tobacco$smoker, y = tobacco$diseased), INDICES = tobacco$gender, FUN = ctable)
# or equivalently
with(tobacco, stby(data = list(x = smoker, y = diseased), INDICES = gender, FUN = ctable))
7. Grouped Statistics: group_by()
To create grouped statistics with freq(), descr() or dfSummary(), it is possible to use dplyr's group_by() as an alternative to stby(). Syntactic differences aside, one key distinction is that group_by() considers NA values on the grouping variable(s) as a valid category, albeit with a warning suggesting the use of forcats::fct_explicit_na to make NAs explicit in factors. Following this advice, we get:
library(dplyr)
tobacco$gender %<>% forcats::fct_explicit_na()
tobacco %>%
  group_by(gender) %>%
  descr(stats = "fivenum")
Descriptive Statistics
tobacco
Group: gender = F
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 9.01 | 18.00 | 0.00 | 0.86 |
Q1 | 22.98 | 34.00 | 0.00 | 0.86 |
Median | 25.87 | 50.00 | 0.00 | 1.04 |
Q3 | 29.48 | 66.00 | 10.50 | 1.05 |
Max | 39.44 | 80.00 | 40.00 | 1.06 |
Group: gender = M
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 8.83 | 18.00 | 0.00 | 0.86 |
Q1 | 22.52 | 34.00 | 0.00 | 0.86 |
Median | 25.14 | 49.50 | 0.00 | 1.04 |
Q3 | 27.96 | 66.00 | 11.00 | 1.05 |
Max | 36.76 | 80.00 | 40.00 | 1.06 |
Group: gender = (Missing)
N: 22
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 20.24 | 19.00 | 0.00 | 0.86 |
Q1 | 24.97 | 36.00 | 0.00 | 1.04 |
Median | 27.16 | 55.50 | 0.00 | 1.05 |
Q3 | 30.23 | 64.00 | 10.00 | 1.05 |
Max | 32.43 | 80.00 | 28.00 | 1.06 |
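A side note for readers on recent forcats releases: starting with forcats 1.0.0, fct_explicit_na() is deprecated in favor of fct_na_value_to_level(). The sketch below shows the same grouping with the newer function, assuming forcats 1.0.0 or later is installed. The name tobacco2 is only an illustrative copy, so the original data frame stays untouched.

```r
library(dplyr)
library(summarytools)

# Work on a copy so that the bundled data frame is not modified
tobacco2 <- tobacco
tobacco2$gender <- forcats::fct_na_value_to_level(tobacco2$gender,
                                                  level = "(Missing)")

tobacco2 %>%
  group_by(gender) %>%
  descr(stats = "fivenum")
```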
8. Tidy Tables: tb()
When generating freq() or descr() tables, it is possible to turn the results into "tidy" tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:
library(magrittr)
iris %>%
  descr(stats = "common") %>%
  tb()
# A tibble: 4 x 8
  variable      mean    sd   min   med   max n.valid pct.valid
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Petal.Length  3.76 1.77    1    4.35   6.9     150       100
2 Petal.Width   1.20 0.762   0.1  1.3    2.5     150       100
3 Sepal.Length  5.84 0.828   4.3  5.8    7.9     150       100
4 Sepal.Width   3.06 0.436   2    3      4.4     150       100
iris$Species %>% freq(cumul = FALSE, report.nas = FALSE) %>% tb()
# A tibble: 3 x 3
  Species     freq   pct
  <fct>      <dbl> <dbl>
1 setosa        50  33.3
2 versicolor    50  33.3
3 virginica     50  33.3
By definition, no total rows are part of tidy tables, and the row names are converted to a regular column.
When displaying tibbles using rmarkdown, the knitr chunk option results should be set to "markup" instead of "asis".
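Once the results are in a tibble, ordinary dplyr verbs apply. The following sketch, which assumes dplyr is installed and uses the illustrative names tidy_stats and high_means, keeps the iris variables whose mean exceeds 3 and sorts them by decreasing mean.

```r
library(summarytools)
library(dplyr)

tidy_stats <- iris %>%
  descr(stats = "common") %>%
  tb()

# Keep variables whose mean exceeds 3, sorted by decreasing mean
high_means <- tidy_stats %>%
  filter(mean > 3) %>%
  arrange(desc(mean))

high_means$variable
```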
8.1 Tidy Split-Group Statistics
Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.
grouped_descr <- stby(data = exams, INDICES = exams$gender, FUN = descr, stats = "common")
grouped_descr %>% tb()
# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 3 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 4 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
 5 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
 6 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
 7 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
 8 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
 9 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
10 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
11 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3
The order parameter controls row ordering:
grouped_descr %>% tb(order = 2)
# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100
 3 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100
 5 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 6 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100
 7 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100
 8 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
 9 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100
10 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100
11 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3
Setting order = 3 changes the order of the sort variables exactly as with order = 2, but it also reorders the columns:
grouped_descr %>% tb(order = 3)
# A tibble: 12 x 9
   variable  gender  mean    sd   min   med   max n.valid pct.valid
   <chr>     <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 economics Girl    72.5  7.79  62.3  70.2  89.6      14      93.3
 2 economics Boy     75.2  9.40  60.5  71.7  94.2      15     100
 3 english   Girl    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 english   Boy     77.8  5.94  69.6  77.6  90.2      15     100
 5 french    Girl    71.1 12.4   44.8  68.4  93.7      14      93.3
 6 french    Boy     76.6  8.63  63.2  74.8  94.7      15     100
 7 geography Girl    67.3  8.26  50.4  67.3  78.9      15     100
 8 geography Boy     73   12.4   47.2  71.2  96.3      14      93.3
 9 history   Girl    71.2  9.17  53.9  72.9  86.4      15     100
10 history   Boy     74.4 11.2   54.4  72.6  93.5      15     100
11 math      Girl    73.8  9.03  55.6  74.8  86.3      14      93.3
12 math      Boy     73.3  9.68  60.5  72.2  93.2      14      93.3
For more details, see ?tb.
8.2 A Bridge to Other Packages
summarytools objects are not always compatible with packages focused on table formatting, such as formattable or kableExtra. However, tb() can be used as a "bridge", an intermediary step turning freq() and descr() objects into simple tables that any package can work with. Here is an example using kableExtra:
library(kableExtra)
library(magrittr)
stby(data = iris, INDICES = iris$Species, FUN = descr, stats = "fivenum") %>%
  tb(order = 3) %>%
  kable(format = "html", digits = 2) %>%
  collapse_rows(columns = 1, valign = "top")
variable | Species | min | q1 | med | q3 | max |
---|---|---|---|---|---|---|
Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
Petal.Length | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
Petal.Length | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
Petal.Width | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
Petal.Width | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
Sepal.Length | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
Sepal.Length | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
Sepal.Width | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
Sepal.Width | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |
9. Directing Output to Files
Using the file argument with print() or view() / stview(), we can write outputs to a file, be it html, Rmd, md, or just plain text (txt). The file extension is used by the package to determine the type of content to write out.
view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")
A Note on PDF documents
There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output).
Another option is to create an Rmd document using PDF as the output format. See vignette("rmarkdown", package = "summarytools") for the details on how to proceed.
9.1 Appending Output Files
The append argument allows adding content to existing files generated by summarytools. This is useful if we wish to include several statistical tables in a single file. It is a quick alternative to creating an Rmd document.
10. Package Options
The following options can be set globally with st_options():
10.1 General Options
Option name | Default | Note |
---|---|---|
style (1) | "simple" | Set to "rmarkdown" in .Rmd documents |
plain.ascii | TRUE | Set to FALSE in .Rmd documents |
round.digits (2) | 2 | Number of decimals to show |
headings | TRUE | Formerly "omit.headings" |
footnote | "default" | Customize or set to NA to omit |
display.labels | TRUE | Show variable / data frame labels in headings |
bootstrap.css (3) | TRUE | Include Bootstrap 4 CSS in html output files |
custom.css | NA | Path to your own CSS file |
escape.pipe | FALSE | Useful for some Pandoc conversions |
char.split (4) | 12 | Threshold for line-wrapping in column headings |
subtitle.emphasis | TRUE | Controls headings formatting |
lang | "en" | Language (always 2-letter, lowercase) |
1 Does not apply to dfSummary(), which has its own style option (see next table)
2 Does not apply to ctable(), which has its own round.digits option (see next table)
3 Set to FALSE in Shiny apps
4 Affects only html outputs for descr() and ctable()
10.2 Function-Specific Options
Option name | Default | Note |
---|---|---|
freq.cumul | TRUE | Display cumulative proportions in freq() |
freq.totals | TRUE | Display totals row in freq() |
freq.report.nas | TRUE | Display the <NA> row and the "valid" columns in freq() |
freq.ignore.threshold (1) | 25 | Used to determine which vars to ignore |
freq.silent | FALSE | Hide console messages |
ctable.prop | "r" | Display row proportions by default |
ctable.totals | TRUE | Show marginal totals |
ctable.round.digits | 1 | Number of decimals to show in ctable() |
descr.stats | "all" | "fivenum", "common" or vector of stats |
descr.transpose | FALSE | Display stats in columns instead of rows |
descr.silent | FALSE | Hide console messages |
dfSummary.style | "multiline" | Can be set to "grid" as an alternative |
dfSummary.varnumbers | TRUE | Show variable numbers in 1st col. |
dfSummary.labels.col | TRUE | Show variable labels when present |
dfSummary.graph.col | TRUE | Show graphs |
dfSummary.valid.col | TRUE | Include the Valid column in the output |
dfSummary.na.col | TRUE | Include the Missing column in the output |
dfSummary.graph.magnif | 1 | Zoom factor for bar plots and histograms |
dfSummary.silent | FALSE | Hide console messages |
tmp.img.dir (2) | NA | Directory to store temporary images |
use.x11 (3) | TRUE | Allow creation of Base64-encoded graphs |
1 See section 2.3 for details
2 Applies to dfSummary() only
3 Set to FALSE in text-only environments
Examples
st_options()                      # Display all global options values
st_options('round.digits')        # Display the value of a specific option
st_options(style = 'rmarkdown',   # Set the value of one or several options
           footnote = NA)         # Turn off the footnote for all html output
11. Format Attributes
When a summarytools object is created, its formatting attributes are stored within it. However, we can override most of them when using print() or view().
11.1 Overriding Function-Specific Arguments
The following table indicates what arguments can be used with print() or view() to override formatting attributes. Base R's format() function arguments can also be used (although they are not listed here).
Argument | freq | ctable | descr | dfSummary |
---|---|---|---|---|
style | x | x | x | x |
round.digits | x | x | x | |
plain.ascii | x | x | x | x |
justify | x | x | x | x |
headings | x | x | x | x |
display.labels | x | x | x | x |
varnumbers | | | | x |
labels.col | | | | x |
graph.col | | | | x |
valid.col | | | | x |
na.col | | | | x |
col.widths | | | | x |
totals | x | x | | |
report.nas | x | | | |
display.type | x | | | |
missing | x | | | |
split.tables (1) | x | x | x | x |
caption (1) | x | x | x | x |
1 pander options
11.2 Overriding Heading Contents
To change the information shown in the heading section, use the following arguments with print() or view():
Argument | freq | ctable | descr | dfSummary |
---|---|---|---|---|
Data.frame | x | x | x | x |
Data.frame.label | x | x | x | x |
Variable | x | x | x | |
Variable.label | x | x | x | |
Group | x | x | x | x |
date | x | x | x | x |
Weights | x | | x | |
Data.type | x | | | |
Row.variable | | x | | |
Col.variable | | x | | |
Example
In the following example, we will create and display a freq() object, and then display it again, this time overriding three of its formatting attributes, as well as one of its heading attributes.
(age_stats <- freq(tobacco$age.gr))
Frequencies
tobacco$age.gr
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 | 25.80 | 25.80 |
| 35-50 | 241 | 24.72 | 51.18 | 24.10 | 49.90 |
| 51-70 | 317 | 32.51 | 83.69 | 31.70 | 81.60 |
| 71 + | 159 | 16.31 | 100.00 | 15.90 | 97.50 |
| <NA> | 25 | | | 2.50 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
print(age_stats, report.nas = FALSE, totals = FALSE, display.type = FALSE, Variable.label = "Age Group")
Frequencies
tobacco$age.gr
Label: Age Group
| | Freq | % | % Cum. |
|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 |
| 35-50 | 241 | 24.72 | 51.18 |
| 51-70 | 317 | 32.51 | 83.69 |
| 71 + | 159 | 16.31 | 100.00 |
11.3 Order of Priority for Parameters / Options
- print() or view() parameters have precedence (overriding feature)
- freq() / ctable() / descr() / dfSummary() parameters come second
- Global options set with st_options() come third and act as default
The logic for the evaluation of the various parameter values can besummarized as follows:
- If an argument is explicitly supplied in the function call, it will have precedence over any stored value for the parameter (stored values are the ones that are written to the object's attributes when using a core function, as well as the ones stored in summarytools' global options list).
- If both a core function and the print or view function are called at once and have conflicting parameter values, print/view has precedence (they always win the argument!).
- If the parameter values cannot be found in the function calls, the stored defaults (modified with st_options() or left as they are when loading the package) will be applied.
12. Fine-Tuning Looks : CSS
When creating html reports, both Bootstrap's CSS and summarytools.css are included by default. For greater control on the looks of html content, it is also possible to add class definitions in a custom CSS file.
Example
We need to use a very small font size for a simple html report containing a dfSummary(). For this, we create a .css file (with the name of our choosing) which contains the following class definition:
.tiny-text { font-size: 8px;}
Then we use print()'s custom.css argument to specify the location of our newly created CSS file (results not shown):
print(dfSummary(tobacco), custom.css = 'path/to/custom.css', table.classes = 'tiny-text', file = "tiny-tobacco-dfSummary.html")
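The two steps can also be chained in a single script. The following is a minimal sketch, assuming we are happy to write both the CSS file and the report to R's temporary directory; the file names are arbitrary.

```r
library(summarytools)

# Write the class definition to a CSS file in R's temporary directory
css_file <- file.path(tempdir(), "custom.css")
writeLines(".tiny-text { font-size: 8px; }", css_file)

# Render the data frame summary to an html file, applying the custom class
out_file <- file.path(tempdir(), "tiny-tobacco-dfSummary.html")
print(dfSummary(tobacco),
      custom.css    = css_file,
      table.classes = "tiny-text",
      file          = out_file)
```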
13. Shiny Apps
To successfully include summarytools functions in Shiny apps,
- use html rendering
- set bootstrap.css = FALSE to avoid interacting with the app's layout
- set headings = FALSE in case problems arise
- adjust graph sizes with the graph.magnif parameter or with the dfSummary.graph.magnif global option
- if dfSummary() tables are too wide, omit a column or two (valid.col and varnumbers, for instance)
- if the results are still unsatisfactory, set column widths manually with the col.widths parameter
- if col.widths or graph.magnif do not seem to work, try using them as parameters for print() rather than dfSummary()
Example (results not shown)
print(dfSummary(somedata, varnumbers = FALSE, valid.col = FALSE, graph.magnif = 0.8), method = 'render', headings = FALSE, bootstrap.css = FALSE)
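For context, here is one way such a call might be wired into an app. This is only a sketch: it assumes the shiny package is installed, uses the tobacco dataset in place of somedata, and relies on renderUI() / htmlOutput() accepting the html object returned by print() with method = 'render'. The output id "summary" is an arbitrary name.

```r
library(shiny)
library(summarytools)

ui <- fluidPage(
  titlePanel("dfSummary in Shiny (sketch)"),
  htmlOutput("summary")   # placeholder for the rendered html table
)

server <- function(input, output, session) {
  output$summary <- renderUI({
    print(dfSummary(tobacco, varnumbers = FALSE, valid.col = FALSE,
                    graph.magnif = 0.8),
          method        = "render",
          headings      = FALSE,
          bootstrap.css = FALSE)
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```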
14. Graphs in R Markdown
When using dfSummary() in an Rmd document using markdown styling (as opposed to html rendering), three elements are needed in order to display the png graphs properly:
1 - plain.ascii must be set to FALSE
2 - style must be set to "grid"
3 - tmp.img.dir must be defined and be at most 5 characters wide
Note that as of version 0.9.9, setting tmp.img.dir is no longer required when using method = "render" and can be left to NA. It is only necessary to define it when a transitory markdown table must be created, as shown below. Note how narrow the Graph column is; this is actually required, since the width of the rendered column is determined by the number of characters in the cell, rather than the width of the image itself:
+---------------+--------+----------------------+---------+
| Variable      | stats  |  Graph               | Valid   |
+===============+========+======================+=========+
| age\          | ...    | ![](/tmp/ds0001.png) | 978\    |
| [numeric]     | ...    |                      | (97.8%) |
+---------------+--------+----------------------+---------+
CRAN policies are really strict when it comes to writing content in the user directories, or anywhere outside R's temporary zone (for good reasons). So users need to set this temporary location themselves, therefore consenting to having content written outside R's predefined temporary zone.
On Mac OS and Linux, using "/tmp" makes a lot of sense: it's a short path, and the directory is purged automatically. On Windows, there is no such convenient directory, so we need to pick one, be it absolute ("/tmp") or relative ("img", or simply ".").
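Putting the three requirements together, a chunk in an Rmd document (with results = 'asis') might look like the sketch below. It assumes "/tmp" exists and is writable, as on Mac OS or Linux, and uses the tobacco dataset; the graph.magnif, varnumbers and valid.col settings are optional and only keep the table narrow.

```r
library(summarytools)

# 3 - tmp.img.dir: a short path where the png files will be written
#     (assumption: "/tmp" exists; on Windows use e.g. "img" or ".")
st_options(tmp.img.dir = "/tmp")

dfSummary(tobacco,
          plain.ascii  = FALSE,    # 1 - plain.ascii must be FALSE
          style        = "grid",   # 2 - style must be "grid"
          graph.magnif = 0.85,
          varnumbers   = FALSE,
          valid.col    = FALSE)
```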
15. Translations
Thanks to the R community's efforts, the following languages can be used, in addition to English (default):
- French (fr)
- Portuguese (pt)
- Russian (ru)
- Spanish (es)
- Turkish (tr)
To switch languages, simply use
st_options(lang = "fr")
All output from the core functions will now use that language:
freq(iris$Species)
Tableau de fréquences
iris$Species
Type: Facteur
| | Fréq. | % Valide | % Valide cum. | % Total | % Total cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
15.1 Non-UTF-8 Locales
On most Windows systems, it is necessary to change the LC_CTYPE element of the locale settings if the character set is not included in the system's default locale. For instance, in order to get good results with the Russian language in a "latin1" environment, use the following settings:
Sys.setlocale("LC_CTYPE", "russian")
st_options(lang = 'ru')
To go back to default settings:
Sys.setlocale("LC_CTYPE", "")
st_options(lang = "en")
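If the session was not using the default locale to begin with, it may be safer to save the current settings and restore them afterwards. The sketch below uses base R's Sys.getlocale() for that purpose; it assumes a Windows system where the "russian" locale name is recognized (elsewhere, Sys.setlocale() simply issues a warning and leaves the locale unchanged).

```r
library(summarytools)

# Save the current settings so they can be restored exactly
old_ctype <- Sys.getlocale("LC_CTYPE")
old_lang  <- st_options("lang")

Sys.setlocale("LC_CTYPE", "russian")   # locale name as used on Windows
st_options(lang = "ru")

freq(iris$Species)                     # output now uses Russian terms

# Restore the previous locale and language
Sys.setlocale("LC_CTYPE", old_ctype)
st_options(lang = old_lang)
```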
15.2 Defining and Using Custom Terms
Using the function use_custom_lang(), it is possible to add your own set of translations or personalized terms. To achieve this, get the csv template, customize one, many or all of the +/- 70 terms, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom language settings will not persist across R sessions. This means that you should always have this csv file handy for future use.
15.3 Defining Only Specific Keywords
The define_keywords() function makes it easy to change just one or a few terms. For instance, you might prefer using "N" or "Count" rather than "Freq" in the title row of freq() tables. Or you might want to generate a document which uses the tables' titles as heading sections.
For this, call define_keywords() and feed it the term(s) you wish to modify (which can themselves be stored in predefined variables). Here, the terms we need to change are title.freq and freq:
section_title <- "**Species of Iris**"
define_keywords(title.freq = section_title,
                freq = "N")
freq(iris$Species)
Species of Iris
iris$Species
Type: Facteur
| | N | % Valide | % Valide cum. | % Total | % Total cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
Calling define_keywords() without any arguments will bring up, on systems that support graphical devices (the vast majority, that is), a window from which we can edit all the terms we want.
After closing the edit window, a dialogue box gives the option to save the newly created custom language to a csv file (even though we changed just a few keywords, the package considers the terms as a whole). We can later reload into memory the custom language file by calling use_custom_lang("path-to-custom-language-file.csv").
See ?define_keywords for a list of all customizable terms in the package.
To revert all changes, we can simply use st_options(lang = "en").
15.4 Power-Tweaking Headings
It is possible to further customize the headings by adding arguments to the print() function. Here, we use an empty string to override the value of Variable; this causes the second line of the heading to disappear altogether.
define_keywords(title.freq = "Types and Counts, Iris Flowers")
print(
  freq(iris$Species,
       display.type = FALSE),   # Variable type won't be displayed...
  Variable = ""                 # and neither will the variable name
)
Types and Counts, Iris Flowers
| | N | % Valide | % Valide cum. | % Total | % Total cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
16. Vignette Setup
Knowing how this vignette is configured can help you get started with using summarytools in R Markdown documents.
16.1 The YAML Section
The output element is the one that matters:
---
output:
  rmarkdown::html_vignette:
    css:
    - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", package = "rmarkdown")
---
16.2 The Setup Chunk
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(results = 'asis',       # Can also be set at chunk level
               comment = NA,
               prompt  = FALSE,
               cache   = FALSE)
library(summarytools)
st_options(plain.ascii = FALSE,        # Always use in Rmd documents
           style       = "rmarkdown",  # Always use in Rmd documents
           subtitle.emphasis = FALSE)  # Improves layout w/ some themes
```
16.3 Including summarytools' CSS
The needed CSS is automatically added to html files created using print() or view() with the file argument. But in R Markdown documents, this needs to be done explicitly in a setup chunk just after the YAML header (or following a first setup chunk specifying knitr and summarytools options):
```{r, echo=FALSE}
st_css(main = TRUE, global = TRUE)
```
17. Conclusion
The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.
Stay Up to Date
Check out the GitHub project's page; from there you can see the latest updates and also submit feature requests.
For a preview of what's coming in the next release, have a look at the development branch.