Data Visualization in R

EPI 553 | Principles of Statistical Inference II

Muntasir Masum, PhD

2026-04-23

Roadmap

Foundations

Why visualization is a methods topic
The grammar of graphics
Healy’s principles: honesty, clarity, comparison
Holtz’s chart-type decision tree
Scherer’s editorial approach
A worked transformation: same data, four iterations

Modern Techniques

Color, type, and accessibility
gghighlight, animation, interactivity
Distributions, uncertainty, heatmaps
Patchwork, annotation, spatial viz
Visualizing regression models
The #30DayChartChallenge
Building your own reusable theme

Part 1

Why Visualization Is a Methods Topic

Statistics gives us numbers. Visualization makes them mean something.

Three Reasons

Visualization is part of the analysis, not decoration. A scatterplot reveals a non-linear relationship that a correlation coefficient hides. A residual plot exposes a violated assumption that an R-squared celebrates.
Visualization is how findings reach decision-makers. A clinician, journalist, or policymaker will not read your beta coefficients. They will look at your figure. The figure is the one part of the paper that everyone reads.
Visualization is a discipline with its own theory. Bad charts are not just ugly; they are wrong. They make comparisons hard, hide variation, and encode noise as signal.

“Above all else, show the data.” – Edward Tufte

The Same Data, Completely Different Stories

Figure 1: Anscombe’s Quartet: four datasets with nearly identical summary statistics but radically different patterns.

Every analysis you do should begin and end with looking at the data.
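The quartet's claim is easy to check. The sketch below uses the `anscombe` data frame that ships with base R; the name `quartet_stats` is only illustrative.

```r
# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4).
# Compute the same three summaries for each of the four x/y pairs.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i, mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
}))

# The rows agree to about two decimal places, yet the scatterplots differ.
print(round(quartet_stats, 2))
```

The summaries match across all four sets, which is why only a plot reveals the difference.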

Part 2

The Grammar of Graphics

Leland Wilkinson (1999) described a unified framework. Hadley Wickham implemented it in R as ggplot2 (first CRAN release in 2007).

The Seven Layers

Core structure

Data
- What data am I plotting?
- ggplot(data = ...)
Aesthetics
- Which variables map to x, y, color, size?
- aes(x, y, color, size)
Geometries
- What shape do I draw?
- geom_point(), geom_line(), geom_col()
Facets
- Do I split into small multiples?
- facet_wrap() / facet_grid()

Statistics
- Do I summarize or transform?
- stat_*() / geom_smooth()
Coordinates and scales
- How are axes and scales defined?
- coord_*() / scale_*()
Theme
- How does it look?
- theme_*() / theme()

Think of these as a checklist: data, mapping, marks, splits, summaries, scales, polish.

The genius is composability: to go from a scatterplot to a faceted scatterplot with a smoother, you add layers; you do not start over.
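A minimal sketch of that composability, using the built-in `mtcars` data rather than the penguins so it runs with ggplot2 alone. The object names `p_base`, `p_smooth`, and `p_facet` are illustrative.

```r
library(ggplot2)

# Start with data + aesthetics + one geometry.
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Add a smoother: the scatterplot is reused, not rewritten.
p_smooth <- p_base +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Add facets on top of that.
p_facet <- p_smooth +
  facet_wrap(~ cyl)

p_facet
```

Each step stores a complete plot, so you can print any intermediate version while you iterate.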

Building a Plot Layer by Layer

Show the code

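library(tidyverse)
library(palmerpenguins)  # provides the penguins data
# epi_colors (used below) is assumed to be a color vector defined in the deck's setup code

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +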
  geom_point(alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ island) +
  scale_color_manual(values = epi_colors) +
  labs(title = "Bill Dimensions of Palmer Penguins",
       subtitle = "By species and island, with linear trend lines",
       x = "Bill length (mm)", y = "Bill depth (mm)",
       color = "Species")

Figure 2

The ggplot2 Template

Every ggplot follows the same skeleton:

ggplot(data = <DATA>,
       aes(x = <X>, y = <Y>, color = <Z>)) +
  geom_<TYPE>(...) +
  facet_wrap(~ <VAR>) +
  scale_<AES>_<TYPE>(...) +
  labs(title = "...", x = "...", y = "...") +
  theme_minimal()

1: Data – what dataset am I plotting?
2: Aesthetics – which variables map to x, y, color, size?
3: Geometry – what shape do I draw?
4: Facets – do I split into small multiples?
5: Scales – how are axes and color legends defined?
6: Labels – title, subtitle, axis labels
7: Theme – how does it look?

You will use this template for every chart in this course and beyond.

Part 3

Kieran Healy’s Principles

Is the chart substantively good? Perceptually good? Aesthetically good?

Substantive Standards

The Rules

Show the data. Individual observations, not just summaries
Compare like with like. Stratify intentionally
Quantify uncertainty. Always show CIs or standard errors
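The three rules can be combined in one chart. This is a sketch under stated assumptions: it uses the built-in `mtcars` data, a t-based 95% interval for each group mean, and the illustrative name `ci_by_cyl`.

```r
library(ggplot2)

# t-based 95% confidence interval for mean mpg within each cylinder group.
groups <- split(mtcars$mpg, mtcars$cyl)
ci_by_cyl <- do.call(rbind, lapply(names(groups), function(g) {
  v <- groups[[g]]
  half_width <- qt(0.975, df = length(v) - 1) * sd(v) / sqrt(length(v))
  data.frame(cyl = g, n = length(v), mean = mean(v),
             lower = mean(v) - half_width, upper = mean(v) + half_width)
}))

# Raw observations (jittered sideways only) plus the mean and its interval.
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
  geom_pointrange(data = ci_by_cyl,
                  aes(x = cyl, y = mean, ymin = lower, ymax = upper),
                  color = "#D55E00", linewidth = 1) +
  labs(x = "Cylinders", y = "Miles per gallon")
```

Setting `height = 0` keeps the jitter from altering the measured values; only the horizontal position is randomized.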

Perceptual Standards

Cleveland & McGill (1984) ranked visual encodings by accuracy:

Figure 4

Practical Implication

Prefer dot plots and bar charts to pie charts. Prefer faceting to stacking. Use color for categories, not for quantities (unless you use a perceptually uniform scale like viridis).

Weak encoding (angle)

Strong encoding (position)

Aesthetic Standards

Healy’s third pillar is what beginners notice last but readers notice first:

Typography: one font family, bold title, grey subtitle, tiny caption
White space: let the chart breathe
Alignment: consistent margins and axes
Color harmony: restrained, purposeful palettes
Direct labeling: labels next to the data, not in a legend

A clean theme, restrained gridlines, and direct labeling will make a competent chart feel professional.

Part 4

Choosing the Right Chart Type

Yan Holtz’s From Data to Viz decision tree

Chart Type Decision Guide

Start with the data structure

One numeric: histogram, density, boxplot, violin
One categorical: bar chart, lollipop, treemap
Two numeric: scatterplot, hexbin, 2D density
Numeric x categorical: boxplot, violin, ridge, jitter + summary

Then ask what to avoid

Two categorical: heatmap or mosaic before grouped bars
Time series: prefer line, area, slope; avoid bars by default
Map data: normalize before choropleths
Network: avoid force-directed hairballs with large node counts

If the audience’s main task is comparison, choose a chart that encodes values by position.
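The first half of the guide can be written down as a lookup. `suggest_charts()` below is a hypothetical helper, not a function from any package; it only encodes the data-structure rows listed above.

```r
# Hypothetical helper: map variable types to the chart families in the guide.
suggest_charts <- function(x, y = NULL) {
  kind <- function(v) if (is.numeric(v)) "numeric" else "categorical"
  key <- if (is.null(y)) {
    kind(x)
  } else {
    # Sort so that argument order does not matter.
    paste(sort(c(kind(x), kind(y))), collapse = " x ")
  }
  switch(key,
    "numeric"                   = c("histogram", "density", "boxplot", "violin"),
    "categorical"               = c("bar chart", "lollipop", "treemap"),
    "numeric x numeric"         = c("scatterplot", "hexbin", "2D density"),
    "categorical x numeric"     = c("boxplot", "violin", "ridge", "jitter + summary"),
    "categorical x categorical" = c("heatmap", "mosaic")
  )
}

suggest_charts(mtcars$mpg)                       # one numeric variable
suggest_charts(iris$Species, iris$Sepal.Length)  # numeric x categorical
```

A lookup like this is only a starting point; the audience's task still decides which candidate to keep.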

Bookmark these: data-to-viz.com | r-graph-gallery.com | ggplot2 Geom Explorer

The Anti-Pie-Chart Argument

Show the code

library(tidyverse)
library(patchwork)  # for combining p1 + p2 below

dat <- tibble(group = LETTERS[1:6],
              value = c(22, 18, 17, 16, 14, 13))

p1 <- ggplot(dat, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar("y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Pie chart") +
  theme_void(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

p2 <- ggplot(dat, aes(x = value, y = fct_reorder(group, value))) +
  geom_col(fill = "steelblue", width = 0.7) +
  geom_text(aes(label = value), hjust = -0.3, fontface = "bold") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Sorted bar chart", x = NULL, y = NULL) +
  theme_epi553()

p1 + p2 +
  plot_annotation(title = "Pie charts encode angles. Bar charts encode position.",
                  subtitle = "Which one lets you instantly rank the groups?",
                  theme = theme(plot.title = element_text(face = "bold", size = 16),
                                plot.subtitle = element_text(color = "grey40")))

Figure 5

Part 5

The Editorial Style

Cedric Scherer’s approach: from default to publication

The Scherer Approach

Start with the data, not the chart type
Strip ruthlessly: remove gridlines, grey backgrounds, redundant titles
Direct label: put labels next to the lines, not in a legend
Use type as design: bold title, italic subtitle, grey caption
Annotate: callouts, arrows, shaded regions
Iterate: 20+ versions is normal

The Tools

Package	Purpose
`ggtext`	Markdown in titles
`patchwork`	Multi-panel figures
`ggrepel`	Non-overlapping labels
`showtext`	Custom Google Fonts
`MetBrewer`	Art-inspired palettes
`ggdist`	Uncertainty + distributions

Editorial Example: Direct Labels Replace Legends

Show the ggplot2 code

library(ggrepel)

penguin_means <- penguins |>
  drop_na() |>
  summarise(mass = mean(body_mass_g), .by = c(species, year))

ggplot(penguin_means, aes(x = year, y = mass, color = species)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3.5) +
  geom_text_repel(
    data = filter(penguin_means, year == max(year)),
    aes(label = species),
    hjust = 0, nudge_x = 0.15, direction = "y",
    segment.color = NA, fontface = "bold", size = 5
  ) +
  scale_color_manual(values = epi_colors) +
  scale_x_continuous(breaks = 2007:2009,
                     expand = expansion(mult = c(0.05, 0.25))) +
  labs(title = "Average Body Mass of Palmer Penguins, 2007-2009",
       subtitle = "Direct labels replace the legend; gridlines are softened",
       x = NULL, y = "Body mass (g)",
       caption = "Source: palmerpenguins R package") +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank())

Figure 6

Part 6

A Worked Transformation

Same data. Four versions. Watch the evolution.

Iteration 1: The Default

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  theme_grey(base_size = 14)

Figure 7

Honest. Also forgettable.

Iteration 2: Show the Data

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = epi_colors) +
  theme_epi553() +
  theme(legend.position = "none")

Figure 8

Now we see how many penguins are in each species and where the outliers actually live.

Iteration 3: Labels and Theme

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA, color = "grey20") +
  geom_jitter(width = 0.2, alpha = 0.4, color = "grey30", size = 1.5) +
  scale_fill_manual(values = epi_colors) +
  labs(title = "Body Mass of Palmer Penguins by Species",
       subtitle = "Boxplots with individual observations overlaid",
       x = NULL, y = "Body mass (g)") +
  theme(legend.position = "none")

Figure 9

Iteration 4: Editorial Polish

Show the code

library(ggdist)

ggplot(drop_na(penguins, body_mass_g),
       aes(x = species, y = body_mass_g, fill = species)) +
  stat_halfeye(adjust = 0.5, width = 0.6, .width = 0,
               justification = -0.3, point_color = NA) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.05, alpha = 0.3, size = 1.2) +
  scale_fill_manual(values = epi_colors) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off") +
  labs(title = "Gentoo Penguins Are Substantially Heavier",
       subtitle = "Distribution, median, and individual observations of body mass by species",
       x = NULL, y = "Body mass (g)",
       caption = "Source: palmerpenguins R package") +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank())

Figure 10

The final chart shows the distribution (raincloud), the summary (boxplot), the raw data (jitter), and a conclusion baked into the title. That is the difference between a chart and a finding.

Iterations 1 and 2: From Default to Honest

Figure 11

Iterations 3 and 4: From Clean to Editorial

Figure 12

Foundations Summary

Build in layers: start with data, mapping, and geometry
Show the data: prefer raw observations plus a summary
Use strong encodings: position usually beats area and color
Quantify uncertainty: CIs and distribution displays matter

Label directly: reduce legend lookups when you can
Strip clutter: remove anything that does not aid comparison
Write takeaway titles: titles should state the finding
Iterate: your fourth version is often the best one

Part 7

Color, Type, and Accessibility

Choosing color is a design and an ethics decision.

Color Principles

The Four Types of Color Scales

Categorical / Qualitative: distinct hues, no ranking
- Set2, Dark2, Okabe-Ito
Sequential: light to dark, one or two hues
- viridis, mako, Blues
Diverging: two hues meeting at a neutral midpoint
- RdBu, BrBG (only when there is a meaningful zero)
Highlight: one color for the story, grey for context

Why Colorblind Safety Matters

About 8% of men and 0.5% of women have some form of color vision deficiency.

Figure 14

Rule of thumb: use viridis-family palettes by default. Check with colorBlindness::cvdPlot() or coblis.com.
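For categorical data, the Okabe-Ito palette listed earlier is available in base R (version 4.0 or later) through `palette.colors()`. The sketch below uses the built-in `iris` data, and `demo_colors` is an illustrative name, not the deck's own palette.

```r
library(ggplot2)

# Okabe-Ito: a colorblind-safe qualitative palette, returned as a named vector.
okabe_ito <- palette.colors(palette = "Okabe-Ito")
print(okabe_ito)

# Pick three hues by name; unname() so ggplot2 assigns them in level order
# instead of trying to match the color names to the species names.
demo_colors <- unname(okabe_ito[c("orange", "skyblue", "bluishgreen")])

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2, alpha = 0.8) +
  scale_color_manual(values = demo_colors)
```

It is still worth running the finished figure through a simulator, since point size and transparency also affect how distinguishable the colors are.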

Typography and Direct Labeling

Typography Rules

One font family consistently
Title: bold, slightly larger
Subtitle: regular, grey
Caption: small, grey, data source
Avoid serifs in dense charts
Good sans-serifs: Atkinson Hyperlegible, Inter, Source Sans Pro

Direct Labeling

Put the label next to the thing it labels.

Legends force the eye to bounce between chart and key.

Direct labels eliminate that bounce.

# Instead of a legend:
geom_text_repel(
  data = filter(df, year == max(year)),
  aes(label = group)
)

The Visualization Checklist

Before you submit a chart, ask:

Does the title state the finding, not the topic?
Is the y-axis at zero (for bar charts) or appropriately scaled?
Are units labeled?
Is there a source caption?
Is the color palette colorblind-safe?
Are uncertainty intervals shown?
Could a reader explain the chart in one sentence?

Part 8

Modern Techniques and Advanced Aesthetics

The “show me what is possible” tour.

8.1 Highlighting with gghighlight

Many groups, but only a few matter to your story. Highlighting fades out the context.

library(gghighlight)

gapminder_like <- penguins |>
  drop_na() |>
  summarise(
    mean_mass = mean(body_mass_g),
    .by = c(species, year)
  )

ggplot(gapminder_like,
       aes(x = year, y = mean_mass,
           color = species)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3.5) +
  gghighlight(
    species == "Gentoo",
    unhighlighted_params = list(
      color = "grey80",
      linewidth = 0.8)) +
  scale_color_manual(
    values = c("Gentoo" = "#2E86AB"))

The eye is drawn instantly to the highlighted line, but the reader still has full context.

8.2 Animation with gganimate

Time-series and longitudinal data come alive when animated. Build a standard ggplot, then add a transition_*() layer.

library(gganimate)
library(gapminder)

p <- ggplot(gapminder,
            aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  scale_size(range = c(2, 12), guide = "none") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Year: {frame_time}",
       x = "GDP per capita (log scale)", y = "Life expectancy (years)") +
  theme_epi553() +
  transition_time(year) +        # <-- the key addition: animate over year
  ease_aes("cubic-in-out")

animate(p, nframes = 100, fps = 10, width = 800, height = 500)

Key Transition Functions

Function	What it animates	Example
`transition_time()`	Continuous variable (year, day)	Gapminder bubble chart
`transition_states()`	Discrete variable, pauses on each	Before/after comparison
`transition_reveal()`	Progressive reveal along an axis	Drawing a time series

Tip from Cedric Scherer: animations should serve a narrative purpose. If the same insight is clearer in a static small-multiple, prefer the static version.

8.3 Interactive Charts

plotly: instant interactivity from any ggplot

library(plotly)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                          color = species,
                          text = paste0("Island: ", island,
                                        "<br>Sex: ", sex,
                                        "<br>Year: ", year))) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(values = epi_colors) +
  labs(x = "Bill length (mm)", y = "Body mass (g)") +
  theme_epi553(base_size = 11)

ggplotly(p, tooltip = "text", height = 400)

Figure 16

8.3 Interactive Charts: ggiraph

ggiraph: finer control, hover effects, click events.

Show the ggiraph code

library(ggiraph)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point_interactive(
    aes(tooltip = paste("Species:", species, "<br>Island:", island,
                        "<br>Mass:", body_mass_g, "g"),
        data_id = species),
    size = 2.5, alpha = 0.7
  ) +
  scale_color_manual(values = epi_colors) +
  labs(x = "Bill length (mm)", y = "Body mass (g)") +
  theme_epi553(base_size = 11)

girafe(ggobj = p, height_svg = 3.5, width_svg = 8,
       options = list(
         opts_hover(css = "stroke:black;stroke-width:2px;fill-opacity:1;"),
         opts_hover_inv(css = "opacity:0.2;")
       ))

Figure 17

8.4 Distributions: Raincloud Plots

Boxplots hide bimodality. Histograms hide group differences. Raincloud plots combine a density, a boxplot, and the raw points, so each view covers what the others hide.

Show the code

library(ggdist)

ggplot(drop_na(penguins, body_mass_g),
       aes(x = body_mass_g, y = species, fill = species)) +
  stat_halfeye(adjust = 0.6, .width = 0, justification = -0.2,
               point_color = NA, alpha = 0.7) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.5) +
  geom_jitter(height = 0.07, alpha = 0.3, size = 1.2) +
  scale_fill_manual(values = epi_colors) +
  labs(title = "Raincloud Plot of Penguin Body Mass",
       subtitle = "Distribution + boxplot + raw observations in one chart",
       x = "Body mass (g)", y = NULL) +
  theme(legend.position = "none")

Figure 18

8.4 Distributions: Ridge Plots

When you want to compare distributions across many groups in a small space.

Show the code

library(ggridges)
library(scales)  # for label_dollar()

ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
  geom_density_ridges(alpha = 0.8, scale = 1.1) +
  scale_x_log10(labels = label_dollar()) +
  scale_fill_viridis_d(option = "mako") +
  labs(title = "Distribution of Diamond Prices by Cut",
       subtitle = "Ridge plots make many distributions easy to compare",
       x = "Price (log scale)", y = NULL) +
  theme(legend.position = "none")

Figure 19

8.5 Uncertainty You Can See

Confidence intervals as skinny error bars often get missed. Show the whole distribution.

Show the code

library(ggdist)
library(distributional)

estimates <- tibble(
  predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate  = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se        = c(0.10, 0.08, 0.07, 0.09, 0.04)
)

ggplot(estimates, aes(y = fct_reorder(predictor, estimate),
                      xdist = dist_normal(estimate, se))) +
  stat_halfeye(.width = c(0.5, 0.95), fill = "#2E86AB",
               slab_alpha = 0.6, point_size = 3.5) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
  labs(title = "Coefficient Plot with Full Sampling Distributions",
       subtitle = "Thick band = 50% interval, thin line = 95% interval",
       x = "Log odds ratio", y = NULL)

Figure 20

This is far more honest than a forest plot of point estimates with whiskers, because the reader can see the shape of the uncertainty.

8.6 Patchwork: Composing Multi-Panel Figures

Show the patchwork code

p_scatter <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_manual(values = epi_colors) +
  theme(legend.position = "none") +
  labs(x = "Bill length (mm)", y = "Body mass (g)")

p_top <- ggplot(penguins, aes(bill_length_mm, fill = species)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = epi_colors) +
  theme_void() + theme(legend.position = "none")

p_right <- ggplot(penguins, aes(body_mass_g, fill = species)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = epi_colors) +
  coord_flip() +
  theme_void() + theme(legend.position = "none")

(p_top + plot_spacer() + p_scatter + p_right +
    plot_layout(ncol = 2, widths = c(4, 1), heights = c(1, 4))) +
  plot_annotation(
    title = "Marginal Density + Scatterplot",
    subtitle = "Composed with patchwork: p_top + plot_spacer() + p_scatter + p_right",
    theme = theme(plot.title = element_text(face = "bold", size = 16),
                  plot.subtitle = element_text(color = "grey40", family = "Fira Code",
                                               size = 11))
  )

Figure 21

Patchwork Syntax Cheat Sheet

The arithmetic is intuitive:

p1 + p2
p1 / p2
(p1 | p2) / p3
p1 + p2 + p3 + plot_layout(ncol = 2)
p1 + plot_annotation(title = "Combined Figure")

1: + places plots side by side
2: / stacks plots vertically
3: Parentheses group layouts: two on top, one below
4: plot_layout() controls the grid (rows, columns, widths)
5: plot_annotation() adds a shared title, subtitle, or caption
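A runnable version of the cheat sheet, as a sketch on the built-in `mtcars` data. The plots `p1`, `p2`, and `p3` are throwaway examples, and `tag_levels = "A"` is an extra patchwork option that labels the panels A, B, C.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Two scatterplots on top, one boxplot spanning the bottom row,
# with a shared title and automatic panel tags.
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Fuel economy in mtcars", tag_levels = "A")

combined
```

Because `/` binds more tightly than `+` in R, the annotation applies to the whole layout rather than to `p3` alone.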

8.7 Annotation as a First-Class Citizen

The fastest way to make a chart “editorial”: annotate the finding directly on the chart.

Show the code

library(ggtext)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.7, size = 2.2) +
  annotate("curve", x = 175, y = 5500, xend = 210, yend = 5100,
           arrow = arrow(length = unit(0.25, "cm")),
           curvature = -0.3, color = "grey30", linewidth = 0.6) +
  annotate("label", x = 173, y = 5650,
           label = "Gentoos cluster in the\nheavy / long-flipper corner",
           hjust = 0, size = 4.2, color = "grey20", lineheight = 0.95,
           fill = "#fff8e1", label.size = 0, fontface = "italic") +
  scale_color_manual(values = epi_colors) +
  labs(title = "Body mass scales with flipper length, but <span style='color:#46166B;'>**species matter**</span>",
       subtitle = "Annotated with a callout arrow and label box",
       x = "Flipper length (mm)", y = "Body mass (g)",
       color = NULL) +
  theme(plot.title = element_markdown(face = "bold", size = 16),
        legend.position = "top")

Figure 22

8.8 Spatial Visualization with sf

Public health data are often spatial. The sf package stores geometries in ordinary data frames, and ggplot2 draws them directly with geom_sf().

Show the spatial code

library(sf)
library(tigris)
options(tigris_use_cache = TRUE)

ny_counties <- counties(state = "NY", cb = TRUE, class = "sf")
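set.seed(553)  # arbitrary seed so the simulated values are reproducible
ny_counties$fmd_prev <- runif(nrow(ny_counties), 8, 18)  # hypothetical FMD %, not real data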

ggplot(ny_counties) +
  geom_sf(aes(fill = fmd_prev), color = "white", linewidth = 0.2) +
  scale_fill_viridis_c(option = "rocket", direction = -1,
                       name = "FMD %") +
  labs(title = "Frequent Mental Distress Prevalence by NY County",
       subtitle = "Hypothetical data for demonstration",
       caption = "Source: U.S. Census Bureau (TIGER/Line)") +
  theme_void(base_size = 14) +
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(color = "grey40"))

Figure 23

8.9 Tables as Visualizations: gt + gtExtras

Sometimes the right visualization is a table. gt makes editorial-quality tables with inline plots.

Show the gt code

library(gt)
library(gtExtras)

penguins |>
  drop_na() |>
  summarise(
    n = n(),
    mean_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    masses = list(body_mass_g),
    .by = species
  ) |>
  gt() |>
  gt_plt_dist(masses, type = "density", fill_color = "#2E86AB") |>
  fmt_number(mean_mass, decimals = 0) |>
  fmt_number(sd_mass, decimals = 0) |>
  cols_label(species = "Species", n = "N",
             mean_mass = "Mean (g)", sd_mass = "SD (g)",
             masses = "Distribution") |>
  tab_header(title = md("**Penguin Body Mass by Species**"),
             subtitle = "Summary statistics with inline density plots") |>
  gt_theme_538() |>
  tab_options(table.width = pct(90))

Table 1

Penguin Body Mass by Species
Summary statistics with inline density plots

Species	N	Mean (g)	SD (g)
Adelie	146	3,706	459
Gentoo	119	5,092	501
Chinstrap	68	3,733	384

8.10 Heatmaps and Correlation Matrices

Heatmaps turn a matrix of numbers into a pattern you can see. Essential for correlation tables, confusion matrices, and time-by-group summaries.

Show the code

# Correlation matrix of numeric penguin variables
cor_data <- penguins |>
  drop_na() |>
  select(where(is.numeric)) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "cor")

ggplot(cor_data, aes(x = var1, y = var2, fill = cor)) +
  geom_tile(color = "white", linewidth = 0.8) +
  geom_text(aes(label = round(cor, 2),
                color = abs(cor) > 0.6),
            size = 4.5, fontface = "bold") +
  scale_fill_gradient2(low = "#D55E00", mid = "white", high = "#0072B2",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  scale_color_manual(values = c("TRUE" = "white", "FALSE" = "grey20"),
                     guide = "none") +
  scale_x_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
  scale_y_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
  labs(title = "Correlation Heatmap of Penguin Measurements",
       subtitle = "Color intensity encodes strength; text encodes exact value",
       x = NULL, y = NULL) +
  coord_fixed() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
        axis.text.y = element_text(size = 10),
        panel.grid = element_blank())

Figure 24

8.11 Dumbbell Charts: Showing Change

When you need to show the difference between two time points or conditions, dumbbell charts are more effective than grouped bars.

Show the code

# Simulated epi example: disease rates before/after intervention
intervention_data <- tibble(
  county = c("Albany", "Saratoga", "Rensselaer", "Schenectady",
             "Columbia", "Greene", "Warren", "Washington"),
  before = c(15.2, 12.8, 18.1, 16.5, 11.3, 14.7, 9.8, 13.2),
  after  = c(11.1, 10.2, 12.4, 13.8, 9.5, 11.0, 8.1, 10.9)
) |>
  mutate(change = after - before,
         county = fct_reorder(county, change))

ggplot(intervention_data) +
  geom_segment(aes(x = before, xend = after,
                   y = county, yend = county),
               color = "grey60", linewidth = 1.2) +
  geom_point(aes(x = before, y = county), color = "#D55E00",
             size = 4) +
  geom_point(aes(x = after, y = county), color = "#0072B2",
             size = 4) +
  annotate("text", x = 19, y = 8.3, label = "Before",
           color = "#D55E00", fontface = "bold", size = 4.5) +
  annotate("text", x = 19, y = 7.7, label = "After",
           color = "#0072B2", fontface = "bold", size = 4.5) +
  labs(title = "Every County Improved After the Intervention",
       subtitle = "Rate per 1,000 population, before vs. after community health program",
       x = "Rate per 1,000", y = NULL) +
  scale_x_continuous(limits = c(7, 20))

Figure 25

Dumbbell charts encode direction, magnitude, and rank simultaneously. Far more effective than grouped bar charts for before/after comparisons.

8.12 Build Your Own Reusable Theme

Package your style as a function. Reuse it across every chart.

theme_epi553 <- function(base_size = 14) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 4,
                                color = "#1a1a2e"),
      plot.subtitle = element_text(color = "grey40"),
      plot.caption = element_text(color = "grey60", size = base_size - 3,
                                  hjust = 0),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "grey92"),
      axis.title = element_text(color = "grey30"),
      strip.text = element_text(face = "bold"),
      legend.position = "top",
      plot.title.position = "plot",
      plot.caption.position = "plot"
    )
}

1: Wrap as a function so you can reuse it with one call
2: Start from theme_minimal() as a clean base
3: Typography: bold title, muted subtitle and caption
4: Grid cleanup: remove minor gridlines, soften major ones
5: Layout: legend on top, plot-aligned title and caption

Practical tip: drop theme_epi553() into a themes.R file. Source it from every analysis. Consistent figures without effort.
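Once the theme is a function, `theme_set()` can make it the session default so individual plots need no theme call. In this sketch, `theme_demo()` is a shortened stand-in defined inline so the example runs on its own; in practice you would `source("themes.R")` and use your real theme.

```r
library(ggplot2)

# Shortened stand-in for a course theme (illustrative, not the full version).
theme_demo <- function(base_size = 14) {
  theme_minimal(base_size = base_size) +
    theme(plot.title = element_text(face = "bold"),
          panel.grid.minor = element_blank(),
          legend.position = "top")
}

# theme_set() changes the default for every later plot and returns
# the previous default so it can be restored.
old_theme <- theme_set(theme_demo())

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "No theme call needed here", color = "Cylinders")

theme_set(old_theme)  # restore the previous default
```

A global default keeps figures consistent; adding `+ theme(...)` to a single plot still overrides it locally.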

8.13 The Modern Workflow

Sketch first. On paper. What is the story?
Prototype in ggplot2 with default themes. Get the geometry right.
Iterate the encoding. Try three chart types. Pick one.
Layer in annotations. Title is the finding. Direct labels. Callouts.
Polish the theme. Fonts, colors, spacing.
Export at the right resolution.

ggsave("figure_1.png", width = 8, height = 5, dpi = 300, bg = "white")
ggsave("figure_1.svg", width = 8, height = 5)        # vector for editorial; needs the svglite package
ggsave("figure_1.pdf", width = 8, height = 5,
       device = cairo_pdf)                             # high-quality PDF

Step 7: Show it to a colleague. If they cannot explain the chart in 10 seconds, iterate.

Part 9

Visualizing Regression Models

You have spent the semester building models. Now make them visible.

The Problem with Regression Tables

What reviewers see

A wall of numbers:

Term	Estimate	SE	p
(Intercept)	-1.23	0.41	0.003
Smoking	0.65	0.10	<0.001
Exercise	-0.45	0.08	<0.001
Income	-0.20	0.07	0.004
Sleep	-0.30	0.09	0.001
Age	0.05	0.04	0.211

What readers understand

A coefficient plot communicates direction, magnitude, uncertainty, and significance in one glance. A table requires row-by-row mental math.
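As a sketch, the table above can be turned into such a plot in a few lines. The estimates and standard errors are copied from the table (intercept omitted); the normal-approximation interval and the names `coefs` and `excludes_zero` are assumptions made for the illustration.

```r
library(ggplot2)

# Estimates and standard errors from the regression table.
coefs <- data.frame(
  term     = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se       = c(0.10, 0.08, 0.07, 0.09, 0.04)
)

# Normal-approximation 95% intervals.
z <- qnorm(0.975)
coefs$lower <- coefs$estimate - z * coefs$se
coefs$upper <- coefs$estimate + z * coefs$se
coefs$excludes_zero <- coefs$lower > 0 | coefs$upper < 0

ggplot(coefs, aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = lower, xmax = upper, color = excludes_zero)) +
  scale_color_manual(values = c("TRUE" = "#0072B2", "FALSE" = "grey60"),
                     guide = "none") +
  labs(x = "Estimate (95% CI)", y = NULL)
```

Only the Age interval crosses zero, which matches the one non-significant p-value in the table.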

Forest Plots from Real Models

Fit a logistic regression on the penguins data and plot the odds ratios directly.

Show the full code

library(broom)

# Fit a logistic regression: predict heavy penguin (above median mass)
model_data <- penguins |>
  drop_na() |>
  mutate(heavy = as.integer(body_mass_g > median(body_mass_g)))

fit <- glm(heavy ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
             species + sex,
           data = model_data, family = binomial)

# Tidy the model output (use Wald CIs for stability)
model_tidy <- tidy(fit, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    conf.low = exp(log(estimate) - 1.96 * std.error),
    conf.high = exp(log(estimate) + 1.96 * std.error),
    term = case_match(term,
      "bill_length_mm" ~ "Bill length (mm)",
      "bill_depth_mm" ~ "Bill depth (mm)",
      "flipper_length_mm" ~ "Flipper length (mm)",
      "speciesChinstrap" ~ "Chinstrap vs. Adelie",
      "speciesGentoo" ~ "Gentoo vs. Adelie",
      "sexmale" ~ "Sex (male vs. female)"
    ),
    significant = p.value < 0.05
  )

ggplot(model_tidy, aes(x = estimate, y = fct_reorder(term, estimate),
                       color = significant)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  size = 0.8, linewidth = 1.1) +
  geom_text(aes(label = sprintf("OR = %.2f", estimate)),
            vjust = -1, size = 3.8, show.legend = FALSE) +
  scale_color_manual(values = c("TRUE" = "#e64173", "FALSE" = "grey60"),
                     guide = "none") +
  scale_x_log10() +
  labs(title = "Predictors of Above-Median Body Mass (Logistic Regression)",
       subtitle = "Odds ratios with 95% CI on log scale; red = p < 0.05",
       x = "Odds Ratio (log scale)", y = NULL,
       caption = "Model: glm(heavy ~ bill + flipper + species + sex, family = binomial)")

Figure 27
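The Wald interval used in the code above is built on the log-odds scale and then exponentiated. The sketch below checks that recipe against base R's `confint.default()`, using a small model on the built-in `mtcars` data (an illustration only, not the penguins model).

```r
# Small logistic model on built-in data.
fit_demo <- glm(am ~ wt, data = mtcars, family = binomial)

beta <- coef(fit_demo)
se   <- sqrt(diag(vcov(fit_demo)))
z    <- qnorm(0.975)  # about 1.96

# Build the interval on the log-odds scale, then exponentiate the endpoints.
or_table <- data.frame(
  term      = names(beta),
  estimate  = exp(beta),
  conf.low  = exp(beta - z * se),
  conf.high = exp(beta + z * se),
  row.names = NULL
)
print(or_table)

# Base R's Wald intervals give the same answer.
exp(confint.default(fit_demo))
```

Exponentiating the endpoints keeps the interval positive and makes it asymmetric around the odds ratio, which is why the forest plot uses a log-scaled axis.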

Predicted Probability Curves

Show what your model predicts across the range of a key variable, holding the other numeric predictors at their means and the categorical predictors at a fixed level (here Adelie, female).

Show the code

library(broom)

# Generate predictions across flipper length range
pred_grid <- tibble(
  flipper_length_mm = seq(170, 235, length.out = 200),
  bill_length_mm = mean(model_data$bill_length_mm),
  bill_depth_mm = mean(model_data$bill_depth_mm),
  species = "Adelie",
  sex = "female"
)

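# Predict on the link (log-odds) scale, build the interval there, then
# back-transform with plogis() so the band stays inside [0, 1]
preds <- augment(fit, newdata = pred_grid, type.predict = "link",
                 se_fit = TRUE) |>
  mutate(lower = plogis(.fitted - 1.96 * .se.fit),
         upper = plogis(.fitted + 1.96 * .se.fit),
         .fitted = plogis(.fitted))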

ggplot(preds, aes(x = flipper_length_mm, y = .fitted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper),
              fill = "#2E86AB", alpha = 0.2) +
  geom_line(color = "#2E86AB", linewidth = 1.3) +
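  # Top rug = above-median penguins, bottom rug = the rest ("t"/"b" rugs use x only)
  geom_rug(data = filter(model_data, heavy == 1), aes(x = flipper_length_mm),
           sides = "t", alpha = 0.15, color = "grey40", inherit.aes = FALSE) +
  geom_rug(data = filter(model_data, heavy == 0), aes(x = flipper_length_mm),
           sides = "b", alpha = 0.15, color = "grey40", inherit.aes = FALSE) +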
  scale_y_continuous(labels = label_percent()) +
  labs(title = "Predicted Probability of Above-Median Mass by Flipper Length",
       subtitle = "Logistic regression (Adelie, female); rug marks show observed data",
       x = "Flipper length (mm)", y = "Predicted probability",
       caption = "Shaded band = approximate 95% confidence interval")

Figure 28

Predicted probability curves are one of the most effective ways to communicate logistic regression results to non-statisticians.

Marginal Effects with ggeffects

The ggeffects package automates predicted value plots for many model classes.

library(ggeffects)

# One line to get predicted values
preds <- ggpredict(fit,
  terms = c("flipper_length_mm",
            "sex"))

# Built-in plot method
plot(preds) +
  labs(
    title = "Marginal Effect of Flipper Length by Sex"
  )

# Or extract data for custom ggplot
as.data.frame(preds) |>
  ggplot(aes(x, predicted,
             color = group, fill = group)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.1)

Diagnostic Plots: the `performance` + `see` Packages

The performance package (from easystats) computes model diagnostics; the see package draws them with ggplot2.

# Fit a linear model for diagnostics demo
lm_fit <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species,
             data = drop_na(penguins))

library(performance)
library(see)

check_model(lm_fit, check = c("linearity", "normality", "qq", "homogeneity"))

Figure 30

check_model() replaces the base R plot(model) with a modern, multi-panel diagnostic dashboard. One function call, publication-ready output.
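
To see roughly what those panels display, the underlying quantities can be computed by hand. A minimal base R sketch, assuming the built-in mtcars data as a stand-in (demo_fit and diag_df are illustrative names; the exact quantities check_model() plots may differ in detail):

```r
# Sketch: the ingredients behind common linear-model diagnostic panels.
# mtcars is a stand-in dataset (assumption); the lecture uses penguins.
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted    = fitted(demo_fit),     # x-axis of residual-vs-fitted panels
  resid     = resid(demo_fit),      # raw residuals (linearity check)
  std_resid = rstandard(demo_fit)   # standardized residuals
)

# Square root of absolute standardized residuals (homogeneity of variance check)
diag_df$sqrt_abs_std <- sqrt(abs(diag_df$std_resid))

# Theoretical normal quantiles aligned with each residual's rank (Q-Q check)
diag_df$theoretical <- qnorm(ppoints(nrow(diag_df)))[
  rank(diag_df$std_resid, ties.method = "first")
]

# Base R draws comparable panels with: plot(demo_fit, which = 1:3)
```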

Model Comparison Visualization

When comparing nested or competing models, plot the key estimate from each model side by side to see how it shifts as covariates are added.

# Fit competing models
m1 <- glm(heavy ~ flipper_length_mm, data = model_data, family = binomial)
m2 <- glm(heavy ~ flipper_length_mm + species, data = model_data, family = binomial)
m3 <- glm(heavy ~ flipper_length_mm + species + sex, data = model_data, family = binomial)
m4 <- fit  # full model from earlier

# Helper to get Wald CIs
tidy_wald <- function(mod, label) {
  tidy(mod, exponentiate = TRUE) |>
    mutate(conf.low = exp(log(estimate) - 1.96 * std.error),
           conf.high = exp(log(estimate) + 1.96 * std.error),
           model = label)
}

# Compare coefficients across models
models_tidy <- bind_rows(
  tidy_wald(m1, "Model 1:\nFlipper only"),
  tidy_wald(m2, "Model 2:\n+ Species"),
  tidy_wald(m3, "Model 3:\n+ Sex"),
  tidy_wald(m4, "Model 4:\n+ Bill measures")
) |>
  filter(term == "flipper_length_mm")

ggplot(models_tidy, aes(x = estimate, y = model)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  color = "#2E86AB", size = 1, linewidth = 1.1) +
  geom_text(aes(label = sprintf("OR = %.2f (%.2f, %.2f)",
                                estimate, conf.low, conf.high)),
            vjust = -1.2, size = 3.8, color = "grey30") +
  scale_x_log10() +
  labs(title = "How Stable Is the Flipper Length Effect Across Models?",
       subtitle = "Odds ratio for flipper_length_mm as covariates are added",
       x = "Odds Ratio (log scale)", y = NULL,
       caption = "Stable estimates across nested models suggest robust association")

Figure 31
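
The plot above tracks one coefficient across models. Fit statistics can be tabulated alongside it. A minimal sketch, assuming the built-in mtcars data as a stand-in for the penguin models (m_small, m_large, and fit_stats are illustrative names):

```r
# Sketch: fit statistics for two nested logistic models.
# mtcars is a stand-in dataset (assumption), not the lecture's model_data.
m_small <- glm(am ~ wt, data = mtcars, family = binomial)
m_large <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# One row per model: lower AIC/BIC and lower deviance indicate better fit
fit_stats <- data.frame(
  model    = c("wt only", "wt + hp"),
  aic      = c(AIC(m_small), AIC(m_large)),
  bic      = c(BIC(m_small), BIC(m_large)),
  deviance = c(deviance(m_small), deviance(m_large))
)

# Likelihood ratio test for the added term (valid because the models are nested)
lrt <- anova(m_small, m_large, test = "Chisq")
```

A table like fit_stats can be plotted with geom_col() or geom_point(), one panel per statistic, if a figure is preferred over a table.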

The #30DayChartChallenge

Every April, the data visualization community participates in the #30DayChartChallenge: one prompt per day, one chart per day, shared on social media. It was created in 2021 by Cedric Scherer and Dominic Roye, inspired by the #30DayMapChallenge.

Why It Matters for You

Forces you to try chart types you would never pick (waffle charts, bump charts, slope graphs, treemaps)
Builds a public portfolio of your work
Exposes you to how the global community solves the same prompt differently
Many prompts are epi-relevant: uncertainty, distributions, time series, part-to-whole, relationships

Five Categories (based on “The Graphic Continuum”)

Comparisons (days 1-6)
Distributions (days 7-12)
Relationships (days 13-18)
Time series (days 19-24)
Uncertainties (days 25-30)

Participants choose their own data and tools freely. The 2026 edition is at github.com/30DayChartChallenge/Edition2026.

Techniques You Can Steal

Prompt	Chart Type	R Package or Function
Part-to-whole	Waffle chart	`waffle`
Ranking	Bump chart	`ggbump`
Slope	Slope chart	`geom_segment`
Circular	Polar bar	`coord_polar()`
Uncertainty	Gradient intervals	`ggdist`
Relationships	Network	`ggraph` + `tidygraph`
Neo-geometric	Voronoi	`ggforce`
Storytelling	Annotated timeline	`ggtext` + `annotate`
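
As one example from the table, a slope chart needs nothing beyond geom_segment(). A minimal sketch, assuming ggplot2 (3.4 or later) is installed and using R's built-in sleep data rather than a course dataset (sleep_wide and p are illustrative names):

```r
library(ggplot2)

# Built-in paired data: extra hours of sleep for 10 patients under two drugs
sleep_wide <- reshape(sleep, idvar = "ID", timevar = "group", direction = "wide")
# Columns are now ID, extra.1 (drug 1), extra.2 (drug 2)

p <- ggplot(sleep_wide) +
  # One segment per patient, running from the drug 1 value to the drug 2 value
  geom_segment(aes(x = 1, xend = 2, y = extra.1, yend = extra.2),
               color = "#2E86AB", linewidth = 0.9) +
  geom_point(aes(x = 1, y = extra.1), color = "#2E86AB", size = 2.5) +
  geom_point(aes(x = 2, y = extra.2), color = "#2E86AB", size = 2.5) +
  # Direct labels at the right end replace a legend
  geom_text(aes(x = 2.05, y = extra.2, label = ID), hjust = 0, size = 3.5) +
  scale_x_continuous(breaks = c(1, 2), labels = c("Drug 1", "Drug 2"),
                     limits = c(0.8, 2.2)) +
  labs(x = NULL, y = "Extra sleep (hours)") +
  theme_minimal()
```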

Where to Explore

github.com/30DayChartChallenge – all editions since 2021
Search #30DayChartChallenge on Twitter/X or Mastodon
Cedric Scherer’s contributions – R code for every entry
R Graph Gallery – reproducible R code for all chart types
ggplot2 Geom Explorer – find the right geom for your data

Bonus: Waffle Chart

A waffle chart is a part-to-whole alternative to pie charts, popularized by the #30DayChartChallenge. Each square represents a fixed number of units (here, 5 penguins).

library(waffle)

penguin_counts <- penguins |>
  drop_na() |>
  count(species) |>
  mutate(n_scaled = round(n / 5))  # each square = 5 penguins

waffle(
  # Named vector built from the data, so labels cannot drift from counts
  setNames(penguin_counts$n_scaled, as.character(penguin_counts$species)),
  rows = 5,
  size = 1,
  colors = c("#FF6B35", "#A23B72", "#2E86AB"),
  title = "Palmer Penguins by Species",
  xlab = "1 square = 5 penguins"
) +
  theme(plot.title = element_text(face = "bold", size = 16),
        legend.position = "bottom")

Figure 32

Challenge for you: pick a prompt from the #30DayChartChallenge and create a chart using a dataset from this semester. Share it!

Lecture Summary

Build plots in layers
Show raw data whenever possible
Use uncertainty displays, not just point estimates
Prefer position over area for comparison
Make titles state the conclusion

Use colorblind-safe palettes by default
Direct-label when the audience benefits
Remove nonessential visual clutter
Visualize models, do not just tabulate them
Iterate until the story is obvious

References

Books

Healy, K. (2018). Data Visualization: A Practical Introduction. socviz.co
Wilke, C. O. (2019). Fundamentals of Data Visualization. clauswilke.com/dataviz
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. ggplot2-book.org

Foundational

Tufte, E. (2001). The Visual Display of Quantitative Information (2nd ed.)
Cleveland, W. S., & McGill, R. (1984). Graphical perception. JASA, 79(387), 531-554.

Online Resources

Holtz, Y. R Graph Gallery
Holtz, Y. & Healy, C. From Data to Viz
Holtz, Y. ggplot2 Geom Explorer
Scherer, C. Portfolio

Next Lecture (April 28)

Course Review: putting the entire semester together.

Thank You

EPI 553 | University at Albany
