Visualizing transitions in time using R



Working with longitudinal data is very exciting, as we can see how people, societies and institutions change over time. Often it helps to visualize this change in order to better understand what is happening and to help tell a good story. While visualizing change over time for continuous variables is relatively straightforward, it is a little more difficult when we have categorical variables. One solution is the Sankey plot, also known as a river or alluvial graph. Here I’m going to show you how to create such a plot in R.

Here are the packages we will be using:

# package for data cleaning and graphs
library(tidyverse)

# package for alluvial plots
library(ggalluvial)

# optional

# nice themes
library(ggthemes)

# nice colors
library(viridis)

As an example data source I will be using the survey outcomes from five waves of the Understanding Society Innovation Panel. This is part of some methodological work we have been doing to understand how people answer using different modes of interview (like face to face and web) at different times in a longitudinal study.

# load local data
data <- read.csv("./data/ex_trans.csv")

# have a look at our data
head(data)

##   id          out_5   out_6          out_7          out_8      out_9
## 1  1            F2f     Web            Web            Web        Web
## 2  2            F2f     Web            Web            Web        Web
## 3  3 Other non-resp     Web            Web            Web        Web
## 4  4 Other non-resp     Web Other non-resp Other non-resp Not-issued
## 5  5            F2f     F2f            Web            Web        Web
## 6  6 Other non-resp Refusal            Web        Refusal        Web

Cleaning the data

So we have the data in wide format: each row is an individual (identified by “id”) and the columns out_5 to out_9 hold the outcomes at five waves (5 to 9).

The first thing we need to do is to summarize the data. What we need is a count of every transition pattern that occurs in our data. There are a couple of ways of doing that but probably the easiest is to just count() the outcome variables. We also create a unique id, which is just the row number in the summarized dataset, so we can easily identify the unique patterns later on.

%>% is called a pipe and just lets us chain multiple commands in a way that is easy to read. It takes the result of what is on its left and passes it as the first input to the command on its right.

# calculate all combinations, make unique id and save
data2 <- count(data, out_5, out_6, out_7, out_8, out_9) %>% 
  mutate(id = row_number()) # create new id variable

# look at data
head(data2)

##   out_5 out_6 out_7      out_8          out_9  n id
## 1   F2f   F2f   F2f        F2f            F2f 73  1
## 2   F2f   F2f   F2f        F2f     Not-issued  6  2
## 3   F2f   F2f   F2f        F2f Other non-resp 10  3
## 4   F2f   F2f   F2f        F2f        Refusal  2  4
## 5   F2f   F2f   F2f        F2f            Web 12  5
## 6   F2f   F2f   F2f Not-issued            F2f  1  6
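
To see what count() and the pipe are doing, here is a minimal sketch on a made-up dataset (the object names toy, toy_counts and toy_nested are hypothetical and not part of the original data). It shows that only the patterns that actually occur get a row, and that the piped version is the same as nesting the commands.

```r
library(dplyr)

# hypothetical toy data: 5 people observed at 2 waves
toy <- data.frame(
  id    = 1:5,
  out_5 = c("F2f", "F2f", "Web", "Web", "Web"),
  out_6 = c("Web", "Web", "Web", "Web", "F2f")
)

# piped version: count the patterns, then add a pattern id
toy_counts <- count(toy, out_5, out_6) %>%
  mutate(id = row_number())

# nested version: the same steps written inside out
toy_nested <- mutate(count(toy, out_5, out_6), id = row_number())

# one row per observed pattern; "F2f" followed by "F2f" never
# occurs in the toy data, so it does not appear
toy_counts

# the counts add up to the number of people
sum(toy_counts$n) == nrow(toy)

# both ways of writing it give the same result
identical(toy_counts, toy_nested)
```

The id created here identifies a pattern, not a person, which is what we need for drawing the flows later on.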

So far so good. We have a nice dataset with all the combinations of transitions observed in the data. In order to do the graph we need to restructure the data to the long format. Again, there are many ways to do that but the easiest one is to use the gather() command. Here we just say what the data is, the names of the two new variables and any variables we do not want to restructure (-n, -id tells R to leave these variables out when it restructures the data). Note that the first new variable (here called “value”) stores the old column names, while the second (here called “key”) stores the outcomes.

# make long data
data3 <- gather(data2, value, key, -n, -id) 

# look at data
head(data3)

##    n id value key
## 1 73  1 out_5 F2f
## 2  6  2 out_5 F2f
## 3 10  3 out_5 F2f
## 4  2  4 out_5 F2f
## 5 12  5 out_5 F2f
## 6  1  6 out_5 F2f
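
As a side note, gather() still works but tidyr now recommends pivot_longer() for this kind of restructuring. Below is a small sketch, using a made-up summary table (toy_counts is hypothetical), of how the same step could be written with pivot_longer(). The column names “value” and “key” are kept only so that the rest of the code in this post would still run.

```r
library(dplyr)
library(tidyr)

# hypothetical toy summary: 2 patterns over 2 waves
toy_counts <- data.frame(
  out_5 = c("F2f", "Web"),
  out_6 = c("Web", "Web"),
  n     = c(2L, 3L),
  id    = 1:2
)

# the approach used in this post
long_gather <- gather(toy_counts, value, key, -n, -id)

# possible alternative with pivot_longer()
long_pivot <- pivot_longer(toy_counts,
                           cols = -c(n, id),     # restructure everything except n and id
                           names_to = "value",   # old column names (out_5, out_6)
                           values_to = "key")    # the outcomes

# the rows come out in a different order, so sort before comparing
arrange(long_gather, value, id)
arrange(long_pivot, value, id)
```

The two results hold the same information; pivot_longer() returns a tibble and orders the rows by pattern rather than by wave.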

So the new data tells us the wave (“value”), the outcome in that wave (“key”) as well as how many cases had that pattern of transitions (“n”). We can link the different waves of the same pattern in the long format using the “id” variable.

Next we do some data cleaning. We make a numeric version of the wave variable, make the outcome a factor and change the order of its categories (to make the graph easier to interpret).

# clean up data for graph
data4 <- data3 %>%
  mutate(wave = as.numeric(str_remove(value, "out_")),
         key = as.factor(key),
         key = fct_relevel(key, "Web", "F2f", "Refusal",
                           "Other non-resp")) 

# look at the data
head(data4)

##    n id value key wave
## 1 73  1 out_5 F2f    5
## 2  6  2 out_5 F2f    5
## 3 10  3 out_5 F2f    5
## 4  2  4 out_5 F2f    5
## 5 12  5 out_5 F2f    5
## 6  1  6 out_5 F2f    5
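
The reordering is not visible in head(), so here is a small sketch of what fct_relevel() does to the categories (the vector outcome is made up for illustration). The levels we name are moved to the front in the order given, and any level we do not name, here “Not-issued”, keeps its place after them.

```r
library(forcats)

# hypothetical factor with the five outcome categories
outcome <- factor(c("F2f", "Not-issued", "Other non-resp", "Refusal", "Web"))

# default order is alphabetical
levels(outcome)

# move the named levels to the front, in this order
releveled <- fct_relevel(outcome, "Web", "F2f", "Refusal", "Other non-resp")

# "Not-issued" was not named, so it ends up last
levels(releveled)
```

The order of the levels is what decides the stacking order of the bars and the order of the legend in the graph.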

Doing the graph

We have all we need for our graph now! The ggalluvial package helps us to make the graph we want but in the background it uses the ggplot2 package. This makes it very flexible and extensible (as we will see soon).

Using the ggplot() command we give it the dataset and then the main dimensions of the graph (in aes()):

  • on the x axis we want time, or “wave” in this case
  • on the y axis we want the frequency, or “n” in our case
  • we want the bars at each wave (stratum) as well as the fill colour to depend on our outcome, which here is called “key”
  • alluvium is used to define the transitions. We use the “id” for that
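
Before plotting, it can be worth checking that the long data has the shape these dimensions assume. The sketch below uses a made-up long dataset (toy_long is hypothetical) and only dplyr: every pattern should appear once per wave, and because the height of each bar is the sum of “n” at that wave, the totals should be the same at every wave.

```r
library(dplyr)

# hypothetical long data: 2 patterns (id) observed at 2 waves
toy_long <- data.frame(
  id   = c(1, 2, 1, 2),
  wave = c(5, 5, 6, 6),
  key  = c("F2f", "Web", "Web", "Web"),
  n    = c(2, 3, 2, 3)
)

# each id/wave pair should appear only once
any(duplicated(toy_long[c("id", "wave")]))

# total height of the bar at each wave
totals <- toy_long %>%
  group_by(wave) %>%
  summarise(total = sum(n))

totals

# all waves should have the same total
n_distinct(totals$total) == 1
```

If the totals differ between waves, some patterns are probably missing rows for some waves.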

We use two different “geoms” for our graph. First, geom_stratum() creates the bars at each wave. Second, geom_flow() creates the transitions between them. Here we make the bars slightly transparent to make the graph easier to read (alpha = .5).

So the syntax looks something like this:

# make basic transition graph
plot <- ggplot(data4, aes(x = wave, y = n, 
                          stratum = key, fill = key,
                          alluvium = id)) +
  geom_stratum(alpha = .5) +
  geom_flow() 

# print
plot
[Figure: visualized transitions]

This looks good but we can make it even better. We will choose a minimalistic theme (theme_tufte()), add labels and give it nicer colors. We can just add these to the previously saved object called (creatively) “plot”.

# enhance the look of the graph
plot +
  theme_tufte(base_size = 18) +
  labs(x = "Wave",
       y = "Frequency",
       fill = "Outcome") +
  scale_fill_viridis_d(direction = -1) 
[Figure: visualized transitions]

Much nicer. Now you can go out in the world with a graph like this…

So here is the full syntax to make the graph once we have the data prepared:

# full graph syntax
data4 %>%
  ggplot(aes(
    x = wave,
    stratum = key,
    alluvium = id,
    y = n,
    fill = key
  )) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  theme_tufte(base_size = 18) +
  labs(x = "Wave",
       y = "Frequency",
       fill = "Outcome") +
  scale_fill_viridis_d(direction = -1)



9 thoughts on “Visualizing transitions in time using R”

  1. Caroline

    Hi Alexandru,

    This introduction about creating alluvial graph is very interesting indeed! I would like to explore and replicate your results. I’ve been wondering about the input data structure. Would you mind giving me an example of the dataset? I would really appreciate it!

    Best,
    Caroline

    • Alexandru Cernat (post author)

      Thank you for the comment. The code below should be a full reproducible example using a subsample of 50 cases from the original data.

      # package for data cleaning and graphs
      library(tidyverse)
      # package for alluvial plots
      library(ggalluvial)
      # optional
      # nice themes
      library(ggthemes)
      # nice colors
      library(viridis)

      # toy data
      data <- structure( list( id = 1:50, out_5 = c( "F2f", "F2f", "Web", "F2f", "Web", "Web", "F2f", "Web", "F2f", "Web", "Web", "F2f", "F2f", "F2f", "F2f", "Other non-resp", "Web", "F2f", "F2f", "F2f", "Web", "Other non-resp", "Web", "F2f", "Web", "F2f", "F2f", "Web", "Other non-resp", "F2f", "Web", "Other non-resp", "F2f", "Refusal", "Web", "F2f", "Web", "Web", "Web", "Web", "Web", "F2f", "Web", "Web", "Other non-resp", "Other non-resp", "Web", "Web", "F2f", "Web" ), out_6 = c( "F2f", "F2f", "Web", "Web", "Web", "Web", "F2f", "Web", "Other non-resp", "Web", "Web", "F2f", "Refusal", "Web", "Web", "F2f", "Web", "Refusal", "Not-issued", "Web", "F2f", "Other non-resp", "Web", "Other non-resp", "Web", "F2f", "Web", "Other non-resp", "Other non-resp", "F2f", "Web", "Refusal", "Other non-resp", "Refusal", "Web", "Web", "Refusal", "Web", "F2f", "Web", "Web", "F2f", "Refusal", "Web", "F2f", "F2f", "Web", "Web", "F2f", "Web" ), out_7 = c( "F2f", "F2f", "F2f", "F2f", "Web", "Web", "F2f", "Web", "Refusal", "Web", "Web", "F2f", "Other non-resp", "Other non-resp", "Web", "Web", "Web", "Refusal", "Not-issued", "Web", "Web", "Other non-resp", "Web", "F2f", "Refusal", "Web", "Web", "Not-issued", "Other non-resp", "F2f", "Web", "Other non-resp", "Not-issued", "Other non-resp", "Web", "Web", "Other non-resp", "Web", "Web", "Web", "Web", "Web", "Web", "Web", "Refusal", "Refusal", "F2f", "Web", "Other non-resp", "Other non-resp" ), out_8 = c( "F2f", "F2f", "F2f", "F2f", "Web", "Web", "F2f", "Web", "F2f", "Web", "Web", "Web", "Not-issued", "Not-issued", "Web", "Web", "Web", "Not-issued", "Not-issued", "Web", "Web", "Other non-resp", "Web", "F2f", "Refusal", "Other non-resp", "Web", "Not-issued", "Other non-resp", "F2f", "Web", "Not-issued", "Not-issued", "Refusal", "Web", "Web", "Not-issued", "Web", "Web", "Web", "Web", "Web", "Web", "Web", "F2f", "Refusal", "Web", "Web", "Other non-resp", "Other non-resp" ), out_9 = c( "Not-issued", "F2f", "Other non-resp", "Web", "Web", "Web", 
"F2f", "Web", "Web", "Web", "Web", "Web", "Not-issued", "Not-issued", "Web", "Not-issued", "Web", "Not-issued", "Not-issued", "Web", "Web", "Other non-resp", "Web", "Not-issued", "Other non-resp", "Web", "Web", "Not-issued", "Other non-resp", "Other non-resp", "Web", "Not-issued", "Not-issued", "Refusal", "Web", "Web", "Not-issued", "Web", "Web", "Web", "Web", "Web", "Other non-resp", "Web", "Other non-resp", "Other non-resp", "Web", "Web", "Not-issued", "Not-issued" ) ), row.names = c(NA,-50L), class = "data.frame" )

      # calculate all combinations, make unique id and save
      data2 <- count(data, out_5, out_6, out_7, out_8, out_9) %>%
        mutate(id = row_number())

      # make long data
      data3 <- gather(data2, value, key, -n, -id)

      # clean up data for graph
      data4 <- data3 %>%
        mutate(
          wave = as.numeric(str_remove(value, "out_")),
          key = as.factor(key),
          key = fct_relevel(key, "Web", "F2f", "Refusal",
                            "Other non-resp")
        )

      # full graph syntax
      data4 %>%
        ggplot(aes(
          x = wave,
          stratum = key,
          alluvium = id,
          y = n,
          fill = key
        )) +
        geom_flow() +
        geom_stratum(alpha = .5) +
        theme_tufte(base_size = 18) +
        labs(x = "Wave",
             y = "Frequency",
             fill = "Outcome") +
        scale_fill_viridis_d(direction = -1)


  5. s

    This article is useless until you post the data for people to practice

  6. Mohammad

    Hi Alexandru,

    I would like to use your code for my data. But I have some Not-Issued data and I don’t want to display the Not-issued in the graph. Is there any way for that?

    Best,
    Mohammad

    • Alexandru Cernat (post author)

      Hello,

      You could just filter out the cases that you don’t want early in the process. For example, using the code above you could use the below after you import the data:
      data <- data %>%
        as_tibble() %>%
        filter(out_5 != "Not-issued",
               out_6 != "Not-issued",
               out_7 != "Not-issued",
               out_8 != "Not-issued",
               out_9 != "Not-issued")
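
A note on the filter above: it could probably be written more compactly with if_all(), assuming dplyr 1.0.4 or later is installed. The sketch below uses a made-up dataset (toy and kept are hypothetical names) and keeps only the people who were never “Not-issued” in any of the out_ columns.

```r
library(dplyr)

# hypothetical toy data: 3 people, 2 waves
toy <- data.frame(
  id    = 1:3,
  out_5 = c("F2f", "Not-issued", "Web"),
  out_6 = c("Web", "Web", "Not-issued")
)

# keep rows where every out_ column differs from "Not-issued"
kept <- toy %>%
  filter(if_all(starts_with("out_"), ~ .x != "Not-issued"))

kept
```

Either way, this drops whole cases that were ever not issued, rather than only hiding that category in the graph, so the bars will be based on fewer people.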
