Plotting pixy output

A common method of visualizing population summary statistics is to create scatter or line plots of each statistic along the genome. Examples of how to do this using the tidyverse packge in R. are shown below.

Converting pixy output files to long format

The output files created by pixy are in wide format by default. Plotting in the tidyverse is much easier if data is in long format, and so the function below can be used to convert a list of pixy output files to a single long format data frame.

# Convert a set of pixy output files to a single long-format data frame.
# Handles both single-population stats (pi, watterson_theta, tajima_d) and
# pairwise stats (dxy, fst).
pixy_to_long <- function(pixy_files) {

  single_pop_stats <- c("pi", "watterson_theta", "tajima_d")
  pixy_df <- list()

  for (i in seq_along(pixy_files)) {

    # Extract the suffix from filenames like "pixy_watterson_theta.txt"
    stat_file_type <- sub("\\.txt$", "",
                          sub("^.*?_", "", basename(pixy_files[i])))

    df <- read_delim(pixy_files[i], delim = "\t")

    if (stat_file_type %in% single_pop_stats) {
      df <- df %>%
        gather(-pop, -window_pos_1, -window_pos_2, -chromosome,
               key = "statistic", value = "value") %>%
        rename(pop1 = pop) %>%
        mutate(pop2 = NA)
    } else {
      df <- df %>%
        gather(-pop1, -pop2, -window_pos_1, -window_pos_2, -chromosome,
               key = "statistic", value = "value")
    }

    pixy_df[[i]] <- df
  }

  bind_rows(pixy_df) %>%
    arrange(pop1, pop2, chromosome, window_pos_1, statistic)
}

For example, the below code converts all the pixy files found in the folder 'output' to a single data frame:

pixy_folder <- "output"
pixy_files <- list.files(pixy_folder, full.names = TRUE)
pixy_df <- pixy_to_long(pixy_files)

This results in a data frame (pixy_df) that looks like this:

pop1 pop2 chromosome window_pos_1 window_pos_2         statistic       value
BFS  KES          X            1        10000           avg_dxy 1.519710e-03
BFS  KES          X            1        10000        avg_wc_fst 1.224767e-01
BFS  KES          X            1        10000 count_comparisons 9.436012e+06
BFS  KES          X            1        10000       count_diffs 1.434000e+04
BFS  KES          X            1        10000     count_missing 2.649188e+06
BFS  KES          X            1        10000          no_sites 8.049000e+03
BFS  KES          X            1        10000           no_snps 1.310000e+02
BFS  KES          X        10001        20000           avg_dxy 2.453347e-03
BFS  KES          X        10001        20000        avg_wc_fst 1.479964e-01

Single chromosome view

We are often interested in local patterns of diversity along individual chromosomes. The below code plots the three main summary statistics calculated by pixy (pi, Dxy, FST) along a single chromosome. A custom labelling function is also provided to handle symbols/subscripts in the summary statistic's labels.

# custom labeller for special characters in the pixy summary stats
pixy_labeller <- as_labeller(c(avg_pi              = "pi",
                               avg_dxy             = "D[XY]",
                               avg_wc_fst          = "F[ST]",
                               avg_hudson_fst      = "F[ST]",
                               avg_watterson_theta = "theta[W]",
                               tajima_d            = "Tajima*minute*s~D"),
                             default = label_parsed)

# plotting summary statistics along a single chromosome
pixy_df %>%
  filter(chromosome == 1) %>%
  filter(statistic %in% c("avg_pi", "avg_dxy", "avg_wc_fst",
                          "avg_watterson_theta", "tajima_d")) %>%
  mutate(chr_position = ((window_pos_1 + window_pos_2)/2)/1000000) %>%
  ggplot(aes(x = chr_position, y = value, color = statistic))+
  geom_line(size = 0.25)+
  facet_grid(statistic ~ .,
             scales = "free_y", switch = "x", space = "free_x",
             labeller = labeller(statistic = pixy_labeller,
                                 value = label_value))+
  xlab("Position on Chromosome (Mb)")+
  ylab("Statistic Value")+
  theme_bw()+
  theme(panel.spacing = unit(0.1, "cm"),
        strip.background = element_blank(),
        strip.placement = "outside",
        legend.position = "none")+
  scale_x_continuous(expand = c(0, 0))+
  scale_y_continuous(expand = c(0, 0))+
  scale_color_brewer(palette = "Set1")

This results in the following plot :

A genome-wide plot of summary statistics

We can also visualize patterns of diversity at the genome wide scale. While finer scale patterns are lost, this can be useful for identifying chrosomal scale variation (e.g. depressed diversity on sex chromosomes), or visualizing the distribution of loci of interest (e.g. FST outliers, or GWAS hits). Some common features of these types of plots (alterating chromosome colors, enforced chromosome order, axis limits) are included.

# create a custom labeller for special characters in the pixy summary stats
pixy_labeller <- as_labeller(c(avg_pi              = "pi",
                               avg_dxy             = "D[XY]",
                               avg_wc_fst          = "F[ST]",
                               avg_hudson_fst      = "F[ST]",
                               avg_watterson_theta = "theta[W]",
                               tajima_d            = "Tajima*minute*s~D"),
                             default = label_parsed)

# plotting summary statistics across all chromosomes
pixy_df %>%
  mutate(chrom_color_group = case_when(as.numeric(chromosome) %% 2 != 0 ~ "even",
                                 chromosome == "X" ~ "even",
                                 TRUE ~ "odd" )) %>%
  mutate(chromosome = factor(chromosome, levels = c(1:22, "X", "Y"))) %>%
  filter(statistic %in% c("avg_pi", "avg_dxy", "avg_wc_fst",
                          "avg_watterson_theta", "tajima_d")) %>%
  ggplot(aes(x = (window_pos_1 + window_pos_2)/2, y = value, color = chrom_color_group))+
  geom_point(size = 0.5, alpha = 0.5, stroke = 0)+
  facet_grid(statistic ~ chromosome,
             scales = "free_y", switch = "x", space = "free_x",
             labeller = labeller(statistic = pixy_labeller,
                                 value = label_value))+
  xlab("Chromsome")+
  ylab("Statistic Value")+
  scale_color_manual(values = c("grey50", "black"))+
  theme_classic()+
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.spacing = unit(0.1, "cm"),
        strip.background = element_blank(),
        strip.placement = "outside",
        legend.position ="none")+
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0,NA))

This results in the following plot (using simulated data):