pixy 1.2.7.beta1

What is pixy?¶
pixy is a command line tool for calculating the population genetic summary statistics pi (average per site heterozygosity) and dxy (average number of nucleotide differences between populations per site) from a VCF file.
Many tools for computing pi and dxy from VCFs produce biased estimates in the presence of missing data. This is because these methods often make the simplifying assumption that if a genotype is missing, it is homozygous reference (0/0) by state. See pixy’s paper (https://doi.org/10.1111/1755-0998.13326) for more details.
About pixy¶
Authors: Katharine Korunes and Kieran Samuk, Duke University
pixy is a command-line tool for painlessly and correctly estimating average nucleotide diversity within (π) and between (dxy) populations from a VCF. In particular, pixy facilitates the use of VCFs containing invariant (AKA monomorphic) sites, which are essential for the correct computation of π and dxy:
pixy avoids common pitfalls in computing pi and dxy¶
Population geneticists are often interested in quantifying nucleotide diversity within and nucleotide differences between populations. The two most common summary statistics for these quantities were described by Nei and Li (1979), who discuss summarizing variation in the case of two populations (denoted ‘x’ and ‘y’):
- π - Average nucleotide diversity within populations, also sometimes denoted πx and πy to indicate the population of interest.
- dxy - Average nucleotide difference between populations, also sometimes denoted πxy (pixy, get it?), to indicate that the statistic is a comparison between populations x and y.
Many modern genomics tools calculate π and dxy from data encoded as VCFs, which by design often omit invariant sites. With variants-only VCFs, there is often no way to distinguish missing sites from invariant sites. The schematic below illustrates this problem and how pixy avoids it.

Figure 1 from Korunes & Samuk 2021.
In Case 1, all missing data is assumed to be present but invariant. This results in a deflated estimate of π. In Case 2, missing data are simply omitted from the calculation, both in terms of the number of sites (the final denominator) and the component denominators for each site (the n choose 2 terms). This results in an unbiased estimate of π. The adjusted π method (Case 2) is implemented for VCFs in pixy. Invariant sites are represented as sites with no ALT allele, and greyed-out sites are those that failed to pass a genotype filter requiring a minimum number of reads covering the site (Depth>=10 in this case).
Notable features of pixy¶
- Fast and efficient handing of invariant sites VCFs
- Computation of π and dxy for arbitrary numbers of populations
- Computes all statistics in arbitrarily sized windows, and output contains all raw information for all computations (e.g. numerators and denominators).
- The majority of this is made possible by extensive use of the existing data structures and functions found in the brilliant python library scikit-allel
Installation¶
Via Anaconda¶
pixy is currently available for installation on Linux/OSX systems via conda-forge. You can install it using following two conda install commands:
conda install -c conda-forge pixy
conda install -c bioconda htslib
You can test the installation by running:
pixy --help
Note
- For information on installing conda:
- anaconda (more features and initial modules): https://docs.anaconda.com/anaconda/install/ miniconda (lighter weight): https://docs.conda.io/en/latest/miniconda.html
Via Git or Download¶
Alternatively, you can clone from Github, with manual dependency installation: https://github.com/ksamuk/pixy
Arguments¶
Below is a list of required and optional arguments that pixy accepts.
Required:¶
- --stats [pi|fst|dxy]
- Which statistics to calculate from the VCF (pi, dxy, and/or fst, separated by spaces)
- --vcf [path/to/vcf.vcf.gz]
- Path to the input VCF (bgzipped and tabix indexed).
- --populations [path/to/populations_file.txt]
- Path to the populations file. See quick start for format.
In addition, one of either:¶
- --window_size [integer]
- Window size in base pairs over which to calculate pi/dxy.
- --bed_file [path/to/populations_file.txt]
- Path to a headerless .BED file containing regions (chrom, chromStart, chromEnd) over which to calculate summary statistics
Optional arguments:¶
- --n_cores [integer]
- Number of CPUs to utilize for parallel processing (default=1).
- --output_folder [path/to/output/folder]
- Folder where output will be written, e.g. path/to/output_folder, defaults to current working directory.
- --output_prefix [prefix]
- Optional prefix for output file(s), e.g. ‘output’ will result in writing to [output folder]/output_pi.txt, defaults to ‘pixy’.
- --chromosomes [‘list,of,chromosomes’]
- A single-quoted, comma separated list of chromosome(s) (e.g. ‘X,1,2’). Defaults to all chromosomes in the VCF.
- --interval_start [integer]
- The start of a specific interval over which to calculate pi/dxy. Only valid when calculating over a single chromosome. Defaults to 1.
- --interval_end [integer]
- The end of a specific interval over which to calculate pi/dxy. Only valid when calculating over a single chromosome. Defaults to the last position for a chromosome.
- --sites_file [path/to/sites_file.txt]
- Path to a tab separated file containing a headerless list of sites (CHROM, POS) to (exclusively) include in calculations
- --chunk_size [integer]
- Approximate number of sites to read from VCF at any given time. Defaults to 100000. Smaller numbers can reduce memory use.
- --fst_type [wc|hudson]
- FST estimator to use, one of either: ‘wc’ (Weir and Cockerham 1984) or ‘hudson’ (Hudson 1992, Bhatia et al. 2013). Defaults to ‘wc’
- --bypass_invariant_check [no|yes]
- Bypass the check for invariant sites. Use with caution!
- --version
- Print the pixy version number.
- --citation
- Print the citation for pixy.
- --help
- Print the help message.
- --silent
- Suppress all console output (flag, no value required).
An example:
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--window_size 10000 \
--n_cores 4 \
--output_folder output \
--output_prefix pixy_output
Companion files¶
Populations file¶
Description
The populations file is used to assign individuals to populations. For pairwise statistics (dxy and fst), all pairwise comparisons between populations are generated. It is specified at the command line with the --populations
argument.
Format
A headerless, tab separated file with the first column containing the sample name (as represented in the VCF), and the second column containing an arbitary population ID.
Example
ERS223827 BFS
ERS223759 BFS
ERS223750 BFS
ERS223967 AFS
ERS223970 AFS
ERS223924 AFS
ERS224300 AFS
ERS224168 KES
ERS224314 KES
BED file¶
Description
The BED file is used to manually defined genome regions over which to calculate summary statistics. It is specified at the command line with the --bed_file
.
Note that pixy expects intervals to be one indexed, not zero indexed as many typical BED files are. If your BED file is zero indexed, you’ll need to convert the intervals (e.g. by adding 1 to the start and end).
Format
A headerless, tab separated file with the first column containing a chromosome ID, the second column containing the first position of a window, and the third column containing the last potion of the window. Each row = one window.
Example
X 1 10000
X 10001 20000
X 20001 30000
X 30001 40000
X 40001 50000
X 50001 60000
1 1 10000
1 10001 20000
1 20001 30000
1 30001 40000
1 40001 50000
1 50001 60000
Sites file¶
Description
The sites is used to restrict calculations to specific sites (e.g. 4-fold degenerate sites, introns, etc.). It is specified at the command line with the --sites_file
.
Format
A headerless, tab separated file with the first column containing a chromosome ID (CHROM), the second column containing the position (POS) of a single site. Each row = one site.
Example
X 1
X 2
X 3
X 4
X 5
X 6
X 7
X 8
X 9
X 10
X 11
X 12
X 13
X 14
X 15
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
1 13
1 14
1 15
Changelog¶
Explanations of major changes to pixy are listed below.
For up to date info on minor versions and bugfixes, see the current release changelog at https://github.com/ksamuk/pixy/releases
pixy 1.0.0¶
To conincide with the publication of the pixy manuscript, we’re very happy to announce the release pixy version 1.0.0!
This is a major update to pixy, and includes number of major performance increases, new features, simplifications, and many minor fixes. Note that this version contains breaking changes, and old pipelines will need to be updated. We have also validated that the estimates of pi, dxy and fst produced by 1.0.0 are indentical to 0.95.02 (the verison used in the manuscript).
Summary of Major Changes¶
- All calculations are now much faster and natively parallelizable
- Memory usage vastly reduced
- BED and sites file support allows huge flexibility in windows/targeting sites of different classses of genomic elements
- Genotype filtration has been removed
- No change in the core summary statistics (pi, dxy, fst) produced by pixy
- htslib is now a hard dependency, and must be installed separately
- VCFs will need to be compressed with bgzip and indexed with tabix (from htslib) before using pixy
The performance increase and stability of numerical results can be seen in the following plots:


Detailed changelog¶
Major changes¶
- pixy calculations can now be fully parallelized by specifying
--n_cores [number of cores]
at the command line.
- Implemented using the multiprocessing module, which is now a hard dependency.
- Supported under both Linux and MacOS (using fork and spawn modes respectively).
- We’ve vectorized many of the core computations performed by pixy using numpy, resulting in significant performance gains.
- The memory usage of pixy is now vastly lower, more intelligently handled, and configurable by the user (via the –chunk_size argument).
- Large windows (e.g. whole chromosomes) are dynamically split into chunks and reassembled after summarization.
- Small windows are assigned to larger chunks to prevent I/O bottlenecks associated with frequently re-reading the source VCF.
New features¶
- Support for BED files specifying windows over which to calculate pi/dxy/fst. These windows can be heterogenous in size.
- This enables precisely matching pixy output with the output of e.g. another program
- Support for a tab-separate ‘sites file’ specifying sites (CHROM, POS) where summary statistics should be exclusively calculated
- This also enables e.g. estimates of pi using only 4-fold degenerate sites, or for only a particular class of genes, etc.
- Basic support for site-level statistics (1bp scale, but note that these are much slower to calculate compared to windowed statistcs)
Removed features¶
- pixy no longer makes use of a Zarr database for storing on-disk intermediate genotype information. We instead now perform random access of the VCF via tabix from htslib as implemented in scikit-allel. As such, htslib is now a hard dependency. We think tabix is a much more flexible system for many datasets, and the performance differences are negliable (and offset by the new performance features in v1.0). VCFs will need to be compressed with bgzip and indexed with tabix before using pixy.
- Other than requiring all variants to be biallelic SNPs, pixy no longer performs any filtration of any kind. We decided that filtration was outside the scope of the functionality we wanted pixy to have. Further, there are many excellent existing tools that perform filtration already and we felt we were “reinventing the wheel”. Further, pre-filtering creates a filtered VCF that can be used for other analyses, which users likely will want to do. We now strongly reccomend that users pre-filter their invariant sites VCFs using VCFtools and/or BCFtools. We provide an example shell script with this functionality (retaining invariant sites as required) as a template for users to edit for their needs.
Minor updates¶
- The pre-calculation checks performed by pixy are now more extensive and systematic.
- The method for calculating the number of valid sites has been slightly ajusted to be more accurate (this was calculated independantly of the pi/dxy/fst statistics).
- We’ve refactored and restructured much of the code, with a focus on increased functionalization. This should make community contributions and future updates much easier.
- To reduce confusion, output prefix and output folder are now separate arguments.
- The documentation for pixy as been extensively updated to reflect the new changes in version 1.0.0.
Other Bugfixes¶
- Total computation time is now properly displayed (issue ref).
- For FST: regions with no variant sites will now have “NA” in the output file, instead of not being represented.
Generating invariant sites VCFs¶
pixy facilitates the correct computation of π and dxy by allowing users to input data in a standard file format: Variant Call Format (VCF). AllSites VCFs contain invariant (AKA monomorphic) sites in addition to variant sites. Several commonly used, well-documented programs can generate AllSites VCFs.
Here, we provide a quickstart guide for generating AllSites VCFs using two of the most widely-used tools for variant discovery and manipulation: BCFtools and GATK. Either of these tools can be run given only a set of aligned data (BAM files) and the reference sequence used to align them.
BCFtools and GATK are also well-equipped to filter VCFs, and we recommend taking advantage of this to filter your data prior to analysis with pixy.
Note
Utilizing genomic intervals for improved runtime: If generation of an AllSites VCF is time-consuming, we recommend parallelizing your pipeline by breaking analyses down into smaller genomic regions. In our test datasets, we ran individual chromosomes in parallel. Depending on the size of chromosomes in a dataset, it may be beneficial to break down chromosomes into smaller intervals for variant calling. Genomic intervals can be specified using the -L parameter in GATK or the -r parameter in bcftools mpileup.
Generating AllSites VCFs using BCFtools (mpileup/call)¶
BCFtools mpileup can be used to produce genotype likelihoods, and this operation can be followed by bcftools call to call SNPs/INDELS. BCFtools offers a number of flexible options documented here: https://samtools.github.io/bcftools/bcftools-man.html#call
In this example, we call mpileup and pipe the output to call variants and generate and AllSites VCF:
bcftools mpileup -f <reference.fa> -b <bamlist.txt> -r <X> | bcftools call -m -Oz -f GQ -o <output>
Notes on the options selected here:
- b points mpileup to a list of BAM files contained in a text file.
- r specifies the genomic region. In this example, we specify the X chromosome, but mpileup provides a variety of options for specifying the region (CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,…]). Alternatively, -R <file> can be used to read regions from a provided file.
- Oz specifies compressed VCF output
- f GQ indicates that the FORMAT field GQ should be output for each sample
Generating AllSites VCFs using GATK¶
GATK recommends first calling variants per-sample using HaplotypeCaller in GVCF mode (Step 1 below). Next, GenomicsDBImport consolidates information from GVCF files across samples to improve the efficiency joint genotyping (Step 2 below). In the 3rd step, GenotypeGVCFs produces a set of jointly-called SNPs and INDELS ready for filtering and analysis. We recommend consulting the full GATK documentation found here: https://software.broadinstitute.org/gatk/
Step 1 - HaplotypeCaller (this step should be run for each BAM file):
gatk-4.0.7.0/gatk --java-options "-Xmx4G" HaplotypeCaller \
-R <reference.fa> -I <input.bam> -O <output.g.vcf> -ERC GVCF -L <X>
Step 2 - GenomicsDBImport:
gatk-4.0.7.0/gatk --java-options "-Xmx4g" GenomicsDBImport \
-V $file1 -V $file2 --genomicsdb-workspace-path <allsamples_genomicsdb> \
-L <X>
Step 3 - GenotypeGVCFs:
gatk-4.0.7.0/gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R <reference.fa> -V gendb://<allsamples_genomicsdb> \
-all-sites -L <X> -O <output-allsites.vcf.gz>
Notes on the options selected above:
- ERC GVCF sets the mode for emitting reference confidence scores. GVCF format specifies condensed non-variant blocks.
- L specifies the genomic region. In this example, we specify the X chromosome
- V specifies the variant data for input. In the case of GenomicsDBImport, this is GVCF file(s). In the case of GenotypeGVCFs, the variant data for joint genotyping is provided as a GenomicsDB workspace created with GenomicsDBImport in the previous step.
- all-sites indicates that the final VCF should include loci found to be non-variant after genotyping. The most important parameter for our purposes.
Important: In GATK v4.x, we have found that you must specify an interval (-L) when running GenotypeGVCFs. Without a designated interval, it appears to encounter missing reference confidence blocks, causing it to fail. This is true even when the problematic blocks are outside of the GenomicsDB interval being passed to it. We recommend always specifying an interval (-L) to avoid such issues.
Note
GATK versions: In the example above, we use GATK v4, but AllSites VCFs can also be easily generated in GATK v3.x by running GenotypeGVCFs with the “-allSites” parameter. (Note the slightly different syntax from “-all-sites” in GATK v4).
Step by Step Installation and Usage¶
Note
pixy is currently only available for Linux and macOS systems.
1. Generate a VCF with Invariant Sites and perform filtering¶
If you did not already generate an ‘allsites’ VCF (VCF with invariant sites), see the guide here.
Note
When working with whole genome data, we suggest you generate separate invariant sites VCFs for each chromosome. This is to prevent memory limitation issues down the line. This is less of an issue for reduced representation sequencing, naturally.
We recommend using the standard tools dedicated to performing such operations on VCFs: VCFtools and BCFtools (both available on bioconda).
Site-level filtration¶
A full treatment of VCF filtering for population genetics is beyond our scope here, but for a good start see: https://speciationgenomics.github.io/filtering_vcfs/.
At minumum, we reccomend filtering your VCF (a) using the GATK best practices site filtration procedure (if applicable) and (b) performing further filtering for missingness, site quality score, and mean depth (minimum and maxmium).
Here is an example using VCFtools. The specific values (especially for min/max-meanDP) will vary based on your dataset:
vcftools --gzvcf my_vcf.vcf.gz \
--remove-indels \
--max-missing 0.8 \
--min-meanDP 20 \
--max-meanDP 500 \
--recode --stdout | gzip -c > my_filtered_vcf.vcf.gz
Note
As of GATK version 4.2.0.0, GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. If you wish to filter on QUAL (–minQ), invariant and variant sites will need to be filtered separately, with the QUAL filter only applied to variant sites (see below for details).
Optional: Population genetic filters¶
Depending on your goal, you might also consider filtering out sites with strong HWE violations (try –hwe 0.001 with VCFtools), unusually high observed heterozygosity, or allelic depth imbalances. See this paper https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12613 for more details on these considerations.
These last two considerations are particularly important if your study organism has high levels of gene duplication (e.g. re-diploidized after whole genome duplication as in many plant and fish species).
If your VCF contains both variant and invariant sites (as it should at this point), applying population genetic based filters will result in the loss of your invariant sites. To avoid this, filter the invariant and variant sites separately and concatenate the two resulting files. Below is an example of one way to achieve this using VCFtool and BCFtools:
#!/bin/bash # requires bcftools/bgzip/tabix and vcftools # create a filtered VCF containing only invariant sites vcftools --gzvcf test.vcf.gz \ --max-maf 0 \ [add other filters for invariant sites here] \ --recode --stdout | bgzip -c > test_invariant.vcf.gz # create a filtered VCF containing only variant sites vcftools --gzvcf test.vcf.gz \ --mac 1 \ [add other filters for variant sites here] \ --recode --stdout | bgzip -c > test_variant.vcf.gz # index both vcfs using tabix tabix test_invariant.vcf.gz tabix test_variant.vcf.gz # combine the two VCFs using bcftools concat bcftools concat \ --allow-overlaps \ test_variant.vcf.gz test_invariant.vcf.gz \ -O z -o test_filtered.vcf.gz
2. Install Anaconda¶
If you haven’t already, install Anaconda https://docs.anaconda.com/anaconda/install/
3. Create a New Environment¶
Create and activate a new conda environment for working with pixy:
conda create --name pixy
conda activate pixy
4. Install pixy¶
Install pixy via the conda-forge channel. Also install the required htslib package from bioconda.
conda install --yes -c conda-forge pixy
conda install --yes -c bioconda htslib
To see a list of arguments and test the pixy installation, type:
pixy --help
5. Create a populations file¶
Create a populations file. This is a headerless, tab-separated file where the first column contains sample names (exactly as represented in the VCF), and the second column contains population names (these can be anything, but should be consistent!).
For example:
ERS223827 BFS
ERS223759 BFS
ERS223750 BFS
ERS223967 AFS
ERS223970 AFS
ERS223924 AFS
ERS224300 AFS
ERS224168 KES
ERS224314 KES
6. Compress and Index your VCF¶
If you have not already, use bgzip and tabix to compress and index your VCF:
bgzip [your.file.vcf]
tabix [your.file.vcf.gz]
7. Run pixy¶
Run pixy! An example is shown below.
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations data/vcf/ag1000/Ag1000_sampleIDs_popfile.txt \
--window_size 10000 \
--n_cores 4 \
--chromosomes 'X'
Note
pixy ignores non-biallelic sites and INDELs, even if they are left in the VCF after pre-filtering.
8. Profit¶
Parse the output files and enjoy your unbiased estimates of pi and dxy!
9. Stay up to date¶
You can keep pixy up to date by re-running:
conda install --yes -c conda-forge pixy
You can check that you have the latest version via:
pixy --version
And comparing the version number to the one listed here: https://anaconda.org/conda-forge/pixy.
Usage Examples¶
Basic usage¶
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--window_size 10000 \
--n_cores 2
Focusing on a specific genomic interval¶
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--chromosomes 'X' \
--interval_start 15000 \
--interval_end 400000 \
--window_size 10000
Using a .BED file to define windows¶
When defined with a BED file, windows can be any size, and non-uniform.
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--bed_file genomic_windows.bed
Using a sites file to limit calculations to specific sites¶
Sites files define individual sites that will be included in the calculations. This can be used to obtain 4-fold degenerate pi, or statistics for particular classes of genomic elements (genes, introns, etc.).
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--sites_file valid_sites.txt
Site-level estimates¶
Note: site level estimates will be much slower to calculate than windowed estimates.
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations Ag1000_sampleIDs_popfile.txt \
--window_size 1 \
--chromosomes 'X' \
--interval_start 15000 \
--interval_end 400000
Understanding pixy output¶
Output file contents¶
pixy outputs a slightly different file type for each summary statistic it calculates. The contents of the columns of these output files are detailed below.
Within population nucleotide diversity (pi)¶
pop
- The ID of the population from the population file
chromosome
- The chromosome/contig
window_pos_1
- The first position of the genomic window
window_pos_2
- The last position of the genomic window
avg_pi
- Average per site nucleotide diversity for the window. More specifically, pixy computes the weighted average nucleotide diversity per site for all sites in the window, where the weights are determined by the number of genotyped samples at each site.
no_sites
- The total number of sites in the window that have at least one valid genotype. This statistic is included for the user, and not directly used in any calculations.
count_diffs
- The raw number of pairwise differences between all genotypes in the window. This is the numerator of avg_pi.
count_comparisons
- The raw number of non-missing pairwise comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and both were valid). This is the denominator of avg_pi.
count_missing
- The raw number of missing pairwise comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and at least one was missing).
Between population nucleotide divergence (dxy)¶
pop1
- The ID of the first population from the population file.
pop2
- The ID of the second population from the population file.
chromosome
- The chromosome/contig.
window_pos_1
- The first position of the genomic window.
window_pos_2
- The last position of the genomic window.
avg_dxy
- Average per site nucleotide divergence for the window.
no_sites
- The total number of sites in the window that have at least one valid genotype in both populations. This statistic is included for the user, and not directly used in any calculations.
count_diffs
- The raw number of pairwise, cross-population differences between all genotypes. This is the numerator of avg_dxy.
count_comparisons
- The raw number of non-missing pairwise cross-population comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and both were valid). This is the denominator of avg_dxy.
count_missing
- The raw number of missing pairwise cross-population comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and at least one was missing). This statistic is included for the user, and not directly used in any calculations.
Weir and Cockerham’s FST (fst)¶
pop1
- The ID of the first population from the population file.
pop2
- The ID of the second population from the population file.
chromosome
- The chromosome/contig.
window_pos_1
- The first position of the genomic window.
window_pos_2
- The last position of the genomic window.
avg_wc_fst
- Average Weir and Cockerham’s FST for the window (per SNP, not per site).
no_snps
- Total number of variable sites (SNPs) in the window.
Working with pixy output data¶
Plotting results¶
# Example R Script for simple output plots
# Here, we use pi and dxy output files directly from pixy.
library(ggplot2)
# Provide path to input. Can be pi or Dxy.
# NOTE: this is the only line you should have to edit to run this code:
inp<-read.table("pixy_dxy.txt",sep="\t",header=T)
# Find the chromosome names and order them: first numerical order, then any non-numerical chromosomes
# e.g., chr1, chr2, chr22, chrX
chroms <- unique(inp$chromosome)
chrOrder <- sort(chroms)
inp$chrOrder <- factor(inp$chromosome,levels=chrOrder)
# Plot pi for each population found in the input file
# Saves a copy of each plot in the working directory
if("avg_pi" %in% colnames(inp)){
pops <- unique(inp$pop)
for (p in pops){
thisPop <- subset(inp, pop == p)
# Plot stats along all chromosomes:
popPlot <- ggplot(thisPop, aes(window_pos_1, avg_pi, color=chrOrder)) +
geom_point()+
facet_grid(. ~ chrOrder)+
labs(title=paste("Pi for population", p))+
labs(x="Position of window start", y="Pi")+
scale_color_manual(values=rep(c("black","gray"),ceiling((length(chrOrder)/2))))+
theme_classic()+
theme(legend.position = "none")
ggsave(paste("piplot_", p,".png", sep=""), plot = popPlot, device = "png", dpi = 300)
}
} else {
print("Pi not found in this file")
}
# Plot Dxy for each combination of populations found in the input file
# Saves a copy of each plot in the working directory
if("avg_dxy" %in% colnames(inp)){
# Get each unique combination of populations
pops <- unique(inp[c("pop1", "pop2")])
for (p in 1:nrow(pops)){
combo <- pops[p,]
thisPop <- subset(inp, pop1 == combo$pop1[[1]] & pop2 == combo$pop2[[1]])
# Plot stats along all chromosomes:
popPlot <- ggplot(thisPop, aes(window_pos_1, avg_dxy, color=chrOrder)) +
geom_point()+
facet_grid(. ~ chrOrder)+
labs(title=paste("Dxy for", combo$pop1[[1]], "&", combo$pop2[[1]]))+
labs(x="Position of window start", y="Dxy")+
theme(legend.position = "none")+
scale_color_manual(values=rep(c("black","gray"),ceiling((length(chrOrder)/2))))+
theme_classic()+
theme(legend.position = "none")
ggsave(paste("dxyplot_", combo$pop1[[1]], "_", combo$pop2[[1]],".png", sep=""), plot = popPlot, device = "png", dpi = 300)
}
} else {
print("Dxy not found in this file")
}
Post-hoc aggregating¶
Note that if the user wishes to combine information across windows (e.g. by averaging) after the fact, they should sum the raw counts and recompute the differences/comparisons ratios, and not take an average of the summary statistics themselves.
For example, to get average pi or dxy for two windows, the correct forumla is:
(window 1 count_diffs + window 2 count_diffs) / (window 1 comparisons + window 2 comparisons)
Plotting pixy output¶
A common method of visualizing population summary statistics is to create scatter or line plots of each statistic along the genome. Examples of how to do this using the tidyverse packge in R. are shown below.
Converting pixy output files to long format¶
The output files created by pixy are in wide format by default. Plotting in the tidyverse is much easier if data is in long format, and so the function below can be used to convert a list of pixy output files to a single long format data frame.
pixy_to_long <- function(pixy_files){
pixy_df <- list()
for(i in 1:length(pixy_files)){
stat_file_type <- gsub(".*_|.txt", "", pixy_files[i])
if(stat_file_type == "pi"){
df <- read_delim(pixy_files[i], delim = "\t")
df <- df %>%
gather(-pop, -window_pos_1, -window_pos_2, -chromosome,
key = "statistic", value = "value") %>%
rename(pop1 = pop) %>%
mutate(pop2 = NA)
pixy_df[[i]] <- df
} else{
df <- read_delim(pixy_files[i], delim = "\t")
df <- df %>%
gather(-pop1, -pop2, -window_pos_1, -window_pos_2, -chromosome,
key = "statistic", value = "value")
pixy_df[[i]] <- df
}
}
bind_rows(pixy_df) %>%
arrange(pop1, pop2, chromosome, window_pos_1, statistic)
}
For example, the below code converts all the pixy files found in the folder ‘output’ to a single data frame:
pixy_folder <- "output"
pixy_files <- list.files(pixy_folder, full.names = TRUE)
pixy_df <- pixy_to_long(pixy_files)
This results in a data frame (pixy_df) that looks like this:
pop1 pop2 chromosome window_pos_1 window_pos_2 statistic value
BFS KES X 1 10000 avg_dxy 1.519710e-03
BFS KES X 1 10000 avg_wc_fst 1.224767e-01
BFS KES X 1 10000 count_comparisons 9.436012e+06
BFS KES X 1 10000 count_diffs 1.434000e+04
BFS KES X 1 10000 count_missing 2.649188e+06
BFS KES X 1 10000 no_sites 8.049000e+03
BFS KES X 1 10000 no_snps 1.310000e+02
BFS KES X 10001 20000 avg_dxy 2.453347e-03
BFS KES X 10001 20000 avg_wc_fst 1.479964e-01
Single chromosome view¶
We are often interested in local patterns of diversity along individual chromosomes. The below code plots the three main summary statistics calculated by pixy (pi, Dxy, FST) along a single chromosome. A custom labelling function is also provided to handle symbols/subscripts in the summary statistic’s labels.
# custom labeller for special characters in pi/dxy/fst
pixy_labeller <- as_labeller(c(avg_pi = "pi",
avg_dxy = "D[XY]",
avg_wc_fst = "F[ST]"),
default = label_parsed)
# plotting summary statistics along a single chromosome
pixy_df %>%
filter(chromosome == 1) %>%
filter(statistic %in% c("avg_pi", "avg_dxy", "avg_wc_fst")) %>%
mutate(chr_position = ((window_pos_1 + window_pos_2)/2)/1000000) %>%
ggplot(aes(x = chr_position, y = value, color = statistic))+
geom_line(size = 0.25)+
facet_grid(statistic ~ .,
scales = "free_y", switch = "x", space = "free_x",
labeller = labeller(statistic = pixy_labeller,
value = label_value))+
xlab("Position on Chromosome (Mb)")+
ylab("Statistic Value")+
theme_bw()+
theme(panel.spacing = unit(0.1, "cm"),
strip.background = element_blank(),
strip.placement = "outside",
legend.position = "none")+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = c(0, 0))+
scale_color_brewer(palette = "Set1")
This results in the following plot :

A genome-wide plot of summary statistics¶
We can also visualize patterns of diversity at the genome wide scale. While finer scale patterns are lost, this can be useful for identifying chrosomal scale variation (e.g. depressed diversity on sex chromosomes), or visualizing the distribution of loci of interest (e.g. FST outliers, or GWAS hits). Some common features of these types of plots (alterating chromosome colors, enforced chromosome order, axis limits) are included.
# create a custom labeller for special characters in pi/dxy/fst
pixy_labeller <- as_labeller(c(avg_pi = "pi",
avg_dxy = "D[XY]",
avg_wc_fst = "F[ST]"),
default = label_parsed)
# plotting summary statistics across all chromosomes
pixy_df %>%
mutate(chrom_color_group = case_when(as.numeric(chromosome) %% 2 != 0 ~ "even",
chromosome == "X" ~ "even",
TRUE ~ "odd" )) %>%
mutate(chromosome = factor(chromosome, levels = c(1:22, "X", "Y"))) %>%
filter(statistic %in% c("avg_pi", "avg_dxy", "avg_wc_fst")) %>%
ggplot(aes(x = (window_pos_1 + window_pos_2)/2, y = value, color = chrom_color_group))+
geom_point(size = 0.5, alpha = 0.5, stroke = 0)+
facet_grid(statistic ~ chromosome,
scales = "free_y", switch = "x", space = "free_x",
labeller = labeller(statistic = pixy_labeller,
value = label_value))+
xlab("Chromsome")+
ylab("Statistic Value")+
scale_color_manual(values = c("grey50", "black"))+
theme_classic()+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.spacing = unit(0.1, "cm"),
strip.background = element_blank(),
strip.placement = "outside",
legend.position ="none")+
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), limits = c(0,NA))
This results in the following plot (using simulated data):

How should I cite pixy?¶
If you use pixy in your research, please cite the manuscript below, as well the Zenodo DOI of specific version of pixy used for your project.
Manuscript: Korunes, K.L. and Samuk, K. (2021), pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Molecular Ecology Resources. Accepted Author Manuscript. https://doi.org/10.1111/1755-0998.13326
Zenodo DOI for various versions of pixy: Go to https://zenodo.org/record/4432294 and find the DOI that matches the version used (the current version is shown first).