Step by Step Installation and Usage

Note

pixy is currently only available for Linux and macOS systems.

1. Generate a VCF with Invariant Sites and perform filtering

If you did not already generate an 'allsites' VCF (VCF with invariant sites), see the guide here.

Note

When working with whole genome data, we suggest you generate separate invariant sites VCFs for each chromosome. This is to prevent memory limitation issues down the line. This is less of an issue for reduced representation sequencing, naturally.

We recommend using the standard tools dedicated to performing such operations on VCFs: VCFtools and BCFtools (both available on bioconda). The examples below use VCFtools because its filter flags are the most legible, but BCFtools (bcftools view, bcftools filter) can perform all of the same operations and is the more actively-maintained tool — use whichever you prefer.

Site-level filtration

A full treatment of VCF filtering for population genetics is beyond our scope here, but for a good start see: https://speciationgenomics.github.io/filtering_vcfs/.

At minimum, we recommend filtering your VCF (a) using the GATK best practices site filtration procedure (if applicable) and (b) performing further filtering for missingness, site quality score, and mean depth (minimum and maximum).

Here is an example using VCFtools. The specific values (especially for min/max-meanDP) will vary based on your dataset:

vcftools --gzvcf my_vcf.vcf.gz \
--remove-indels \
--max-missing 0.8 \
--min-meanDP 20 \
--max-meanDP 500 \
--recode --stdout | bgzip -c > my_filtered_vcf.vcf.gz

Note

GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. If you wish to filter on QUAL (--minQ), invariant and variant sites must be filtered separately, with the QUAL filter applied only to variant sites (see below for details).

Optional: Population genetic filters

Depending on your goal, you might also consider filtering out sites with strong HWE violations (try --hwe 0.001 with VCFtools), unusually high observed heterozygosity, or allelic depth imbalances. See this paper https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12613 for more details on these considerations.

These last two considerations are particularly important if your study organism has high levels of gene duplication (e.g. re-diploidized after whole genome duplication as in many plant and fish species).

If your VCF contains both variant and invariant sites (as it should at this point), applying population genetic based filters will result in the loss of your invariant sites. To avoid this, filter the invariant and variant sites separately and concatenate the two resulting files. Below is an example of one way to achieve this using VCFtools and BCFtools:

#!/bin/bash
# requires bcftools/bgzip/tabix and vcftools

# create a filtered VCF containing only invariant sites
vcftools --gzvcf test.vcf.gz \
--max-maf 0 \
[add other filters for invariant sites here] \
--recode --stdout | bgzip -c > test_invariant.vcf.gz

# create a filtered VCF containing only variant sites
vcftools --gzvcf test.vcf.gz \
--mac 1 \
[add other filters for variant sites here] \
--recode --stdout | bgzip -c > test_variant.vcf.gz

# index both vcfs using tabix
tabix test_invariant.vcf.gz
tabix test_variant.vcf.gz

# combine the two VCFs using bcftools concat
bcftools concat \
--allow-overlaps \
test_variant.vcf.gz test_invariant.vcf.gz \
-O z -o test_filtered.vcf.gz

2. Install a conda distribution

If you haven't already, install Miniforge (https://github.com/conda-forge/miniforge) or Miniconda (https://docs.anaconda.com/miniconda/). We recommend Miniforge — it ships with the conda-forge channel pre-configured, is fully free for any use, and has a smaller footprint than the full Anaconda distribution. The full Anaconda distribution will also work but has more restrictive licensing for commercial/large-organization use.

3. Create a New Environment

Create and activate a new conda environment for working with pixy. pixy supports Python 3.9, 3.10 and 3.11 (3.12 is not yet supported):

conda create -n "pixy" python=3.11
conda activate pixy

4. Install pixy

Install pixy via the conda-forge channel, plus samtools from bioconda for the bgzip/tabix tools you'll need to prepare your VCF.

conda install --yes -c conda-forge pixy
conda install --yes -c bioconda samtools

(samtools pulls in htslib as a dependency, so there is no need to install it separately.)

To see a list of arguments and test the pixy installation, type:

pixy --help

Note

If you have trouble installing in a Python 3.11 environment, try rolling back to Python 3.9.

5. Create a populations file

Create a populations file. This is a headerless, tab-separated file where the first column contains sample names (exactly as represented in the VCF), and the second column contains population names (these can be anything, but should be consistent!).

For example:

ERS223827   BFS
ERS223759   BFS
ERS223750   BFS
ERS223967   AFS
ERS223970   AFS
ERS223924   AFS
ERS224300   AFS
ERS224168   KES
ERS224314   KES

6. Compress and Index your VCF

pixy requires its input VCF to be bgzip-compressed (not plain gzip) and to have a tabix index alongside it. If you have not already done so:

bgzip [your.file.vcf]
tabix -p vcf [your.file.vcf.gz]

If your file was compressed with plain gzip (for example by an earlier step in your pipeline), decompress it and re-compress with bgzip — tabix will refuse to index a plain-gzip file.

7. Run pixy

Run pixy! The example below uses the small simulated D. melanogaster VCF that ships with the docs (see Example data for details on how it was generated and what to expect from it):

pixy --stats pi fst dxy \
--vcf docs/example_data/dmel_2pop.vcf.gz \
--populations docs/example_data/dmel_populations.txt \
--window_size 2000 \
--n_cores 4 \
--chromosomes 'chr2L'

As of pixy 2.0, you can also request Watterson's θ and Tajima's D:

pixy --stats pi fst dxy watterson_theta tajima_d \
--vcf docs/example_data/dmel_2pop.vcf.gz \
--populations docs/example_data/dmel_populations.txt \
--window_size 2000 \
--n_cores 4 \
--chromosomes 'chr2L'

Note

By default, pixy ignores multiallelic sites and INDELs even if they are left in the VCF after pre-filtering. Pass --include_multiallelic_snps to include sites with more than two alleles in the calculation (new in 2.0).

8. Interpret your results

pixy writes one tab-separated file per requested statistic, named <output_prefix>_<stat>.txt (e.g. pixy_pi.txt, pixy_fst.txt). For a column-by-column description of each output file see Understanding pixy output, and for ready-to-run R recipes for plotting π, dxy, and Fst across the genome see Plotting pixy output.

9. Stay up to date

You can keep pixy up to date by re-running:

conda install --yes -c conda-forge pixy

You can check that you have the latest version via:

pixy --version

And comparing the version number to the one listed here: https://anaconda.org/conda-forge/pixy.