Step by Step Installation and Usage¶
Note
pixy is currently only available for Linux and macOS systems.
1. Generate a VCF with Invariant Sites and perform filtering¶
If you did not already generate an ‘allsites’ VCF (VCF with invariant sites), see the guide here.
Note
When working with whole genome data, we suggest you generate separate invariant sites VCFs for each chromosome. This is to prevent memory limitation issues down the line. This is less of an issue for reduced representation sequencing, naturally.
We recommend using the standard tools dedicated to performing such operations on VCFs: VCFtools and BCFtools (both available on bioconda).
Site-level filtration¶
A full treatment of VCF filtering for population genetics is beyond our scope here, but for a good start see: https://speciationgenomics.github.io/filtering_vcfs/.
At minumum, we reccomend filtering your VCF (a) using the GATK best practices site filtration procedure (if applicable) and (b) performing further filtering for missingness, site quality score, and mean depth (minimum and maxmium).
Here is an example using VCFtools. The specific values (especially for min/max-meanDP) will vary based on your dataset:
vcftools --gzvcf my_vcf.vcf.gz \
--remove-indels \
--max-missing 0.8 \
--min-meanDP 20 \
--max-meanDP 500 \
--recode --stdout | gzip -c > my_filtered_vcf.vcf.gz
Note
As of GATK version 4.2.0.0, GenotypeGVCFs only assigns QUAL scores to invariant sites if there are no missing genotypes. If you wish to filter on QUAL (–minQ), invariant and variant sites will need to be filtered separately, with the QUAL filter only applied to variant sites (see below for details).
Optional: Population genetic filters¶
Depending on your goal, you might also consider filtering out sites with strong HWE violations (try –hwe 0.001 with VCFtools), unusually high observed heterozygosity, or allelic depth imbalances. See this paper https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12613 for more details on these considerations.
These last two considerations are particularly important if your study organism has high levels of gene duplication (e.g. re-diploidized after whole genome duplication as in many plant and fish species).
If your VCF contains both variant and invariant sites (as it should at this point), applying population genetic based filters will result in the loss of your invariant sites. To avoid this, filter the invariant and variant sites separately and concatenate the two resulting files. Below is an example of one way to achieve this using VCFtool and BCFtools:
#!/bin/bash # requires bcftools/bgzip/tabix and vcftools # create a filtered VCF containing only invariant sites vcftools --gzvcf test.vcf.gz \ --max-maf 0 \ [add other filters for invariant sites here] \ --recode --stdout | bgzip -c > test_invariant.vcf.gz # create a filtered VCF containing only variant sites vcftools --gzvcf test.vcf.gz \ --mac 1 \ [add other filters for variant sites here] \ --recode --stdout | bgzip -c > test_variant.vcf.gz # index both vcfs using tabix tabix test_invariant.vcf.gz tabix test_variant.vcf.gz # combine the two VCFs using bcftools concat bcftools concat \ --allow-overlaps \ test_variant.vcf.gz test_invariant.vcf.gz \ -O z -o test_filtered.vcf.gz
2. Install Anaconda¶
If you haven’t already, install Anaconda https://docs.anaconda.com/anaconda/install/
3. Create a New Environment¶
Create and activate a new conda environment for working with pixy:
conda create --name pixy
conda activate pixy
4. Install pixy¶
Install pixy via the conda-forge channel. Also install the required htslib package from bioconda.
conda install --yes -c conda-forge pixy
conda install --yes -c bioconda htslib
To see a list of arguments and test the pixy installation, type:
pixy --help
5. Create a populations file¶
Create a populations file. This is a headerless, tab-separated file where the first column contains sample names (exactly as represented in the VCF), and the second column contains population names (these can be anything, but should be consistent!).
For example:
ERS223827 BFS
ERS223759 BFS
ERS223750 BFS
ERS223967 AFS
ERS223970 AFS
ERS223924 AFS
ERS224300 AFS
ERS224168 KES
ERS224314 KES
6. Compress and Index your VCF¶
If you have not already, use bgzip and tabix to compress and index your VCF:
bgzip [your.file.vcf]
tabix [your.file.vcf.gz]
7. Run pixy¶
Run pixy! An example is shown below.
pixy --stats pi fst dxy \
--vcf data/vcf/ag1000/chrX_36Ag_allsites.vcf.gz \
--populations data/vcf/ag1000/Ag1000_sampleIDs_popfile.txt \
--window_size 10000 \
--n_cores 4 \
--chromosomes 'X'
Note
pixy ignores non-biallelic sites and INDELs, even if they are left in the VCF after pre-filtering.
8. Profit¶
Parse the output files and enjoy your unbiased estimates of pi and dxy!
9. Stay up to date¶
You can keep pixy up to date by re-running:
conda install --yes -c conda-forge pixy
You can check that you have the latest version via:
pixy --version
And comparing the version number to the one listed here: https://anaconda.org/conda-forge/pixy.