Changelog
Explanations of major changes to pixy are listed below. For up-to-date
info on minor versions and bugfixes, see the release notes on GitHub:
https://github.com/ksamuk/pixy/releases
pixy 2.0.0
pixy 2.0 is a major release that adds two new estimators, broadens
the range of organisms pixy can be used on, and introduces support
for multiallelic sites. The core π / dxy / FST
behaviour is preserved and remains backward-compatible with 1.x for
biallelic, diploid input.
If you use the new Watterson's θ or Tajima's D estimators, please cite the companion paper:
Bailey, N., Stevison, L., & Samuk, K. (2025). Correcting for bias in estimates of θW and Tajima's D from missing data in next-generation sequencing. Molecular Ecology Resources, e14104. https://doi.org/10.1111/1755-0998.14104
New features
Unbiased estimators of Watterson's θ and Tajima's D. Pass --stats watterson_theta and/or --stats tajima_d to compute them. Both correct for missing data the same way pixy corrects π and dxy. See "Understanding pixy output" for the output column reference.
Multiallelic site support. Pass --include_multiallelic_snps to include sites with more than two alleles. Disabled by default (biallelic mode is slightly faster).
Arbitrary and variable ploidy. pixy now handles organisms of any ploidy, and ploidy may vary across samples and chromosomes (so sex chromosomes and mixed-ploidy datasets work without special handling).
CSI index support. Both .tbi (tabix) and .csi (bcftools index) VCF indexes are accepted.
Hudson's FST. Pass --fst_type hudson to use the Hudson (1992) / Bhatia et al. (2013) estimator instead of the default Weir & Cockerham (1984) estimator.
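As a rough illustration of how the new options combine, the sketch below assembles a pixy 2.0 command line in Python without running it. The --stats values, --include_multiallelic_snps and --fst_type come from the feature list above. The --vcf, --populations and --window_size flags, the file names, and the helper build_pixy_command are assumptions added for illustration, and passing several values to a single --stats is assumed to work as it does for `--stats pi fst dxy`.

```python
import shlex


def build_pixy_command(vcf, populations, stats, window_size=10000,
                       fst_type=None, include_multiallelic=False):
    """Return a pixy argument list (hypothetical helper, not part of pixy)."""
    # --vcf, --populations and --window_size are assumed from typical pixy usage.
    cmd = ["pixy", "--stats", *stats,
           "--vcf", vcf,
           "--populations", populations,
           "--window_size", str(window_size)]
    if fst_type is not None:
        cmd += ["--fst_type", fst_type]
    if include_multiallelic:
        cmd.append("--include_multiallelic_snps")
    return cmd


cmd = build_pixy_command(
    vcf="all_sites.vcf.gz",          # placeholder file name
    populations="populations.txt",   # placeholder file name
    stats=["watterson_theta", "tajima_d", "fst"],
    fst_type="hudson",
    include_multiallelic=True,
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can later be handed to subprocess.run without shell quoting issues.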
Bug fixes
Multiallelic-site handling has been corrected. Previously, sites with more than two alleles could be counted incorrectly in FST computations.
Fixed the standard-deviation calculation used for Tajima's D.
Fixed dxy and FST for haploid input.
Fixed Watterson's θ for haploid input.
The check for invariant sites is now more permissive about VCF formatting and no longer false-positives on valid all-sites VCFs (#185).
Suppressed noisy warnings from scikit-allel (#183).
Tabix index mtime is refreshed when needed to avoid "index older than data" warnings on some filesystems.
Project / packaging
Build system migrated to Poetry.
Strict static type-checking with mypy.
ruff for linting and formatting.
Continuous integration runs the full test suite on every PR.
Python 3.9, 3.10 and 3.11 are supported.
pixy 1.0.0
To coincide with the publication of the pixy manuscript, we're
very happy to announce the release of pixy version 1.0.0.
This was a major update to pixy and included a number of major
performance increases, new features, simplifications, and many minor
fixes. Note that this version contains breaking changes, and old
pipelines will need to be updated. We have also validated that the
estimates of π, dxy and FST produced by 1.0.0 are
identical to those of 0.95.02 (the version used in the manuscript).
Summary of major changes
All calculations are now much faster and natively parallelizable.
Memory usage vastly reduced.
BED and sites file support allows huge flexibility in windows / targeting sites of different classes of genomic elements.
Genotype filtration has been removed.
No change in the core summary statistics (π, dxy, FST) produced by pixy.
htslib is now a hard dependency, and must be installed separately.
VCFs must be compressed with bgzip and indexed with tabix (from htslib) before being used with pixy.
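The bgzip and tabix requirement can be scripted. The following is a minimal sketch, assuming an uncompressed VCF and that both htslib tools are on PATH. The helpers index_commands and prepare_vcf and the file name are hypothetical and not part of pixy.

```python
import shutil
import subprocess


def index_commands(vcf_path):
    """Return the bgzip and tabix commands for a plain-text VCF (hypothetical helper)."""
    gz_path = vcf_path + ".gz"
    return [
        ["bgzip", vcf_path],              # writes vcf_path + ".gz" and removes the original
        ["tabix", "-p", "vcf", gz_path],  # writes gz_path + ".tbi"
    ]


def prepare_vcf(vcf_path):
    """Compress and index a VCF, assuming bgzip and tabix are on PATH."""
    for tool in ("bgzip", "tabix"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; install htslib first")
    for cmd in index_commands(vcf_path):
        subprocess.run(cmd, check=True)
    return vcf_path + ".gz"


# Only prints the commands; call prepare_vcf("all_sites.vcf") to run them.
print(index_commands("all_sites.vcf"))
```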
The performance increase and stability of numerical results are shown in the following plots:
Detailed changelog
Major changes
pixy calculations can now be fully parallelized by specifying --n_cores [number of cores] at the command line.
Implemented using the multiprocessing module, which is now a hard dependency.
Supported under both Linux and macOS (using fork and spawn modes respectively).
Many of the core computations have been vectorized with NumPy, resulting in significant performance gains.
Memory usage is now much lower, more intelligently handled, and configurable by the user via the --chunk_size argument.
Large windows (e.g. whole chromosomes) are dynamically split into chunks and reassembled after summarization.
Small windows are grouped into larger chunks to prevent I/O bottlenecks associated with frequently re-reading the source VCF.
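One way to picture the chunking behaviour described above is the following sketch. It is not pixy's implementation: the function plan_chunks, its greedy grouping rule, the assumption that windows arrive sorted by chromosome and position, and the treatment of chunk_size as a number of base pairs are all choices made for illustration.

```python
def plan_chunks(windows, chunk_size):
    """Plan VCF reads for (chrom, start, end) windows with 1-based inclusive coordinates.

    Returns a list of (chrom, start, end, window_ids) chunks. Hypothetical
    helper: a sketch of the idea, not pixy's implementation.
    """
    chunks = []
    group = None  # open group of small windows: [chrom, start, end, ids]
    for win_id, (chrom, start, end) in enumerate(windows):
        if end - start + 1 > chunk_size:
            # Large window: flush any open group, then split into pieces that
            # all carry the same window id so they can be reassembled later.
            if group is not None:
                chunks.append(tuple(group))
                group = None
            for piece_start in range(start, end + 1, chunk_size):
                piece_end = min(piece_start + chunk_size - 1, end)
                chunks.append((chrom, piece_start, piece_end, [win_id]))
        elif (group is not None and group[0] == chrom
              and end - group[1] + 1 <= chunk_size):
            # Small window that still fits: extend the open group.
            group[2] = max(group[2], end)
            group[3].append(win_id)
        else:
            # Small window that does not fit: start a new group.
            if group is not None:
                chunks.append(tuple(group))
            group = [chrom, start, end, [win_id]]
    if group is not None:
        chunks.append(tuple(group))
    return chunks


windows = [("chr1", 1, 100), ("chr1", 101, 200), ("chr1", 201, 1200), ("chr2", 1, 50)]
for chunk in plan_chunks(windows, chunk_size=500):
    print(chunk)
```

In this example the two 100 bp windows share one read, while the 1000 bp window is read in two pieces that both carry window id 2, so their per-chunk results can be summed back into a single window.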
New features
Support for BED files specifying windows over which to calculate π / dxy / FST. These windows can be heterogeneous in size, enabling precise matching of pixy output with the output of other programs.
Support for a tab-separated 'sites file' specifying sites (CHROM, POS) where summary statistics should be exclusively calculated. This also enables, for example, estimates of π using only 4-fold degenerate sites or for a particular class of genes.
Basic support for site-level statistics (1 bp scale, though much slower than windowed statistics).
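To make the two input formats concrete, the sketch below writes a small BED file with windows of different sizes and a sites file listing individual positions. The file names, coordinates, and the helper write_tsv are made up for illustration. The headerless three-column BED layout (chromosome, start, end) and the headerless CHROM/POS layout are assumptions about what pixy expects, so check the pixy documentation before relying on them.

```python
import csv
import os
import tempfile


def write_tsv(path, rows):
    """Write rows as headerless tab-separated text (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


out_dir = tempfile.mkdtemp()

# Windows of unequal size, e.g. matching regions reported by another program.
bed_rows = [("chr1", 1, 5000), ("chr1", 5001, 5400), ("chr2", 250, 90000)]
bed_path = os.path.join(out_dir, "windows.bed")
write_tsv(bed_path, bed_rows)

# Individual sites (CHROM, POS), e.g. 4-fold degenerate positions.
site_rows = [("chr1", 12), ("chr1", 15), ("chr2", 301)]
sites_path = os.path.join(out_dir, "sites.txt")
write_tsv(sites_path, site_rows)

with open(bed_path) as handle:
    print(handle.read(), end="")
```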
Removed features
pixy no longer makes use of a Zarr database for storing on-disk intermediate genotype information. We instead now perform random access of the VCF via tabix from htslib as implemented in scikit-allel. As such, htslib is now a hard dependency. We think tabix is a much more flexible system for many datasets, and the performance differences are negligible (and offset by the new performance features in v1.0). VCFs will need to be compressed with bgzip and indexed with tabix before using pixy.
Other than requiring all variants to be biallelic SNPs, pixy no longer performs filtration of any kind. We decided that filtration was outside the scope of the functionality we wanted pixy to have. There are already many excellent tools that perform filtration, and pre-filtering creates a filtered VCF that can be used for other analyses. We now strongly recommend that users pre-filter their invariant sites VCFs using VCFtools and/or BCFtools. We provide an example shell script with this functionality (retaining invariant sites as required) as a template for users to edit for their needs.
Minor updates
The pre-calculation checks performed by pixy are now more extensive and systematic.
The method for calculating the number of valid sites has been slightly adjusted to be more accurate (this was calculated independently of the π / dxy / FST statistics).
We've refactored and restructured much of the code, with a focus on increased functionalization. This should make community contributions and future updates much easier.
To reduce confusion, output prefix and output folder are now separate arguments.
The documentation for pixy has been extensively updated to reflect the new changes in version 1.0.0.
Other bugfixes
Total computation time is now properly displayed.
For FST: regions with no variant sites will now have NA in the output file, instead of not being represented.