Understanding pixy output
Output file contents
pixy writes one output file per summary statistic that you request.
Each file is a tab-separated table, named
[output_prefix]_[stat].txt (e.g. pixy_pi.txt,
pixy_watterson_theta.txt). The columns in each file are documented
below.
Within-population nucleotide diversity (pi)
File: [prefix]_pi.txt
popThe ID of the population from the populations file.
chromosomeThe chromosome or contig.
window_pos_1First position of the genomic window.
window_pos_2Last position of the genomic window.
avg_piAverage per-site nucleotide diversity for the window.
pixycomputes the weighted average nucleotide diversity per site over all sites in the window, where weights are determined by the number of genotyped samples at each site.no_sitesTotal number of sites in the window with at least one valid genotype. This is included for the user and is not directly used in any calculation.
count_diffsRaw number of pairwise differences between all genotypes in the window. This is the numerator of
avg_pi.count_comparisonsRaw number of non-missing pairwise comparisons between all genotypes in the window. This is the denominator of
avg_pi.count_missingRaw number of missing pairwise comparisons in the window.
Between-population nucleotide divergence (dxy)
File: [prefix]_dxy.txt
pop1,pop2The IDs of the two populations being compared.
chromosome,window_pos_1,window_pos_2The chromosome and window coordinates (as for
pi).avg_dxyAverage per-site nucleotide divergence for the window.
no_sitesNumber of sites in the window with at least one valid genotype in both populations.
count_diffs,count_comparisons,count_missingRaw numerator, denominator and missing-comparison counts, defined analogously to the
pifile but across the two populations.
FST (fst)
File: [prefix]_fst.txt
pop1,pop2The IDs of the two populations being compared.
chromosome,window_pos_1,window_pos_2Window coordinates.
avg_wc_fstoravg_hudson_fstThe window-averaged FST, per SNP (not per site). Which column name you see depends on
--fst_type:wc(Weir & Cockerham 1984, the default) producesavg_wc_fst;hudson(Hudson 1992 / Bhatia et al. 2013) producesavg_hudson_fst.no_snpsTotal number of variable sites (SNPs) in the window.
wc_fst_a,wc_fst_b,wc_fst_cPresent when
--fst_componentsand--fst_type wcare used. These are the summed Weir & Cockerham variance components for all SNPs in the window.hudson_fst_num,hudson_fst_denPresent when
--fst_componentsand--fst_type hudsonare used. These are the summed numerator and denominator terms for all SNPs in the window.
Watterson's θ (watterson_theta)
File: [prefix]_watterson_theta.txt
Watterson's θ is an estimator of the population mutation rate computed
from the number of segregating sites. The unbiased estimator implemented
in pixy corrects for missing data and arbitrary ploidy. See Bailey,
Stevison & Samuk (2025) for the derivation.
popThe ID of the population.
chromosome,window_pos_1,window_pos_2Window coordinates.
avg_watterson_thetaPer-site Watterson's θ for the window.
no_sitesTotal number of sites in the window with at least one valid genotype in the focal population.
raw_watterson_thetaSum of per-site θ contributions over the window, used as the numerator of
avg_watterson_theta.no_var_sitesNumber of segregating (variant) sites in the window that contributed to the estimate.
weighted_no_sitesAuxiliary effective-site count that downweights sites by the fraction of observed alleles. This column is provided as a missingness diagnostic and is not used as the denominator of
avg_watterson_theta.
Tajima's D (tajima_d)
File: [prefix]_tajima_d.txt
Tajima's D contrasts two estimators of θ — π (based on pairwise
differences) and Watterson's θ (based on segregating sites) — to detect
departures from neutrality. pixy reports the unbiased estimator
described in Bailey, Stevison & Samuk (2025), which handles missing
data correctly.
popThe ID of the population.
chromosome,window_pos_1,window_pos_2Window coordinates.
tajima_dTajima's D for the window.
no_sitesTotal number of sites in the window with at least one valid genotype.
raw_pi,raw_watterson_thetaThe raw, unscaled π and Watterson's θ component sums used to compute D (the numerator is
raw_pi - raw_watterson_theta).tajima_d_stdevStandard deviation of the D statistic over the window (the denominator).
tajima_d_s_countsOptional column emitted only with
--tajima_components. This is a comma-separated list ofobserved_alleles:segregating_sitespairs used to recomputetajima_d_stdevexactly when aggregating windows after runningpixy.
Working with pixy output data
For plotting workflows (long-format conversion, multi-statistic panels, genome-wide plots) see Plotting pixy output.
Post-hoc aggregating
If you want to combine information across windows after the fact (e.g. by averaging), do not simply average the per-window summary statistics. Instead, sum the raw counts or components and recompute the statistic.
Pi (pi)
For π, sum count_diffs and count_comparisons across windows, then
divide:
sum(count_diffs) / sum(count_comparisons)
Dxy (dxy)
For dxy, use the same ratio as π:
sum(count_diffs) / sum(count_comparisons)
FST (fst)
For FST, run with --fst_components and recompute from the
summed estimator components. For Weir & Cockerham FST:
sum(wc_fst_a) / (sum(wc_fst_a) + sum(wc_fst_b) + sum(wc_fst_c))
For Hudson FST:
sum(hudson_fst_num) / sum(hudson_fst_den)
Watterson's θ (watterson_theta)
Watterson's θ can also be aggregated across windows. Sum
raw_watterson_theta across windows and divide by the sum of
no_sites:
sum(raw_watterson_theta) / sum(no_sites)
For Watterson's θ, no_var_sites and weighted_no_sites can also be
summed across windows as descriptive columns, but neither is the
denominator of avg_watterson_theta.
Tajima's D (tajima_d)
Tajima's D cannot be exactly aggregated post hoc from the default output
columns because tajima_d_stdev is not additive. Do not average
tajima_d values across windows. To enable exact post-hoc aggregation,
run with --tajima_components and sum tajima_d_s_counts by observed
allele count across the windows being combined.
Each tajima_d_s_counts entry is one or more
observed_alleles:segregating_sites pairs. Here observed_alleles is
the number of called alleles at those segregating sites, not the number of
diploid individuals. For example, if two windows report
tajima_d_s_counts as 8:3,10:12 and 8:2,12:5, combine them as
8:5,10:12,12:5 before recomputing the denominator.
This R helper mirrors the denominator calculation used by pixy. Pass
aggregate_tajima_d() a data frame containing the windows to combine:
parse_tajima_d_s_counts <- function(value) {
if (length(value) == 0 || is.na(value)) {
return(setNames(numeric(), character()))
}
value <- as.character(value)
if (value == "" || value == "NA") {
return(setNames(numeric(), character()))
}
counts <- setNames(numeric(), character())
for (item in strsplit(value, ",", fixed = TRUE)[[1]]) {
pair <- strsplit(item, ":", fixed = TRUE)[[1]]
n <- pair[1]
old <- counts[n]
if (is.na(old)) {
old <- 0
}
counts[n] <- old + as.numeric(pair[2])
}
counts
}
combine_tajima_d_s_counts <- function(values) {
total <- setNames(numeric(), character())
for (value in values) {
counts <- parse_tajima_d_s_counts(value)
for (n in names(counts)) {
old <- total[n]
if (is.na(old)) {
old <- 0
}
total[n] <- old + counts[n]
}
}
total
}
calc_tajima_d_stdev <- function(s_counts) {
stdev <- 0
for (n_name in names(s_counts)) {
n <- as.integer(n_name)
s <- as.numeric(s_counts[n_name])
if (is.na(n) || n < 2 || s <= 0) {
next
}
i <- seq_len(n - 1)
a1 <- sum(1 / i)
a2 <- sum(1 / (i^2))
b1 <- (n + 1) / (3 * (n - 1))
b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
c1 <- b1 - (1 / a1)
c2 <- b2 - ((n + 2) / (a1 * n)) + (a2 / (a1^2))
e1 <- c1 / a1
e2 <- c2 / (a1^2 + a2)
stdev <- stdev + sqrt((e1 * s) + (e2 * s * (s - 1)))
}
stdev
}
aggregate_tajima_d <- function(rows) {
raw_pi <- sum(rows$raw_pi, na.rm = TRUE)
raw_watterson_theta <- sum(rows$raw_watterson_theta, na.rm = TRUE)
s_counts <- combine_tajima_d_s_counts(rows$tajima_d_s_counts)
tajima_d_stdev <- calc_tajima_d_stdev(s_counts)
tajima_d <- if (tajima_d_stdev <= 0) {
NA_real_
} else {
(raw_pi - raw_watterson_theta) / tajima_d_stdev
}
data.frame(
tajima_d = tajima_d,
no_sites = sum(rows$no_sites, na.rm = TRUE),
raw_pi = raw_pi,
raw_watterson_theta = raw_watterson_theta,
tajima_d_stdev = tajima_d_stdev
)
}
The final calculation is:
(sum(raw_pi) - sum(raw_watterson_theta)) / recomputed_tajima_d_stdev
If recomputed_tajima_d_stdev is zero, the aggregated Tajima's D is
undefined and should be reported as NA. Sum no_sites across windows
for the aggregated site count. The important detail is that
tajima_d_stdev itself is not summable; tajima_d_s_counts is the
summable denominator component.
When pixy internally chunks a large requested window, it uses the same
segregating-site counts by observed allele count to recompute
tajima_d_stdev for the requested output window.