************ Example data ************ A small all-sites VCF and matching populations file are included with the ``pixy`` source tree so users can verify their installation and reproduce the figures shown in the documentation. Download ======== The files live under ``docs/example_data/`` in the ``pixy`` GitHub repository. They are deliberately tiny so they can be downloaded quickly and committed to the repo without bloating clones. .. list-table:: :header-rows: 1 :widths: 32 12 56 * - File - Size - Description * - `dmel_2pop.vcf.gz `_ - ~200 KB - bgzipped all-sites VCF, 50,000 sites on a single chromosome (``chr2L``), 20 diploid samples split across two populations. * - `dmel_2pop.vcf.gz.tbi `_ - <1 KB - tabix index for the VCF. * - `dmel_populations.txt `_ - <1 KB - Populations file: 10 samples in ``pop_A``, 10 in ``pop_B``. Or from the command line: .. code:: console BASE=https://github.com/ksamuk/pixy/raw/master/docs/example_data curl -LO $BASE/dmel_2pop.vcf.gz curl -LO $BASE/dmel_2pop.vcf.gz.tbi curl -LO $BASE/dmel_populations.txt How the data were generated =========================== The VCF was simulated with `vcfsim `_, a thin wrapper around `msprime `_ for producing all-sites VCFs. The demographic model is a single split: an ancestral population *C* splits into two derived populations *A* and *B* at ``--div_time`` generations before present. Parameters were taken from the `stdpopsim `_ *Drosophila melanogaster* species model: .. list-table:: :header-rows: 1 :widths: 30 30 40 * - Parameter - Value - Source * - Mutation rate (μ) - 5.49 × 10\ :sup:`-9` per site per generation - Keightley *et al.* 2014 (stdpopsim DroMel) * - Effective population size (N\ :sub:`e`) - 1.65 × 10\ :sup:`6` - Li & Stephan 2006 (stdpopsim DroMel) * - Divergence time - 500,000 generations - Chosen to give a moderate F\ :sub:`ST` (≈ 0.1–0.2) * - Sequence length - 50,000 bp - Single chromosome ``chr2L`` * - Samples per population - 10 diploid individuals - * - Ploidy - 2 - * - Missing data - 0% - Clean baseline; see notes below. The exact command used was: .. code:: console vcfsim \ --seed 2024 \ --replicates 1 \ --chromosome chr2L \ --sequence_length 50000 \ --ploidy 2 \ --Ne 1650000 \ --mu 5.49e-9 \ --population_mode 2 \ --div_time 500000 \ --sample_size 20 \ --percent_missing_sites 0 \ --percent_missing_genotypes 0 \ --output_file dmel_2pop The output ``dmel_2pop2024.vcf`` was then renamed, bgzipped, and indexed: .. code:: console mv dmel_2pop2024.vcf dmel_2pop.vcf bgzip dmel_2pop.vcf tabix -p vcf dmel_2pop.vcf.gz The populations file was constructed by hand based on ``vcfsim``'s sample ordering. With ``--population_mode 2 --sample_size 20``, the samples emitted are ``tsk_1`` through ``tsk_20``: the first 10 (``tsk_1``–``tsk_10``) are drawn from population *A* and the next 10 (``tsk_11``–``tsk_20``) from population *B*. See the ``vcfsim`` source for the exact assignment. Running pixy on the example =========================== .. code:: console pixy --stats pi fst dxy watterson_theta tajima_d \ --vcf dmel_2pop.vcf.gz \ --populations dmel_populations.txt \ --window_size 5000 \ --n_cores 2 \ --output_prefix dmel_example This produces ``dmel_example_pi.txt``, ``dmel_example_fst.txt``, ``dmel_example_dxy.txt``, ``dmel_example_watterson_theta.txt`` and ``dmel_example_tajima_d.txt`` in your working directory. See :doc:`output` for the meaning of each column. Expected results ---------------- With the parameters above (μ = 5.49×10\ :sup:`-9`, N\ :sub:`e` = 1.65×10\ :sup:`6`), the infinite-sites expectation for per-site heterozygosity in either daughter population is θ = 4·N\ :sub:`e`·μ ≈ 0.0362. ``vcfsim`` (via ``msprime``) uses the JC69 finite-sites mutation model by default, so the same site can be hit by repeated or back-mutations and the expected *observable* proportion of pairwise differences is slightly lower: .. parsed-literal:: π ≈ θ / (1 + 4θ/3) ≈ 0.035 and d\ :sub:`xy` between the two populations is slightly higher because of the divergence. F\ :sub:`ST` should fall in the 0.1–0.2 range. With only 50 kb of sequence and 10 individuals per population there is substantial sampling noise, so per-window estimates will vary widely; aggregating across windows (see :doc:`output`) recovers values close to the expectations. .. note:: These are theoretical values from neutral coalescent theory. Empirical π in *D. melanogaster* is lower (≈ 0.005–0.01) because selection and linkage reduce diversity. The simulation used here is neutral, so it reflects the full neutral expectation rather than what is observed in real flies. Variants of the same data ------------------------- For testing how ``pixy`` handles imperfect data, you can regenerate the VCF with missing data by changing the missingness arguments to ``vcfsim``. A common pair is uniform 5% site-level and 2% genotype-level dropout: .. code:: console vcfsim ... \ --percent_missing_sites 0.05 \ --percent_missing_genotypes 0.02 \ --output_file dmel_2pop_missing This is exactly the scenario ``pixy`` was designed for: comparing the ``avg_pi`` produced by ``pixy`` against the theoretical value above demonstrates that the unbiased estimator handles missing data correctly, while the equivalent ``VCFtools`` ``--site-pi`` call gives a deflated estimate.