Welcome to pydamage’s documentation!

Homepage: github.com/maxibor/pydamage

pydamage logo GitHub release (latest by date including pre-releases) pydamage CI Documentation Status PyPI Conda

PyDamage

Pydamage, is a Python software to automate the process of contig damage identification and estimation. After modelling the ancient DNA damage using the C to T transitions, Pydamage uses a likelihood ratio test to discriminate between truly ancient, and modern contigs originating from sample contamination.

Installation

With pip

pip install pydamage

Install from source to use the development version

Using pip

pip install git+ssh://git@github.com/maxibor/pydamage.git@dev

By cloning in a dedicated conda environment

git clone git@github.com:maxibor/pydamage.git
cd pydamage
git checkout dev
conda env create -f environment.yml
conda activate pydamage
pip install -e .

Quick start

pydamage aligned.bam

CLI help

Command line interface help message

pydamage --help

API

pydamage.main.analyze_group(bam, wlen=30, show_al=False, process=1, outdir='', plot=False, verbose=False, force=False)[source]

Runs the pydamage analysis with all references grouped as one

Parameters
  • bam (str) – Path to alignment (sam/bam/cram) file

  • wlen (int) – window length

  • show_al (bool) – print alignments representations

  • process (int) – Number of processes for parellel computing

  • outdir (str) – Path to output directory

  • verbose (bool) – verbose mode

  • force (bool) – force overwriting of results directory

Returns

pandas DataFrame containg pydamage results

Return type

pd.DataFrame

pydamage.main.analyze_multi(bam, wlen=30, show_al=False, process=1, outdir='', plot=False, verbose=False, force=False)[source]

Runs the pydamage analysis for each reference separately

Parameters
  • bam (str) – Path to alignment (sam/bam/cram) file

  • wlen (int) – window length

  • show_al (bool) – print alignments representations

  • process (int) – Number of processes for parellel computing

  • outdir (str) – Path to output directory

  • verbose (bool) – verbose mode

  • force (bool) – force overwriting of results directory

Returns

pandas DataFrame containg pydamage results

Return type

pd.DataFrame

CLI

To access the help menu:

$ pydamage --help

The list of arguments of options is detailed below

pydamage

PyDamage: Damage parameter estimation for ancient DNA
Author: Maxime Borry
Contact: <borry[at]shh.mpg.de>
Homepage & Documentation: github.com/maxibor/pydamage

BAM: path to BAM/SAM/CRAM alignment file. MD tags need to be set.

pydamage [OPTIONS] BAM

Options

--version

Show the version and exit.

-w, --wlen <wlen>

Window length (in bp) for damage modeling

Default

35

-p, --process <process>

Number of processes for parallel computing

Default

2

-o, --outdir <outdir>

Output directory

Default

pydamage_results

-s, --show_al

Display alignments representations

-pl, --plot

Write damage fitting plots to disk

-vv, --verbose

Verbose mode

-f, --force

Force overwriting of results directory

-g, --group

Use entire BAM file as single reference for analyis (ignore reference headers)

Arguments

BAM

Required argument

Output

Pydamage generates both a tabular and a visual output.

The tabular output is a comma-separated file (.csv) with the following columns, for each analysed reference:

  • reference: name of the reference genome/contig

  • pred_accuracy: Predicted accuracy of Pydamage prediction, from the GLM modelling

  • null_model_p0: parameter p0 of the null model

  • null_model_p0_stdev: standard error of the null model paramater p0

  • damage_model_p: parameter p of the damage model

  • damage_model_p_stdev: standard error of the parameter p of the damage model

  • damage_model_pmin: paramater p_min of the damage model. This is the modelled damage baseline

  • damage_model_pmin_stdev: standard error of the paramater p_min of the damage model

  • damage_model_pmax: paramater p_max of the damage model. This is the modelled amount of damage on the 5’ end.

  • damage_model_pmax_stdev: standard error of the paramater p_max of the damage model

  • pvalue: p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution

  • qvalue: p-value corrected for multiple testing using Benjamini-Hochberg procedure. Only computed when multiple references are used

  • RMSE: residual mean standard error of the model fit of the damage model

  • nb_reads_aligned: number of aligned reads

  • coverage: average coverage along the reference genome

  • CtoT-N: Proportion of CtoT substitutions observed at position N from 5’ end

  • GtoA-N: Proportion of GtoA substitutions observed at position N from 5’

To select contigs/references with damage, you will most likely want to look at two columns: pred_accuracy > 0.9 and qvalue <= 0.05

The visual output are PNG files, one per reference contig. They show the frequency of observed C to T, and G to A transition at the 5’ end of the sequencing data and overlay it with the fitted models for both the null and the damage model, including 95% confidence intervals. Furthermore, it provides a “residuals versus fitted” plot to help evaluate the fit of the pydamage damage model. Finally, the plot contains informtion on the average coverage along the reference and the p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution.

The visual output is only produced when using the --plot flag

Example

pydamage_plot

Troubleshooting

My alignment files don’t have a MD tag

You can use samtools calmd to set the MD tag

Example:

samtools calmd -b alignment.bam reference.fasta > aln.bam

Indices and tables