Welcome to pydamage’s documentation!
Homepage: github.com/maxibor/pydamage

PyDamage
Pydamage, is a Python software to automate the process of contig damage identification and estimation. After modelling the ancient DNA damage using the C to T transitions, Pydamage uses a likelihood ratio test to discriminate between truly ancient, and modern contigs originating from sample contamination.
Installation
With conda (recommended)
conda install -c bioconda pydamage
With pip
pip install pydamage
Install from source to use the development version
Using pip
pip install git+ssh://git@github.com/maxibor/pydamage.git@dev
By cloning in a dedicated conda environment
git clone git@github.com:maxibor/pydamage.git
cd pydamage
git checkout dev
conda env create -f environment.yml
conda activate pydamage
pip install -e .
Quick start
pydamage --outdir result_directory analyze aligned.bam
Note that if you specify
--outdir
, it has to be before the PyDamage subcommand, example:pydamage --outdir test filter pydamage_results.csv
CLI help
Command line interface help message
pydamage --help
Documentation
Cite
PyDamage has been published in PeerJ: 10.7717/peerj.11845
@article{borry_pydamage_2021,
author = {Borry, Maxime and Hübner, Alexander and Rohrlach, Adam B. and Warinner, Christina},
doi = {10.7717/peerj.11845},
issn = {2167-8359},
journal = {PeerJ},
language = {en},
month = {July},
note = {Publisher: PeerJ Inc.},
pages = {e11845},
shorttitle = {PyDamage},
title = {PyDamage: automated ancient damage identification and estimation for contigs in ancient DNA de novo assembly},
url = {https://peerj.com/articles/11845},
urldate = {2021-07-27},
volume = {9},
year = {2021}
}
API
- pydamage.main.pydamage_analyze(bam, wlen=30, minlen=0, show_al=False, process=1, outdir='', plot=False, verbose=False, force=False, group=False)[source]
Runs the pydamage analysis for each reference separately
- Parameters
bam (str) – Path to alignment (sam/bam/cram) file
minlen (int) – minimum reference sequence length threshold
wlen (int) – window length
show_al (bool) – print alignments representations
process (int) – Number of processes for parellel computing
outdir (str) – Path to output directory
verbose (bool) – verbose mode
force (bool) – force overwriting of results directory
- Returns
pandas DataFrame containg pydamage results
- Return type
pd.DataFrame
CLI
To access the help menu:
$ pydamage --help
The list of arguments of options is detailed below
pydamage
pydamage [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
- -o, --outdir <outdir>
Output directory
- Default
pydamage_results
analyze
BAM: path to BAM/SAM/CRAM alignment file. MD tags need to be set.
pydamage analyze [OPTIONS] BAM
Options
- -w, --wlen <wlen>
Window length (in bp) for damage modeling
- Default
20
- -m, --minlen <minlen>
Minimum of length of reference sequence to consider
- Default
0
- -p, --process <process>
Number of processes for parallel computing
- Default
2
- -s, --show_al
Display alignments representations
- -pl, --plot
Write damage fitting plots to disk
- -vv, --verbose
Verbose mode
- -f, --force
Force overwriting of results directory
- -g, --group
Use entire BAM file as single reference for analysis (ignore reference headers)
Arguments
- BAM
Required argument
cite
Get pydamage citation in bibtex format
pydamage cite [OPTIONS]
filter
CSV: path to PyDamage result file
pydamage filter [OPTIONS] CSV
Options
- -t, --threshold <threshold>
Predicted accuracy filtering threshold. Set to 0 for finding threshold with kneed method
- Default
0.5
Arguments
- CSV
Required argument
Output
Pydamage generates both a tabular and a visual output.
The tabular outputs are comma-separated file (.csv
) with the following columns, for each analysed reference:
pydamage_results.csv
reference
: name of the reference genome/contigpredicted_accuracy
: Predicted accuracy of Pydamage prediction, from the GLM modellingnull_model_p0
: parameterp0
of the null modelnull_model_p0_stdev
: standard error of the null model paramaterp0
damage_model_p
: parameterp
of the damage modeldamage_model_p_stdev
: standard error of the parameterp
of the damage modeldamage_model_pmin
: paramaterp_min
of the damage model. This is the modelled damage baselinedamage_model_pmin_stdev
: standard error of the paramaterp_min
of the damage modeldamage_model_pmax
: paramaterp_max
of the damage model. This is the modelled amount of damage on the 5’ end.damage_model_pmax_stdev
: standard error of the paramaterp_max
of the damage modelpvalue
: p-value calculated from the likelihood-ratio test-statistic using a chi-squared distributionqvalue
: p-value corrected for multiple testing using Benjamini-Hochberg procedure. Only computed when multiple references are usedRMSE
: residual mean standard error of the model fit of the damage modelnb_reads_aligned
: number of aligned readscoverage
: average coverage along the reference genomeCtoT-N
: Proportion of CtoT substitutions observed at positionN
from 5’ endGtoA-N
: Proportion of GtoA substitutions observed at positionN
from 5’
pydamage_filtered_results.csv
Same file as above, but with contigs filtered with qvalue <= 0.05
and predicted_accuracy >= threshold
with a user defined filtering threshold (default = 0.5), or determined with the kneedle method.
Plots
The visual output are PNG files, one per reference contig. They show the frequency of observed C to T, and G to A transition at the 5’ end of the sequencing data and overlay it with the fitted models for both the null and the damage model, including 95% confidence intervals. Furthermore, it provides a “residuals versus fitted” plot to help evaluate the fit of the pydamage damage model. Finally, the plot contains informtion on the average coverage along the reference and the p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution.
The visual output is only produced when using the
--plot
flag
Example
Tabular ouput
Visual output
Troubleshooting
My alignment files don’t have a MD tag
You can use samtools calmd to set the MD tag
Example:
samtools calmd -b alignment.bam reference.fasta > aln.bam