Welcome to pydamage’s documentation!¶
Homepage: github.com/maxibor/pydamage
PyDamage¶
Pydamage, is a Python software to automate the process of contig damage identification and estimation. After modelling the ancient DNA damage using the C to T transitions, Pydamage uses a likelihood ratio test to discriminate between truly ancient, and modern contigs originating from sample contamination.
Installation¶
With pip¶
pip install pydamage
Install from source to use the development version¶
Using pip
pip install git+ssh://git@github.com/maxibor/pydamage.git@dev
By cloning in a dedicated conda environment
git clone git@github.com:maxibor/pydamage.git
cd pydamage
git checkout dev
conda env create -f environment.yml
conda activate pydamage
pip install -e .
Quick start¶
pydamage aligned.bam
Documentation¶
API¶
- pydamage.main.analyze_group(bam, wlen=30, show_al=False, process=1, outdir='', plot=False, verbose=False, force=False)[source]¶
Runs the pydamage analysis with all references grouped as one
- Parameters
bam (str) – Path to alignment (sam/bam/cram) file
wlen (int) – window length
show_al (bool) – print alignments representations
process (int) – Number of processes for parellel computing
outdir (str) – Path to output directory
verbose (bool) – verbose mode
force (bool) – force overwriting of results directory
- Returns
pandas DataFrame containg pydamage results
- Return type
pd.DataFrame
- pydamage.main.analyze_multi(bam, wlen=30, show_al=False, process=1, outdir='', plot=False, verbose=False, force=False)[source]¶
Runs the pydamage analysis for each reference separately
- Parameters
bam (str) – Path to alignment (sam/bam/cram) file
wlen (int) – window length
show_al (bool) – print alignments representations
process (int) – Number of processes for parellel computing
outdir (str) – Path to output directory
verbose (bool) – verbose mode
force (bool) – force overwriting of results directory
- Returns
pandas DataFrame containg pydamage results
- Return type
pd.DataFrame
CLI¶
To access the help menu:
$ pydamage --help
The list of arguments of options is detailed below
pydamage¶
BAM: path to BAM/SAM/CRAM alignment file. MD tags need to be set.
pydamage [OPTIONS] BAM
Options
- --version¶
Show the version and exit.
- -w, --wlen <wlen>¶
Window length (in bp) for damage modeling
- Default
35
- -p, --process <process>¶
Number of processes for parallel computing
- Default
2
- -o, --outdir <outdir>¶
Output directory
- Default
pydamage_results
- -s, --show_al¶
Display alignments representations
- -pl, --plot¶
Write damage fitting plots to disk
- -vv, --verbose¶
Verbose mode
- -f, --force¶
Force overwriting of results directory
- -g, --group¶
Use entire BAM file as single reference for analyis (ignore reference headers)
Arguments
- BAM¶
Required argument
Output¶
Pydamage generates both a tabular and a visual output.
The tabular output is a comma-separated file (.csv
) with the following columns, for each analysed reference:
reference
: name of the reference genome/contigpred_accuracy
: Predicted accuracy of Pydamage prediction, from the GLM modellingnull_model_p0
: parameterp0
of the null modelnull_model_p0_stdev
: standard error of the null model paramaterp0
damage_model_p
: parameterp
of the damage modeldamage_model_p_stdev
: standard error of the parameterp
of the damage modeldamage_model_pmin
: paramaterp_min
of the damage model. This is the modelled damage baselinedamage_model_pmin_stdev
: standard error of the paramaterp_min
of the damage modeldamage_model_pmax
: paramaterp_max
of the damage model. This is the modelled amount of damage on the 5’ end.damage_model_pmax_stdev
: standard error of the paramaterp_max
of the damage modelpvalue
: p-value calculated from the likelihood-ratio test-statistic using a chi-squared distributionqvalue
: p-value corrected for multiple testing using Benjamini-Hochberg procedure. Only computed when multiple references are usedRMSE
: residual mean standard error of the model fit of the damage modelnb_reads_aligned
: number of aligned readscoverage
: average coverage along the reference genomeCtoT-N
: Proportion of CtoT substitutions observed at positionN
from 5’ endGtoA-N
: Proportion of GtoA substitutions observed at positionN
from 5’
To select contigs/references with damage, you will most likely want to look at two columns:
pred_accuracy > 0.9
andqvalue <= 0.05
The visual output are PNG files, one per reference contig. They show the frequency of observed C to T, and G to A transition at the 5’ end of the sequencing data and overlay it with the fitted models for both the null and the damage model, including 95% confidence intervals. Furthermore, it provides a “residuals versus fitted” plot to help evaluate the fit of the pydamage damage model. Finally, the plot contains informtion on the average coverage along the reference and the p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution.
The visual output is only produced when using the
--plot
flag
Troubleshooting¶
My alignment files don’t have a MD tag¶
You can use samtools calmd to set the MD tag
Example:
samtools calmd -b alignment.bam reference.fasta > aln.bam