Welcome to pydamage’s documentation!

Homepage: github.com/maxibor/pydamage

pydamage logo GitHub release (latest by date including pre-releases) pydamage CI Documentation Status PyPI Conda

PyDamage

Pydamage, is a Python software to automate the process of contig damage identification and estimation. After modelling the ancient DNA damage using the C to T transitions, Pydamage uses a likelihood ratio test to discriminate between truly ancient, and modern contigs originating from sample contamination.

Installation

With pip

pip install pydamage

Install from source to use the development version

Using pip

pip install git+ssh://git@github.com/maxibor/pydamage.git@dev

By cloning in a dedicated conda environment

git clone git@github.com:maxibor/pydamage.git
cd pydamage
git checkout dev
conda env create -f environment.yml
conda activate pydamage
pip install -e .

Quick start

pydamage aligned.bam reference.fa

CLI help

Command line interface help message

pydamage --help

API

pydamage.main.analyze(bam, fasta, wlen=30, show_al=False, mini=2000, cov=0.5, process=1, outdir='', plot=False, verbose=False, force=False)[source]

Runs the pydamage analysis

Parameters
  • bam (str) – Path to alignment (sam/bam/cram) file

  • fasta (str) – Path to fasta file containing reference sequences

  • wlen (int) – window length

  • show_al (bool) – print alignments representations

  • mini (int) – Minimum numbers of reads aligned to consider contigs

  • cov (float) – Minimum coverage to consider contig

  • process (int) – Number of processes for parellel computing

  • outdir (str) – Path to output directory

  • verbose (bool) – verbose mode

  • force (bool) – force overwriting of results directory

Returns

pandas DataFrame containg pydamage results

Return type

pd.DataFrame

CLI

To access the help menu:

$ pydamage --help

The list of arguments of options is detailed below

pydamage

PyDamage: Damage parameter estimation for ancient DNA
Author: Maxime Borry
Contact: <borry[at]shh.mpg.de>
Homepage & Documentation: github.com/maxibor/pydamage

BAM: path to BAM/SAM/CRAM alignment file FASTA: path to reference FASTA file

pydamage [OPTIONS] BAM FASTA

Options

--version

Show the version and exit.

-w, --wlen <wlen>

Window length for damage modeling

Default

35

-p, --process <process>

Number of processes

Default

2

-m, --mini <mini>

Minimum reads aligned to consider reference

Default

1000

-c, --cov <cov>

Minimum coverage to consider reference

Default

8

-s, --show_al

Show alignments representations

-pl, --plot

Make the damage plots

--verbose

Verbose mode

-o, --outdir <outdir>

Output directory

Default

pydamage_results

--force

Force overwriting of results directory

Arguments

BAM

Required argument

FASTA

Required argument

Output

Pydamage generates both a tabular and a visual output.

The tabular output is a comma-separated file (.csv) with the following columns, for each analysed reference:

  • reference: name of the reference genome/contig

  • null_model_p0: parameter p0 of the null model

  • null_model_p0_stdev: standard error of the null model paramater p0

  • damage_model_p: parameter p of the damage model

  • damage_model_p_stdev: standard error of the parameter p of the damage model

  • damage_model_pmin: paramater p_min of the damage model. This is the modelled damage baseline

  • damage_model_pmin_stdev: standard error of the paramater p_min of the damage model

  • damage_model_pmax: paramater p_max of the damage model. This is the modelled amount of damage on the 5’ end.

  • damage_model_pmax_stdev: standard error of the paramater p_max of the damage model

  • pvalue: p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution

  • qvalue: p-value corrected for multiple testing using Benjamini-Hochberg procedure

  • RMSE: residual mean standard error of the model fit of the damage model

  • nb_reads_aligned: number of aligned reads

  • coverage: average coverage along the reference genome

  • CtoT-N: Proportion of CtoT substitutions observed at position N from 5’ end

  • GtoA-N: Proportion of GtoA substitutions observed at position N from 5’

  • pred_accuracy: Predicted accuracy of Pydamage prediction, from the GLM modelling

The visual output are PNG files, one per reference contig. They show the frequency of observed C to T, and G to A transitions at the 5’ end of the sequencing data and overlay it with the fitted models for both the null and the damage model, including 95% confidence intervals. Furthermore, it provides a “residuals versus fitted” plot to help evaluate the fit of the pydamage damage model. Finally, the plot contains informtion on the average coverage along the reference and the p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution.

The visual output is only produced when using the --plot flag

Example

_images/NZ_JHCB02000011.1.png

Indices and tables