API

Pipelines

ShearSplink

class pyim.align.pipelines.ShearSplinkPipeline(transposon_path, bowtie_index_path, linker_path=None, contaminant_path=None, min_length=15, min_support=2, min_mapq=23, merge_distance=None, bowtie_options=None, min_overlaps=None, error_rates=None)

ShearSplink pipeline.

Analyzes (single-end) sequencing data that was prepared using the ShearSplink protocol. Sequence reads are expected to have the following structure:

[Transposon][Genomic][Linker]

Here, transposon refers to the flanking part of the transposon sequence, linker to the flanking linker sequence and genomic to the genomic DNA located in between (which varies per insertion). The linker sequence is optional and may be omitted if the linker is not included in sequencing.

The pipeline essentially performs the following steps:

1. If contaminants are provided, sequence reads are filtered (using Cutadapt) for the contaminant sequences.

2. The remaining reads are trimmed to remove the transposon and linker sequences, leaving only genomic sequences. Reads without the transposon/linker sequences are dropped, as we cannot be certain of their origin. (Note that the linker is optional and is only trimmed if a linker is given.)

3. The genomic reads are aligned to the reference genome.

4. The resulting alignment is used to identify insertions.
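
The trimming step can be illustrated with a minimal pure-Python sketch. It uses exact string matching on toy sequences, whereas the pipeline itself relies on Cutadapt, which tolerates mismatches. The function name `extract_genomic` and the example sequences are hypothetical and not part of pyim.

```python
from typing import Optional


def extract_genomic(read: str, transposon: str,
                    linker: Optional[str] = None,
                    min_length: int = 15) -> Optional[str]:
    """Return the genomic part of a [Transposon][Genomic][Linker] read.

    Hypothetical helper using exact matching; returns None for reads
    that would be dropped.
    """
    # Reads that do not start with the transposon are dropped.
    if not read.startswith(transposon):
        return None
    genomic = read[len(transposon):]

    # The linker is only required (and trimmed) if one is given.
    if linker is not None:
        pos = genomic.find(linker)
        if pos == -1:
            return None
        genomic = genomic[:pos]

    # Genomic sequences shorter than min_length are not kept.
    if len(genomic) < min_length:
        return None
    return genomic


# Toy sequences, chosen only for illustration.
transposon = "TTAACC"
linker = "GGCCAA"
genomic = "ACGTACGTACGTACGTACGT"

read = transposon + genomic + linker
print(extract_genomic(read, transposon, linker))              # genomic part
print(extract_genomic(genomic + linker, transposon, linker))  # None (no transposon)
```

Passing `linker=None` mirrors the optional linker: only the transposon is then required and removed.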

Note that this pipeline does NOT support multiplexed datasets (which is the default output of the ShearSplink protocol). For multiplexed datasets, use the MultiplexedShearSplinkPipeline.

Parameters:

transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
bowtie_index_path (Path) – Path to the bowtie index.
linker_path (Path) – Path to the linker sequence (fasta).
contaminant_path (Path) – Path to file containing contaminant sequences (fasta). If provided, sequences are filtered for these sequences before extracting genomic sequences for alignment.
min_length (int) – Minimum length for genomic reads to be kept for alignment.
min_support (int) – Minimum support for insertions to be kept in the final output.
min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
min_overlaps (Dict[str, int]) – Minimum overlap required to recognize the transposon, linker and contaminant sequences (see Cutadapt's documentation for more information). Keys of the dictionary indicate to which sequence the overlap corresponds and should be one of the following: linker, transposon or contaminant.
error_rates (Dict[str, float]) – Maximum error rate to use when recognizing transposon, linker and contaminant sequences (see Cutadapt's documentation for more information). Keys should be the same as for min_overlaps.
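
As a rough illustration of how these parameters fit together, the sketch below assembles keyword arguments that match the signature above and checks that the per-sequence dictionaries only use the documented keys. The helper `check_keys`, all file paths and all numeric values are placeholders chosen for this example. pyim is not imported, so the constructor call appears only as a comment.

```python
from pathlib import Path

# Sequence names accepted as keys, as listed in the parameter description.
VALID_KEYS = {"transposon", "linker", "contaminant"}


def check_keys(options: dict, name: str) -> dict:
    """Raise ValueError if a per-sequence option dict has an unknown key."""
    unknown = set(options) - VALID_KEYS
    if unknown:
        raise ValueError(f"Unknown keys in {name}: {sorted(unknown)}")
    return options


# All paths and numeric values below are placeholders for illustration.
pipeline_kwargs = dict(
    transposon_path=Path("transposon.fa"),
    bowtie_index_path=Path("reference/index"),
    linker_path=Path("linker.fa"),
    min_length=15,
    min_support=2,
    min_mapq=23,
    min_overlaps=check_keys({"transposon": 20, "linker": 20}, "min_overlaps"),
    error_rates=check_keys({"transposon": 0.1, "linker": 0.1}, "error_rates"),
)

# With pyim installed, these could be passed on as:
#   ShearSplinkPipeline(**pipeline_kwargs)
```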

class pyim.align.pipelines.MultiplexedShearSplinkPipeline(transposon_path, bowtie_index_path, barcode_path, barcode_mapping=None, linker_path=None, contaminant_path=None, min_length=15, min_support=2, min_mapq=23, merge_distance=0, bowtie_options=None, min_overlaps=None, error_rates=None)

ShearSplink pipeline supporting multiplexed reads.

Analyzes multiplexed (single-end) sequencing data that was prepared using the ShearSplink protocol. Sequence reads are expected to have the following structure:

[Barcode][Transposon][Genomic][Linker]

Here, the transposon, genomic and linker sequences are the same as for the ShearSplinkPipeline. The barcode sequence is an index that indicates which sample the read originated from.

Barcode sequences should be provided using the barcode_path argument. The optional barcode_mapping argument can be used to map barcodes to sample names.
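
Conceptually, demultiplexing assigns each read to a sample based on its leading barcode. The sketch below shows this idea with exact prefix matching on toy data. The function `assign_sample`, the barcode names and the sequences are hypothetical, and falling back to the barcode name when no mapping entry exists is an assumption rather than documented pyim behavior.

```python
from typing import Dict, Optional, Tuple


def assign_sample(read: str, barcodes: Dict[str, str],
                  barcode_mapping: Optional[Dict[str, str]] = None
                  ) -> Optional[Tuple[str, str]]:
    """Return (sample, remaining read) for a read starting with a known barcode.

    Hypothetical helper; returns None if no barcode matches.
    """
    for name, sequence in barcodes.items():
        if read.startswith(sequence):
            sample = name
            if barcode_mapping is not None:
                # Assumption: unmapped barcodes keep their barcode name.
                sample = barcode_mapping.get(name, name)
            # Strip the barcode, leaving [Transposon][Genomic][Linker].
            return sample, read[len(sequence):]
    return None


# Toy barcodes (name -> sequence) and an optional barcode -> sample mapping.
barcodes = {"BC01": "AAGG", "BC02": "CCTT"}
mapping = {"BC01": "tumor_1", "BC02": "tumor_2"}

read = "CCTT" + "TTAACC" + "ACGTACGT"
print(assign_sample(read, barcodes, mapping))  # ('tumor_2', 'TTAACCACGTACGT')
print(assign_sample(read, barcodes))           # ('BC02', 'TTAACCACGTACGT')
```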

Parameters:

transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
bowtie_index_path (Path) – Path to the bowtie index.
barcode_path (Path) – Path to barcode sequences (fasta).
barcode_mapping (Path) – Path to a tsv file specifying a mapping from barcodes to sample names. Should contain sample and barcode columns.
linker_path (Path) – Path to the linker sequence (fasta).
contaminant_path (Path) – Path to file containing contaminant sequences (fasta). If provided, sequences are filtered for these sequences before extracting genomic sequences for alignment.
min_length (int) – Minimum length for genomic reads to be kept for alignment.
min_support (int) – Minimum support for insertions to be kept in the final output.
min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
min_overlaps (Dict[str, int]) – Minimum overlap required to recognize the transposon, linker and contaminant sequences (see Cutadapt's documentation for more information). Keys of the dictionary indicate to which sequence the overlap corresponds and should be one of the following: linker, transposon or contaminant.
error_rates (Dict[str, float]) – Maximum error rate to use when recognizing transposon, linker and contaminant sequences (see Cutadapt's documentation for more information). Keys should be the same as for min_overlaps.
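
To make the expected layout of the barcode_mapping file concrete, the sketch below parses a small tab-separated table with sample and barcode columns using the standard library. The function `read_barcode_mapping` and the example values are hypothetical. Whether the barcode column holds barcode names or sequences is not stated above; names are assumed here.

```python
import csv
import io
from typing import Dict, TextIO


def read_barcode_mapping(handle: TextIO) -> Dict[str, str]:
    """Read a tsv with 'sample' and 'barcode' columns into {barcode: sample}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["barcode"]: row["sample"] for row in reader}


# In-memory stand-in for a mapping file (values are made up).
example_tsv = "sample\tbarcode\ntumor_1\tBC01\ntumor_2\tBC02\n"

mapping = read_barcode_mapping(io.StringIO(example_tsv))
print(mapping)  # {'BC01': 'tumor_1', 'BC02': 'tumor_2'}
```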

Nextera

class pyim.align.pipelines.NexteraPipeline(transposon_path, bowtie_index_path, bowtie_options=None, min_length=15, min_support=2, min_mapq=23, merge_distance=None, threads=1)

Nextera-based transposon pipeline.

Analyzes paired-end sequence data that was prepared using a Nextera-based protocol. Sequence reads are expected to have the following structure:

Mate 1:
    [Genomic]

Mate 2:
    [Transposon][Genomic]

Here, transposon refers to the flanking part of the transposon sequence and genomic refers to the genomic DNA located between the transposon sequence and the adapter sequence. Note that the adapter itself is not sequenced and is therefore not part of the reads. However, the end of Mate 1 is considered to terminate at the adapter and as such represents the breakpoint between the genomic DNA and the adapter.

The pipeline essentially performs the following steps:

1. Mates are trimmed to remove the transposon sequence, dropping any reads not containing the transposon.

2. The remaining mates are trimmed to remove any sequences from the Nextera transposase.

3. The trimmed mates are aligned to the reference genome.

4. The resulting alignment is used to identify insertions.
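
The first step can be sketched for a single read pair as follows. This is a simplified illustration with exact matching on toy sequences, not the pipeline's implementation. The function `trim_pair` is hypothetical, applying min_length to both mates is an assumption, and the removal of Nextera transposase sequences is not modeled.

```python
from typing import Optional, Tuple


def trim_pair(mate1: str, mate2: str, transposon: str,
              min_length: int = 15) -> Optional[Tuple[str, str]]:
    """Trim the transposon from mate 2 of a ([Genomic], [Transposon][Genomic]) pair.

    Hypothetical helper; returns None for pairs that would be dropped.
    """
    # Pairs whose second mate lacks the transposon are dropped.
    if not mate2.startswith(transposon):
        return None
    genomic2 = mate2[len(transposon):]

    # Assumption: both mates must retain at least min_length bases.
    if len(mate1) < min_length or len(genomic2) < min_length:
        return None
    return mate1, genomic2


# Toy sequences, chosen only for illustration.
transposon = "TTAACC"
mate1 = "GATTACAGATTACAGATTACA"
mate2 = transposon + "ACGTACGTACGTACGTACGT"

print(trim_pair(mate1, mate2, transposon))
print(trim_pair(mate1, "ACGTACGTACGTACGTACGT", transposon))  # None
```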

Parameters:

transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
bowtie_index_path (Path) – Path to the bowtie index.
bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
min_length (int) – Minimum length for genomic reads to be kept for alignment.
min_support (int) – Minimum support for insertions to be kept in the final output.
min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
threads (int) – The number of threads to use for the alignment.
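
The interplay between merge_distance and min_support can be sketched as below. This is one possible reading of the parameter descriptions, not pyim's actual algorithm. Several choices are assumptions: insertions are (position, support) pairs on a single chromosome and strand, each position is compared with the previous one in its group, a merged insertion keeps the position of its best-supported member, and support is summed. The function names are hypothetical.

```python
from typing import List, Tuple

Insertion = Tuple[int, int]  # (position, support)


def _summarize(group: List[Insertion]) -> Insertion:
    # Assumption: keep the best-supported position and sum the support.
    best_position = max(group, key=lambda item: item[1])[0]
    return best_position, sum(support for _, support in group)


def merge_insertions(insertions: List[Insertion],
                     merge_distance: int) -> List[Insertion]:
    """Merge insertions lying within merge_distance of their predecessor."""
    merged: List[Insertion] = []
    group: List[Insertion] = []
    for position, support in sorted(insertions):
        # Start a new group once the gap to the previous position is too large.
        if group and position - group[-1][0] > merge_distance:
            merged.append(_summarize(group))
            group = []
        group.append((position, support))
    if group:
        merged.append(_summarize(group))
    return merged


# Toy insertions on one chromosome and strand.
insertions = [(1000, 5), (1003, 2), (1500, 1), (2000, 3), (2004, 4)]

merged = merge_insertions(insertions, merge_distance=10)
print(merged)  # [(1000, 7), (1500, 1), (2004, 7)]

# Keep only insertions that reach the minimum support.
min_support = 2
kept = [ins for ins in merged if ins[1] >= min_support]
print(kept)  # [(1000, 7), (2004, 7)]
```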