API
Pipelines
ShearSplink
class pyim.align.pipelines.ShearSplinkPipeline(transposon_path, bowtie_index_path, linker_path=None, contaminant_path=None, min_length=15, min_support=2, min_mapq=23, merge_distance=None, bowtie_options=None, min_overlaps=None, error_rates=None)
ShearSplink pipeline.
Analyzes (single-end) sequencing data that was prepared using the ShearSplink protocol. Sequence reads are expected to have the following structure:
[Transposon][Genomic][Linker]
Here, transposon refers to the flanking part of the transposon sequence, linker to the flanking linker sequence and genomic to the genomic DNA located in between (which varies per insertion). The linker sequence is optional and may be omitted if the linker is not included in sequencing.
The pipeline essentially performs the following steps:
- If contaminants are provided, sequence reads are filtered (using Cutadapt) for the contaminant sequences.
- The remaining reads are trimmed to remove the transposon and linker sequences, leaving only genomic sequences. Reads without the transposon/linker sequences are dropped, as we cannot be certain of their origin. (Note that the linker is optional and is only trimmed if a linker is given).
- The genomic reads are aligned to the reference genome.
- The resulting alignment is used to identify insertions.
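The trimming step above can be pictured with a small, self-contained sketch. It is not pyim's implementation: the pipeline delegates trimming to Cutadapt, which tolerates mismatches, whereas this toy version uses exact string matching. The function name `extract_genomic` and the example sequences are hypothetical.

```python
def extract_genomic(read, transposon, linker=None, min_length=15):
    """Return the genomic part of a read, or None if the read is dropped.

    Toy illustration of the [Transposon][Genomic][Linker] read structure,
    using exact matching instead of Cutadapt.
    """
    # Reads that do not start with the transposon sequence are dropped.
    if not read.startswith(transposon):
        return None
    genomic = read[len(transposon):]

    # The linker is only trimmed (and required) if one is given.
    if linker is not None:
        position = genomic.find(linker)
        if position == -1:
            return None
        genomic = genomic[:position]

    # Genomic sequences shorter than min_length are not kept for alignment.
    return genomic if len(genomic) >= min_length else None


transposon = "TTAACC"                 # hypothetical flanking sequence
linker = "GGTTAA"                     # hypothetical linker sequence
genomic = "ACGTACGTACGTACGTACGT"      # 20 bases of genomic DNA
read = transposon + genomic + linker

print(extract_genomic(read, transposon, linker))
```

Reads that lack the transposon, lack a requested linker, or leave fewer than `min_length` genomic bases all come back as `None` in this sketch.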
Note that this pipeline does NOT support multiplexed datasets (which are the default output of the ShearSplink protocol). For multiplexed datasets, use the MultiplexedShearSplinkPipeline.
Parameters:
- transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
- bowtie_index_path (Path) – Path to the bowtie index.
- linker_path (Path) – Path to the linker sequence (fasta).
- contaminant_path (Path) – Path to file containing contaminant sequences (fasta). If provided, sequences are filtered for these sequences before extracting genomic sequences for alignment.
- min_length (int) – Minimum length for genomic reads to be kept for alignment.
- min_support (int) – Minimum support for insertions to be kept in the final output.
- min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
- merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
- bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
- min_overlaps (Dict[str, int]) – Minimum overlap required to recognize the transposon, linker and contaminant sequences (see the Cutadapt documentation for more information). Keys of the dictionary indicate to which sequence the overlap corresponds and should be one of the following: linker, transposon or contaminant.
- error_rates (Dict[str, float]) – Maximum error rate to use when recognizing transposon, linker and contaminant sequences (see the Cutadapt documentation for more information). Keys should be the same as for min_overlaps.
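As a rough illustration of how the parameters above fit together, the sketch below assembles constructor arguments for ShearSplinkPipeline. The file names, the index prefix and the numeric values chosen for min_overlaps and error_rates are placeholders, and `check_keys` is a hypothetical helper (this section does not say whether pyim validates the dictionary keys itself). Running the pipeline is not covered in this section, so the sketch stops at construction.

```python
from pathlib import Path

# Keys documented for min_overlaps and error_rates.
ALLOWED_KEYS = {"linker", "transposon", "contaminant"}


def check_keys(options, name):
    """Hypothetical helper: reject keys other than the documented ones."""
    unknown = set(options) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unexpected keys in {name}: {sorted(unknown)}")
    return options


pipeline_kwargs = dict(
    transposon_path=Path("transposon.fa"),        # placeholder file names
    bowtie_index_path=Path("reference/genome"),   # placeholder index prefix
    linker_path=Path("linker.fa"),
    contaminant_path=Path("contaminants.fa"),
    min_length=15,
    min_support=2,
    min_mapq=23,
    merge_distance=10,
    min_overlaps=check_keys({"linker": 10, "transposon": 10}, "min_overlaps"),
    error_rates=check_keys({"transposon": 0.1}, "error_rates"),
)


def build_pipeline(kwargs=pipeline_kwargs):
    """Construct the pipeline; requires pyim to be installed."""
    # Imported lazily so the rest of the sketch works without pyim.
    from pyim.align.pipelines import ShearSplinkPipeline
    return ShearSplinkPipeline(**kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to reuse the same settings for several samples or to log them alongside the results.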
class pyim.align.pipelines.MultiplexedShearSplinkPipeline(transposon_path, bowtie_index_path, barcode_path, barcode_mapping=None, linker_path=None, contaminant_path=None, min_length=15, min_support=2, min_mapq=23, merge_distance=0, bowtie_options=None, min_overlaps=None, error_rates=None)
ShearSplink pipeline supporting multiplexed reads.
Analyzes multiplexed (single-end) sequencing data that was prepared using the ShearSplink protocol. Sequence reads are expected to have the following structure:
[Barcode][Transposon][Genomic][Linker]
Here, the transposon, genomic and linker sequences are the same as for the ShearSplinkPipeline. The barcode sequence is an index that indicates which sample the read originated from.
Barcode sequences should be provided using the barcode_path argument. The optional barcode_mapping argument can be used to map barcodes to sample names.
Parameters:
- transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
- bowtie_index_path (Path) – Path to the bowtie index.
- barcode_path – Path to barcode sequences (fasta).
- barcode_mapping (Path) – Path to a tsv file specifying a mapping from barcodes to sample names. Should contain sample and barcode columns.
- linker_path (Path) – Path to the linker sequence (fasta).
- contaminant_path (Path) – Path to file containing contaminant sequences (fasta). If provided, sequences are filtered for these sequences before extracting genomic sequences for alignment.
- min_length (int) – Minimum length for genomic reads to be kept for alignment.
- min_support (int) – Minimum support for insertions to be kept in the final output.
- min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
- merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
- bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
- min_overlaps (Dict[str, int]) – Minimum overlap required to recognize the transposon, linker and contaminant sequences (see the Cutadapt documentation for more information). Keys of the dictionary indicate to which sequence the overlap corresponds and should be one of the following: linker, transposon or contaminant.
- error_rates (Dict[str, float]) – Maximum error rate to use when recognizing transposon, linker and contaminant sequences (see the Cutadapt documentation for more information). Keys should be the same as for min_overlaps.
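The role of the barcode and of the optional barcode mapping can be sketched as follows. This is a toy illustration, not pyim's code: it matches barcodes exactly at the start of the read, and it assumes that the barcode column of the tsv file holds the barcode names used in the barcode fasta file (the section above does not say whether names or sequences are expected). The helper names and example values are hypothetical.

```python
import csv
import io


def read_barcode_mapping(handle):
    """Read a tsv with 'sample' and 'barcode' columns into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["barcode"]: row["sample"] for row in reader}


def assign_sample(read, barcodes, mapping=None):
    """Return (sample, remainder) for a read of the form [Barcode][...].

    `barcodes` maps barcode names to sequences. Reads without a known
    barcode are returned with sample None and left untrimmed.
    """
    for name, sequence in barcodes.items():
        if read.startswith(sequence):
            # Without a mapping, the barcode name doubles as sample name.
            sample = mapping.get(name, name) if mapping else name
            return sample, read[len(sequence):]
    return None, read


# Hypothetical barcodes and mapping file contents.
barcodes = {"BC01": "ACGT", "BC02": "TGCA"}
mapping = read_barcode_mapping(io.StringIO("sample\tbarcode\nS1\tBC01\nS2\tBC02\n"))

print(assign_sample("ACGTTTAACC", barcodes, mapping))
```

After the barcode is removed, the remainder has the same [Transposon][Genomic][Linker] structure that the non-multiplexed pipeline expects.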
Nextera
class pyim.align.pipelines.NexteraPipeline(transposon_path, bowtie_index_path, bowtie_options=None, min_length=15, min_support=2, min_mapq=23, merge_distance=None, threads=1)
Nextera-based transposon pipeline.
Analyzes paired-end sequence data that was prepared using a Nextera-based protocol. Sequence reads are expected to have the following structure:
Mate 1: [Genomic]
Mate 2: [Transposon][Genomic]
Here, transposon refers to the flanking part of the transposon sequence and genomic refers to the genomic DNA located between the transposon sequence and the adapter sequence used. Note that the adapter itself is not sequenced and is therefore not part of the reads. However, Mate 1 is considered to terminate at the adapter, so its end represents the breakpoint between the genomic DNA and the adapter.
The pipeline essentially performs the following steps:
- Mates are trimmed to remove the transposon sequence, dropping any reads not containing the transposon.
- The remaining mates are trimmed to remove any sequences from the Nextera transposase.
- The trimmed mates are aligned to the reference genome.
- The resulting alignment is used to identify insertions.
Parameters:
- transposon_path (Path) – Path to the (flanking) transposon sequence (fasta).
- bowtie_index_path (Path) – Path to the bowtie index.
- bowtie_options (Dict[str, Any]) – Dictionary of extra options for Bowtie.
- min_length (int) – Minimum length for genomic reads to be kept for alignment.
- min_support (int) – Minimum support for insertions to be kept in the final output.
- min_mapq (int) – Minimum mapping quality of alignments to be used for identifying insertions.
- merge_distance (int) – Maximum distance within which insertions are merged. Used to merge insertions that occur within close vicinity, which is typically due to slight variations in alignments.
- threads (int) – The number of threads to use for the alignment.
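The merge_distance and min_support parameters, shared by all three pipelines, can be illustrated with a small sketch. It is an assumption-laden toy, not pyim's algorithm: the `Insertion` fields are hypothetical, merged insertions keep the position of the best supported member and sum the support, and the support filter is applied after merging. pyim may differ on each of these points.

```python
from collections import namedtuple

# Hypothetical record; pyim's own insertion type may look different.
Insertion = namedtuple("Insertion", ["chromosome", "position", "strand", "support"])


def merge_and_filter(insertions, merge_distance=None, min_support=2):
    """Merge nearby insertions, then drop those with too little support."""
    ordered = sorted(insertions, key=lambda i: (i.chromosome, i.strand, i.position))
    merged = []
    for ins in ordered:
        last = merged[-1] if merged else None
        is_close = (
            merge_distance is not None
            and last is not None
            and last.chromosome == ins.chromosome
            and last.strand == ins.strand
            and ins.position - last.position <= merge_distance
        )
        if is_close:
            # Keep the better supported position and add up the support.
            best = last if last.support >= ins.support else ins
            merged[-1] = best._replace(support=last.support + ins.support)
        else:
            merged.append(ins)
    return [ins for ins in merged if ins.support >= min_support]


example = [
    Insertion("1", 100, 1, 3),
    Insertion("1", 103, 1, 1),   # within 10 bp of the insertion above
    Insertion("1", 500, 1, 1),   # isolated and poorly supported
    Insertion("2", 100, -1, 2),
]

print(merge_and_filter(example, merge_distance=10, min_support=2))
```

With `merge_distance=None` no merging takes place in this sketch, so the insertion at position 103 is filtered out on its own rather than contributing its support to the one at position 100.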