Usage

Identifying insertions

The pyim-align command identifies insertions from sequence reads produced by targeted DNA sequencing of insertion sites. It provides access to several pipelines, each of which essentially performs the following steps:

Reads that lack the expected technical sequences (such as transposon sequences or required adapter sequences) are filtered out.

Reads are trimmed to remove any non-genomic sequences (including transposon/adapter sequences and any other technical sequences). Reads that are too short after trimming are removed from the analysis, to avoid issues during alignment.

The remaining (genomic) reads are aligned to the reference genome.

The resulting alignment is analyzed to identify the location and orientation of the corresponding insertion sites.

The exact implementation of these steps differs between pipelines and depends on the design of the sequencing experiment. For an overview of the available pipelines, see Pipelines.
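As a rough illustration of the filtering, trimming and length-check steps described above, the following Python sketch processes a few in-memory reads. It is not PyIM's implementation: the transposon and linker sequences, the MIN_LENGTH threshold and the extract_genomic helper are all hypothetical, and the sketch assumes that each read consists of the transposon sequence, then genomic DNA, then the linker, matched exactly.

```python
from typing import Optional

# Hypothetical technical sequences and length cutoff, chosen only for illustration.
TRANSPOSON = "TGTATGTAAAC"
LINKER = "GTCCCTTAAGCGG"
MIN_LENGTH = 15


def extract_genomic(read: str) -> Optional[str]:
    """Return the genomic part of a read, or None if the read is discarded."""
    # Filter: the read must start with the transposon sequence.
    if not read.startswith(TRANSPOSON):
        return None
    genomic = read[len(TRANSPOSON):]

    # Trim: drop the linker and everything after it, if the linker is present.
    linker_pos = genomic.find(LINKER)
    if linker_pos != -1:
        genomic = genomic[:linker_pos]

    # Length check: reads that are too short after trimming are removed.
    if len(genomic) < MIN_LENGTH:
        return None
    return genomic


reads = [
    TRANSPOSON + "ACGTACGTACGTACGTACGT" + LINKER,  # kept: 20 genomic bases
    "ACGTACGTACGTACGTACGTACGT",                    # dropped: no transposon sequence
    TRANSPOSON + "ACGT" + LINKER,                  # dropped: too short after trimming
]
genomic_reads = [g for g in map(extract_genomic, reads) if g is not None]
print(genomic_reads)
```

Only the first read survives in this toy example. In a real pipeline the remaining reads would then be passed to an aligner, and the matching would typically need to tolerate sequencing errors.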

An example of calling pyim-align using the shearsplink pipeline is as follows:

pyim-align shearsplink \
    --reads ./reads.fastq.gz \
    --bowtie_index /path/to/index \
    --output_dir ./out \
    --transposon /path/to/transposon.fa \
    --linker /path/to/linker.fa

This produces an insertions.txt file in the ./out directory, describing the identified insertions.

Merging/splitting datasets

The pyim-merge command can be used to merge different sets of insertions. This is mainly useful for combining insertions from multiple samples or from different sequencing datasets. The basic command is as follows:

pyim-merge --insertions ./sample1/insertions.txt \
                        ./sample2/insertions.txt \
           --output ./merged.txt

This command adds a sample column to the merged insertions if this column is not yet present in the source files. By default, the sample names are derived from the names of the folders containing the source files (sample1 and sample2 in this example). These names can be overridden using the --sample_names parameter.
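The default naming rule can be illustrated with a small Python sketch. This is only a sketch of the behaviour described above, not the code used by pyim-merge: the default_sample_names and merge_insertions helpers, the id and position columns, and the in-memory rows are hypothetical, and the check for an existing sample value is done per row for simplicity.

```python
from pathlib import Path


def default_sample_names(paths):
    """Derive one sample name per file from the name of its parent folder."""
    return [Path(p).parent.name for p in paths]


def merge_insertions(tables, sample_names):
    """Concatenate row lists, adding a sample value where none is present."""
    merged = []
    for rows, name in zip(tables, sample_names):
        for row in rows:
            # An existing sample value is kept; otherwise the given name is used.
            merged.append({**row, "sample": row.get("sample", name)})
    return merged


paths = ["./sample1/insertions.txt", "./sample2/insertions.txt"]
names = default_sample_names(paths)

# Hypothetical in-memory rows standing in for the contents of the two files.
tables = [
    [{"id": "INS_1", "position": 1200}],
    [{"id": "INS_1", "position": 98000, "sample": "custom"}],
]
merged = merge_insertions(tables, names)
print(names)
print([row["sample"] for row in merged])
```

Note that both files may reuse the same insertion identifier, which is one reason a sample column is useful after merging.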

Conversely, the pyim-split command can be used to split a merged insertion file (containing multiple samples) into separate insertion files for each sample. The basic command is as follows:

pyim-split --insertions ./merged.txt \
           --output_dir ./out

A specific subset of samples can be extracted using the --samples argument.
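Conceptually, splitting groups the merged rows by their sample column and optionally keeps only the requested samples. The Python sketch below shows that grouping on in-memory rows. The split_by_sample helper, the id column and the example rows are hypothetical, and nothing here reflects how pyim-split reads or writes its files.

```python
from collections import defaultdict


def split_by_sample(rows, samples=None):
    """Group rows by their sample value, optionally keeping only some samples."""
    groups = defaultdict(list)
    for row in rows:
        if samples is None or row["sample"] in samples:
            groups[row["sample"]].append(row)
    return dict(groups)


# Hypothetical merged rows; only the sample column is taken from the text above.
merged = [
    {"id": "INS_1", "sample": "sample1"},
    {"id": "INS_2", "sample": "sample2"},
    {"id": "INS_3", "sample": "sample1"},
]

all_groups = split_by_sample(merged)
subset = split_by_sample(merged, samples={"sample2"})
print(sorted(all_groups))
print(subset)
```

Each group would correspond to one per-sample output file, and the samples argument plays the role of the subset selection.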

Annotating insertions

pyim-annotate window --insertions ./out/insertions.txt \
                     --output ./out/insertions.ann.txt \
                     --gtf reference.gtf \
                     --window_size 20000

Identifying CISs

TODO