=====
Usage
=====

Identifying insertions
----------------------

The **pyim-align** command is used to identify insertions using sequence reads
from targeted DNA-sequencing of insertion sites. The command provides access
to various pipelines which (in essence) perform the following functions:

    - Reads are filtered to remove reads that do not contain the correct
      technical sequences (such as transposon sequences or required adapter
      sequences).
    - Reads are trimmed to remove any non-genomic sequences (including
      transposon/adapter sequences and any other technical sequences). Reads
      that are too short after trimming are removed from the analysis, to
      avoid issues during alignment.
    - The remaining (genomic) reads are aligned to the reference genome.
    - The resulting alignment is analyzed to identify the location and
      orientation of the corresponding insertion sites.

The exact implementation of these steps differs between pipelines and depends
on the design of the sequencing experiment. For an overview of the available
pipelines, see :ref:`api_pipelines`.

An example of calling ``pyim-align`` using the ``shearsplink`` pipeline is
as follows:

.. code-block:: bash

    pyim-align shearsplink
        --reads ./reads.fastq.gz
        --bowtie_index /path/to/index
        --output_dir ./out
        --transposon /path/to/transposon.fa
        --linker /path/to/linker.fa

This produces an ``insertions.txt`` file in the ``./out`` directory,
describing the identified insertions.

Merging/splitting datasets
--------------------------

The **pyim-merge** command can be used to merge different sets of insertions.
This is mainly useful for combining insertions from multiple samples or from
different sequencing datasets. The basic command is as follows:

.. code-block:: bash

    pyim-merge --insertions ./sample1/insertions.txt \
                            ./sample2/insertions.txt \
               --output ./merged.txt

This command adds an additional ``sample`` column to the merged insertions
if this column is not yet present in the source files. By default, the used
sample names are derived from the names of the folders containing the
source files (``sample1`` and ``sample2`` in this example). These names can be
overridden using the ``--sample_names`` parameter.

Alternatively, the **pyim-split** command can be used to split a merged
insertion file (containing multiple samples) to obtain separate insertion
files for each sample. The basic command is as follows:

.. code-block:: bash

    pyim-split --insertions ./merged.txt \
               --output_dir ./out

A specific subset of samples can be extracted using the ``--samples`` argument.

Annotating insertions
---------------------

.. code-block:: bash

    pyim-annotate window --insertions ./out/insertions.txt
                         --output ./out/insertions.ann.txt
                         --gtf reference.gtf
                         --window_size 20000

Identifying CISs
----------------

TODO