Configuration
The pipeline is configured using two config files, samples.tsv and config.yaml. The samples.tsv file defines the input samples, together with the paths to the respective source files. The config.yaml file provides detailed configuration options for the different steps of the pipeline.
Sample definition
The sample definition file is a tab-separated file that lists the samples to be processed by the pipeline. Each row represents a single set of (paired-end) fastq files for a given sample on a given lane. As such, samples that have been sequenced on multiple lanes will typically span multiple rows that share the same sample ID.
For example, a single sample sequenced over two lanes would be described as follows:
sample lane fastq1 fastq2
S1 L001 S1.L001.R1.fastq.gz S1.L001.R2.fastq.gz
S1 L002 S1.L002.R1.fastq.gz S1.L002.R2.fastq.gz
The fastq1 and fastq2 columns should contain the paths to the first and second fastq file of each read pair. These paths can be provided either as local relative/absolute paths or as remote HTTP/FTP URLs. Note that relative file paths are taken relative to the input directory defined in the configuration file (see below for more details), if specified. For single-end data, the fastq2 column should be omitted.
The lane column is used to distinguish sequencing data from the same sample that has been sequenced in different lanes. This column can be filled with dummy values (e.g. L999) if lane information is not available and samples were sequenced on a single lane.
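To illustrate how rows that share a sample ID relate to each other, the sketch below reads a sample sheet modelled on the two-lane example and groups its rows by sample. It is a minimal illustration, not the pipeline's own code: the function name `group_by_sample` and the inline `SAMPLES_TSV` string are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Example sample sheet for one sample sequenced over two lanes.
SAMPLES_TSV = (
    "sample\tlane\tfastq1\tfastq2\n"
    "S1\tL001\tS1.L001.R1.fastq.gz\tS1.L001.R2.fastq.gz\n"
    "S1\tL002\tS1.L002.R1.fastq.gz\tS1.L002.R2.fastq.gz\n"
)


def group_by_sample(handle):
    """Map each sample ID to a list of (lane, fastq paths) tuples.

    Hypothetical helper, used only to illustrate the file layout.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        # fastq2 is absent for single-end data, so keep only columns with a value.
        fastqs = [row[col] for col in ("fastq1", "fastq2") if row.get(col)]
        grouped[row["sample"]].append((row["lane"], fastqs))
    return dict(grouped)


lanes = group_by_sample(io.StringIO(SAMPLES_TSV))
print(lanes)
```

The same function accepts a single-end sheet without a fastq2 column, in which case each row yields a single path.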
Pipeline configuration
The individual steps of the pipeline are configured using the config.yaml file. This config file contains two different sections, which define configurations for the inputs and for each specific rule, respectively.
Input options
The input section defines several options regarding the handling of the input files:
################################################################################
# Input configuration #
################################################################################
input:
  # Optional: input directory to use for fastq files (for local input files).
  dir: 'input'

  # Optional: configuration to use for remote (FTP) input files.
  # ftp:
  #   username: 'user'
  #   password: 'pass'
Here, dir is an optional value that defines the directory containing the input files. If given, file paths provided in samples.tsv are sought relative to this directory. Its value is ignored if HTTP/FTP URLs are used for the inputs.
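The path handling described here can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's implementation: the helper name `resolve_input` is hypothetical, absolute local paths are assumed to be left unchanged, and `https` is assumed to be treated like `http`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Schemes treated as remote inputs (https is an assumption of this sketch).
REMOTE_SCHEMES = {"http", "https", "ftp"}


def resolve_input(path, input_dir=None):
    """Return the location to read `path` from, given an optional input dir."""
    if urlparse(path).scheme in REMOTE_SCHEMES:
        # Remote URL: the input directory is ignored.
        return path
    candidate = PurePosixPath(path)
    if input_dir is None or candidate.is_absolute():
        # No input directory configured, or an absolute path (assumed untouched).
        return str(candidate)
    # Relative local path: look it up inside the input directory.
    return str(PurePosixPath(input_dir) / candidate)


print(resolve_input("S1.L001.R1.fastq.gz", "input"))
print(resolve_input("ftp://example.org/S1.L001.R1.fastq.gz", "input"))
```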
The ftp section defines the username/password to use when downloading samples over an FTP connection. These values can be omitted when downloading files from an anonymous FTP server.
Rule options
The rule section provides detailed configuration options for the different rules of the workflow. In general, each rule has a set of configurable options listed under the name of the rule itself. The options themselves are specific to each rule and the corresponding tool, but each rule typically has an extra option, which allows you to pass arbitrary arguments to the underlying tool.
################################################################################
# Rule configuration #
################################################################################
#### General/shared rules ####

# Single-end config for cutadapt.
cutadapt_se:
  extra: >-
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    --minimum-length 20

# Paired-end config for cutadapt.
cutadapt_pe:
  extra: >-
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    -q 20
    --minimum-length 20

star:
  index: '/path/to/index'
  # Host genome index, only required for PDX processing.
  # index_host: '/path/to/host/index'
  readgroup: >-
    ID:{sample}.{lane} SM:{sample} LB:{sample}
    PU:{sample}.{lane} PL:ILLUMINA CN:CENTRE
  extra: ""
  threads: 10

sambamba_sort:
  extra: ''
  threads: 10

samtools_merge:
  extra: ''
  threads: 10

feature_counts:
  annotation: '/path/to/gtf'
  extra: ''
  threads: 5

multiqc:
  extra: ''

fastqc:
  extra: ''

#### PDX-specific rules ####

disambiguate:
  extra: ''

sambamba_sort:
  extra: ''
  threads: 5
Note that this section is divided into two sub-sections: general and PDX-specific. The PDX-specific section contains additional options for rules that are only used in the PDX workflow (see below).
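As a rough illustration of how the extra and readgroup options might be consumed, the sketch below appends a rule's extra arguments to a command and fills the read group placeholders for one sample and lane. It rests on assumptions rather than on the pipeline's code: the helper `with_extra` is hypothetical, and filling `{sample}` and `{lane}` with Python's `str.format` is assumed for the example. The strings are written as they read after YAML folding, since `>-` joins the lines with single spaces.

```python
import shlex

# Option values as they would read after YAML folding of the `>-` blocks.
config = {
    "cutadapt_se": {
        "extra": "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 20"
    },
    "star": {
        "readgroup": (
            "ID:{sample}.{lane} SM:{sample} LB:{sample} "
            "PU:{sample}.{lane} PL:ILLUMINA CN:CENTRE"
        )
    },
}


def with_extra(base_args, rule_config):
    """Append the rule's `extra` arguments to a base argument list.

    Hypothetical helper: an empty or missing `extra` adds nothing.
    """
    return list(base_args) + shlex.split(rule_config.get("extra") or "")


# Only the user-supplied arguments are shown; input and output arguments are
# left out because how the pipeline passes them is not covered here.
args = with_extra(["cutadapt"], config["cutadapt_se"])

# Fill the read group placeholders for one row of the sample sheet.
readgroup = config["star"]["readgroup"].format(sample="S1", lane="L001")

print(args)
print(readgroup)
```

Splitting extra with `shlex.split` keeps quoted arguments together, which matters if a value passed through extra contains spaces.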
Standard vs PDX mode
The PDX workflow is triggered if a host genome index is supplied via the index_host option in the configuration for star. If this option is omitted or left empty, the standard workflow is executed.
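The switch between the two modes can be expressed as a simple check on the loaded configuration. The function below is a minimal sketch with a hypothetical name (`is_pdx`), assuming the configuration has been loaded into a plain dictionary; it is not taken from the pipeline itself.

```python
def is_pdx(config):
    """Return True if a non-empty host index is configured for star."""
    star_config = config.get("star") or {}
    # A missing key, None, or an empty string all select the standard workflow.
    return bool(star_config.get("index_host"))


standard = {"star": {"index": "/path/to/index"}}
pdx = {"star": {"index": "/path/to/index", "index_host": "/path/to/host/index"}}

print(is_pdx(standard))
print(is_pdx(pdx))
```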