Configuration
The pipeline is configured using two config files, samples.tsv and config.yaml. The samples.tsv file defines the input samples, together with the paths to the respective source files. The config.yaml file provides detailed configuration options for the different steps of the pipeline.
Sample definition
The sample definition file is a tab-separated file that lists the samples to be processed by the pipeline. Each row represents a single set of (paired-end) fastq files for a given sample on a given lane. As such, samples that have been sequenced on multiple lanes will typically span multiple rows that share the same sample ID.
For example, a single sample sequenced over two lanes would be described as follows:
sample lane fastq1 fastq2
S1 L001 S1.L001.R1.fastq.gz S1.L001.R2.fastq.gz
S1 L002 S1.L002.R1.fastq.gz S1.L002.R2.fastq.gz
The fastq1 and fastq2 columns should contain the paths to the two input files of each read pair, which are expected to be fastq files. These paths can be provided either as local relative/absolute paths or as remote http/ftp URLs. Note that relative file paths are taken relative to the input directory defined in the configuration file (see below for more details), if specified.
The lane column is used to distinguish sequencing data from the same sample that has been sequenced in different lanes. This column can be filled with dummy values (e.g. L999) if lane information is not available and samples were sequenced on a single lane.
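As a rough illustration of the row-per-lane layout described above, the following sketch groups the rows of a sample definition file by sample ID. It uses only the Python standard library; the function name group_lanes_by_sample and the in-memory example are assumptions made for illustration and are not part of the pipeline itself.

```python
import csv
import io
from collections import defaultdict


def group_lanes_by_sample(handle):
    """Map each sample ID to its list of (lane, fastq1, fastq2) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = defaultdict(list)
    for row in reader:
        samples[row["sample"]].append(
            (row["lane"], row["fastq1"], row["fastq2"])
        )
    return dict(samples)


# Same content as the two-lane example above, with explicit tab separators.
EXAMPLE_TSV = (
    "sample\tlane\tfastq1\tfastq2\n"
    "S1\tL001\tS1.L001.R1.fastq.gz\tS1.L001.R2.fastq.gz\n"
    "S1\tL002\tS1.L002.R1.fastq.gz\tS1.L002.R2.fastq.gz\n"
)

lanes = group_lanes_by_sample(io.StringIO(EXAMPLE_TSV))
for sample, rows in lanes.items():
    print(sample, [lane for lane, _, _ in rows])
```

Here both rows end up under the single sample ID S1, one entry per lane.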
Pipeline configuration
The individual steps of the pipeline are configured using the config.yaml file. This config file contains three different sections, which define general pipeline options, options for the inputs and options for each specific rule, respectively.
General options
The general section defines options regarding the high-level behavior of the pipeline. It currently defines a single option, pdx, which determines whether the ‘standard’ or PDX version of the workflow is run:
################################################################################
# Pipeline options #
################################################################################
options:
  pdx: False
Note that a host index should be supplied for the bwa rule (using the index_host option, see rule options below for more details) when running the PDX workflow.
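The dependency between pdx and index_host can be expressed as a small sanity check. This is a minimal sketch, assuming the configuration has already been loaded into a nested dictionary with the key layout shown in the examples on this page; the function name check_pdx_config is hypothetical and the pipeline may validate its configuration differently (or not at all).

```python
def check_pdx_config(config):
    """Return the pdx flag, raising if PDX mode lacks a host index.

    Assumes 'options' and 'bwa' are top-level keys of the loaded config.
    """
    pdx = bool(config.get("options", {}).get("pdx", False))
    index_host = config.get("bwa", {}).get("index_host")
    if pdx and not index_host:
        raise ValueError(
            "options.pdx is enabled, but bwa.index_host is not set"
        )
    return pdx


# Example: a PDX configuration that provides both indices.
example_config = {
    "options": {"pdx": True},
    "bwa": {"index": "/path/to/index", "index_host": "/path/to/host_index"},
}
print(check_pdx_config(example_config))
```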
Input options
The input section defines several options regarding the handling of the input files:
################################################################################
# Input configuration #
################################################################################
input:
  # Optional: input directory to use for fastq files (for local input files).
  dir: 'input'
  # Optional: configuration to use for remote (FTP) input files.
  # ftp:
  #   username: 'user'
  #   password: 'pass'
Here, dir is an optional value that defines the directory containing the input files. If given, file paths provided in samples.tsv are resolved relative to this directory. Its value is ignored if http/ftp URLs are used for the inputs.
The ftp section defines the username/password to use when downloading samples over an ftp connection. These values can be omitted when downloading files from an anonymous ftp server.
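The interaction between the dir option and the paths in samples.tsv can be sketched as follows. This is an illustrative approximation rather than the pipeline's actual code: the function name resolve_input_path is hypothetical, treating https like http is an assumption, and leaving absolute local paths unchanged is inferred from the statement that only relative paths are taken relative to the input directory.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: https is handled like the http/ftp URLs mentioned above.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def resolve_input_path(path, input_dir=None):
    """Resolve a samples.tsv path against the optional input directory."""
    if urlparse(path).scheme in REMOTE_SCHEMES:
        return path  # remote URL: the input directory is ignored
    if input_dir is None or PurePosixPath(path).is_absolute():
        return path  # no input directory given, or already absolute
    return str(PurePosixPath(input_dir) / path)


print(resolve_input_path("S1.L001.R1.fastq.gz", "input"))
print(resolve_input_path("ftp://example.org/S1.L001.R1.fastq.gz", "input"))
```

The example.org URL is a placeholder used only to show that remote paths pass through unchanged.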
Rule options
The rule section provides detailed configuration options for the different rules of the workflow. In general, each rule has a set of configurable options under the same name as the rule itself. The options themselves are specific to each rule and the corresponding tool, but each rule typically has an extra option, which allows you to pass arbitrary arguments to the underlying tool.
################################################################################
# Rule configuration #
################################################################################
#### General/shared rules ####
cutadapt:
  # Example trims Illumina Truseq PE adapters, removes low quality bases and
  # drops any mates with either read shorter than 60bp after trimming.
  extra: >-
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    -q 20
    --minimum-length 60
fastqc:
  extra: ''
bwa:
  index: '/path/to/index'
  # Host genome index, only required for PDX samples. If used,
  # index (above) should refer to the graft genome index.
  index_host: '/path/to/host_index'
  # Read group information (which is critical for many downstream steps).
  # Replace 'CENTRE' with the name of your company/institute.
  readgroup: "\"@RG\\tID:{sample}.{lane}\\tSM:{sample}\\tLB:{sample}\\t\
    PU:{sample}.{lane}\\tPL:ILLUMINA\\tCN:CENTRE\""
  extra: '-M'
  sort_extra: ''
  threads: 5
samtools_merge:
  extra: ''
  threads: 5
picard_mark_duplicates:
  extra: >-
    -XX:ParallelGCThreads=5
    VALIDATION_STRINGENCY=LENIENT
picard_collect_hs_metrics:
  reference: 'input/genome.fa'
  bait_intervals: 'input/capture.intervals'
  target_intervals: 'input/capture.intervals'
  extra: '-Xmx4g'
multiqc:
  extra: ''
#### PDX-specific rules ####
disambiguate:
  extra: ''
sambamba_sort:
  extra: ''
  threads: 5
Note that this section is divided into two sub-sections: general and PDX-specific. The PDX-specific section contains additional options for rules that are only used in the PDX workflow.
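The readgroup value in the bwa section is the least obvious entry in the example above, so the sketch below shows how such a template might expand for a single row of samples.tsv. It assumes that the {sample} and {lane} placeholders are filled using Python-style string formatting, which is not confirmed by this page; the names READGROUP_TEMPLATE and format_readgroup are hypothetical. The template is written as it would look after YAML parsing, where the escaped quotes become literal double quotes and each \\t becomes a literal backslash followed by t.

```python
# Read group template as it would appear once the YAML double-quoted
# string has been parsed: literal quotes and literal backslash-t pairs.
READGROUP_TEMPLATE = (
    r'"@RG\tID:{sample}.{lane}\tSM:{sample}\tLB:{sample}'
    r'\tPU:{sample}.{lane}\tPL:ILLUMINA\tCN:CENTRE"'
)


def format_readgroup(template, sample, lane):
    """Fill the sample and lane placeholders of a read group template."""
    return template.format(sample=sample, lane=lane)


# Expand the template for the first row of the example samples.tsv.
readgroup = format_readgroup(READGROUP_TEMPLATE, "S1", "L001")
print(readgroup)
```

Because the ID and PU fields combine the sample and lane values, each row of samples.tsv yields a distinct read group, while SM stays identical across the lanes of one sample.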