Configuration

The pipeline is configured using two config files, samples.tsv and config.yaml. The samples.tsv file defines the input samples, together with the paths to the respective source files. The config.yaml file provides detailed configuration options for the different steps of the pipeline.

Sample definition

The sample definition file is a tab-separated file that lists the samples to be processed by the pipeline. Each row represents a single set of (paired-end) fastq files for a given sample on a given lane. As such, samples that have been sequenced on multiple lanes will typically span multiple rows that share the same sample ID.

For example, a single sample sequenced over two lanes would be described as follows:

sample	lane	fastq1	fastq2
S1	L001	S1.L001.R1.fastq.gz	S1.L001.R2.fastq.gz
S1	L002	S1.L002.R1.fastq.gz	S1.L002.R2.fastq.gz

The fastq1 and fastq2 columns should contain the paths to the two files of each read pair, which are expected to be fastq files. These paths can be provided either as local relative/absolute paths or as remote HTTP/FTP URLs. Note that relative file paths are resolved against the input directory defined in the configuration file, if one is specified (see below for more details).

The lane column is used to distinguish sequencing data from the same sample that has been sequenced in different lanes. This column can be filled with dummy values (e.g. L999) if lane information is not available and samples were sequenced on a single lane.
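As an illustration of how rows that share a sample ID belong together, the following minimal sketch reads the example table above and groups its rows per sample. The function name read_samples is hypothetical and is not part of the pipeline; the sketch only assumes the four column names shown in the example.

```python
import csv
import io
from collections import defaultdict


def read_samples(handle):
    """Group the rows of a samples.tsv-style file by sample ID.

    Hypothetical helper for illustration; returns a dict mapping each
    sample ID to a list of (lane, fastq1, fastq2) tuples.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    samples = defaultdict(list)
    for row in reader:
        samples[row["sample"]].append(
            (row["lane"], row["fastq1"], row["fastq2"]))
    return dict(samples)


# The example table from above, inlined so that the sketch is self-contained.
EXAMPLE = (
    "sample\tlane\tfastq1\tfastq2\n"
    "S1\tL001\tS1.L001.R1.fastq.gz\tS1.L001.R2.fastq.gz\n"
    "S1\tL002\tS1.L002.R1.fastq.gz\tS1.L002.R2.fastq.gz\n"
)

samples = read_samples(io.StringIO(EXAMPLE))
for sample, units in samples.items():
    # Prints: S1 ['L001', 'L002']
    print(sample, [lane for lane, _, _ in units])
```

Here the two rows for S1 end up under a single key, which mirrors the idea that data from multiple lanes belongs to the same sample.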

Pipeline configuration

The individual steps of the pipeline are configured using the config.yaml file. This config file contains three different sections, which define general pipeline options, options for the inputs and options for each specific rule, respectively.

General options

The general section defines options that control the high-level behavior of the pipeline. The section currently defines a single option, pdx, which determines whether the ‘standard’ or the PDX version of the workflow is run:

################################################################################
# Pipeline options                                                             #
################################################################################

options:
  pdx: False

Note that a host index should be supplied for the bwa rule (using the index_host option, see rule options below for more details) when running the PDX workflow.
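The dependency between the pdx option and the host index can be expressed as a simple consistency check. The sketch below is illustrative only: check_pdx_config is a hypothetical name, and the configuration is represented as a plain dictionary that mirrors the layout of the config.yaml examples in this document rather than being loaded from disk.

```python
def check_pdx_config(config):
    """Return True if the PDX workflow is requested.

    Hypothetical helper: raises ValueError if pdx is enabled but no
    host index is configured for the bwa rule.
    """
    pdx = bool(config.get("options", {}).get("pdx", False))
    index_host = config.get("bwa", {}).get("index_host")
    if pdx and not index_host:
        raise ValueError(
            "options.pdx is enabled but bwa.index_host is not set")
    return pdx


# Dictionary standing in for a parsed config.yaml (placeholder paths).
config = {
    "options": {"pdx": True},
    "bwa": {
        "index": "/path/to/index",
        "index_host": "/path/to/host_index",
    },
}

print(check_pdx_config(config))
```

Running such a check before starting the workflow would surface a missing host index early instead of partway through alignment.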

Input options

The input section defines several options regarding the handling of the input files:

################################################################################
# Input configuration                                                          #
################################################################################

input:
  # Optional: input directory to use for fastq files (for local input files).
  dir: 'input'

  # Optional: configuration to use for remote (FTP) input files.
  # ftp:
  #   username: 'user'
  #   password: 'pass'

Here, dir is an optional value that defines the directory containing the input files. If given, relative file paths provided in samples.tsv are resolved against this directory. Its value is ignored if HTTP/FTP URLs are used for the inputs.

The ftp section defines the username/password to use when downloading samples over an FTP connection. These values can be omitted when downloading files from an anonymous FTP server.
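The rules for interpreting paths can be summarized in a few lines of code. The following is a minimal sketch under stated assumptions, not the pipeline's actual implementation: resolve_input is a hypothetical name, https is included alongside http and ftp as an assumption, and absolute local paths are assumed to be used unchanged.

```python
import os.path
from urllib.parse import urlparse

# Schemes treated as remote inputs; 'https' is an assumption, the text
# above only mentions http and ftp.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def resolve_input(path, input_dir=None):
    """Resolve a fastq1/fastq2 entry to a local path or a remote URL."""
    if urlparse(path).scheme in REMOTE_SCHEMES:
        # Remote inputs are returned as-is; the input directory is ignored.
        return path
    if input_dir and not os.path.isabs(path):
        # Relative local paths are looked up inside the input directory.
        return os.path.join(input_dir, path)
    # Absolute paths, or relative paths without an input directory.
    return path


print(resolve_input("S1.L001.R1.fastq.gz", input_dir="input"))
print(resolve_input("ftp://example.org/S1.L001.R1.fastq.gz", input_dir="input"))
```

The first call yields a path inside the input directory, whereas the second returns the URL untouched (example.org is a placeholder host).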

Rule options

The rule section provides detailed configuration options for the different rules of the workflow. In general, each rule has a set of configurable options listed under the name of the rule itself. The options are specific to each rule and its underlying tool, but most rules have an extra option, which allows you to pass arbitrary arguments to the underlying tool.

################################################################################
# Rule configuration                                                           #
################################################################################

#### General/shared rules ####

cutadapt:
  # Example trims Illumina TruSeq PE adapters, removes low-quality bases and
  # drops any pair in which either read is shorter than 60bp after trimming.
  extra: >-
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    -q 20
    --minimum-length 60

fastqc:
  extra: ''

bwa:
  index: '/path/to/index'
  # Host genome index, only required for PDX samples. If used,
  # index (above) should refer to the graft genome index.
  index_host: '/path/to/host_index'
  # Read group information (which is critical for many downstream steps).
  # Replace 'CENTRE' with the name of your company/institute.
  readgroup: "\"@RG\\tID:{sample}.{lane}\\tSM:{sample}\\tLB:{sample}\\t\
              PU:{sample}.{lane}\\tPL:ILLUMINA\\tCN:CENTRE\""
  extra: '-M'
  sort_extra: ''
  threads: 5

samtools_merge:
  extra: ''
  threads: 5

picard_mark_duplicates:
  extra: >-
    -XX:ParallelGCThreads=5
    VALIDATION_STRINGENCY=LENIENT

picard_collect_hs_metrics:
  reference: 'input/genome.fa'
  bait_intervals: 'input/capture.intervals'
  target_intervals: 'input/capture.intervals'
  extra: '-Xmx4g'

multiqc:
  extra: ''


#### PDX-specific rules ####

disambiguate:
  extra: ''

sambamba_sort:
  extra: ''
  threads: 5

Note that this section is divided into two sub-sections: general and PDX-specific. The PDX-specific section contains additional options for rules that are only used in the PDX workflow.
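To make the role of the extra option more concrete, the sketch below shows one way such a string could be combined with a tool name and input files to form a command line. This is an assumption about the general mechanism, not a description of how the pipeline's rules are implemented; build_command is a hypothetical helper and the result is not a complete cutadapt invocation. Note that the YAML '>-' style used in the examples folds the indented lines into a single space-separated string, which is the form used here.

```python
import shlex


def build_command(tool, rule_config, inputs):
    """Assemble an argument list from a rule's 'extra' string.

    Hypothetical helper: splits the extra string with shell-like rules
    and places the resulting arguments between the tool name and inputs.
    """
    extra = shlex.split(rule_config.get("extra", ""))
    return [tool, *extra, *inputs]


# Shortened version of the cutadapt example, after YAML folding.
cutadapt_config = {"extra": "-q 20 --minimum-length 60"}

command = build_command(
    "cutadapt", cutadapt_config, ["R1.fastq.gz", "R2.fastq.gz"])
print(" ".join(shlex.quote(arg) for arg in command))
```

An empty extra string, as used for several rules above, simply contributes no arguments.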