Overview
========

The workflow can be run in two modes: the 'standard' workflow (for normal
sequencing data) and a 'PDX' workflow, which handles host/graft read
separation for patient-derived xenograft samples.

Standard workflow
-----------------

The standard (non-PDX) workflow performs the following steps:

* Cutadapt is used to trim the input reads for adapters and/or poor-quality
  base calls.
* The trimmed reads are aligned to the reference genome using STAR.
* The resulting alignments are sorted and indexed using sambamba.
* featureCounts is used to generate gene expression counts.
* The per-sample counts are merged into a single count file.
* The merged counts are normalized for differences in sequencing depth (using
  DESeq's median-of-ratios approach) and log-transformed.

QC statistics are generated using fastqc and samtools stats. The stats are
summarized into a single report using multiqc.

Altogether, this results in the following dependency graph:

.. figure:: images/dag.svg
  :align: center

PDX workflow
------------

The PDX workflow is a slightly modified version of the standard workflow. It
aligns the reads to two reference genomes (the host and graft reference
genomes) and uses disambiguate_ to remove sequences originating from the host
organism. For typical PDX samples, this means removing the mouse (host) reads,
leaving the human (graft) reads for further analysis.

The PDX workflow adds the following steps:

* The reads are aligned to the two references in ``star_host`` and
  ``star_graft``.
* The resulting alignments are sorted by queryname using samtools and
  subsequently 'disambiguated' using the disambiguate tool from AstraZeneca.
* The disambiguated alignments are sorted by coordinate using sambamba.

This results in the following dependency graph:

.. figure:: images/dag_pdx.svg
  :align: center

.. _disambiguate: https://github.com/AstraZeneca-NGS/disambiguate
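
To make the normalization step of the standard workflow more concrete, the following is a minimal sketch of median-of-ratios size-factor estimation followed by a log transformation. It is an illustration only, not the workflow's actual implementation: the function names (``size_factors``, ``normalize``), the use of ``log2`` and the pseudocount of 1 are all assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample with the median-of-ratios approach.

    `counts` is a list of per-gene rows, each a list of per-sample raw counts.
    """
    # Only genes with a nonzero count in every sample have a finite log.
    log_rows = [
        [math.log(c) for c in row]
        for row in counts
        if all(c > 0 for c in row)
    ]
    if not log_rows:
        raise ValueError("no gene has nonzero counts in every sample")

    # Log ratio of each count to its gene's geometric mean across samples.
    log_ratios = [[v - sum(row) / len(row) for v in row] for row in log_rows]

    # The size factor of a sample is the median of its ratios over all genes.
    n_samples = len(log_rows[0])
    return [
        math.exp(median(row[j] for row in log_ratios))
        for j in range(n_samples)
    ]


def normalize(counts, pseudocount=1.0):
    """Divide counts by the size factors and log-transform them.

    The log2 base and the pseudocount are illustrative assumptions.
    """
    factors = size_factors(counts)
    return [
        [math.log2(c / f + pseudocount) for c, f in zip(row, factors)]
        for row in counts
    ]
```

In this sketch, a sample sequenced at twice the depth of another receives a size factor twice as large, so that the normalized values of the two samples become comparable.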