License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: In development
We are still developing and optimizing this protocol
The opinions expressed here do not necessarily reflect the opinions of the Centers for Disease Control and Prevention or the institutions with which the authors are affiliated. The protocol content here is under development and is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.
Abstract
This protocol covers the process of using Titan on the command-line, which was developed by Robert Petit at Wyoming Public Heath Laboratories. Some laboratories have the computing infrastructure and capacity to conduct local bioinformatic analyses and wish to use the Titan workflow, but not on the Terra platform. Local implementation of Titan-gc eliminates bottlenecks such as data upload to the cloud as well as offers users the ability to incorporate the process into scripts if they desire. Provided steps include how to install dependencies via Conda, create input files for the workflow, and run the pipeline directly or using a Nextflow wrapper. Basic troubleshooting tips are also discussed.
For technical assistance, please contact: TOAST@cdc.gov
Set up conda environment
Set up conda environment
The Titan workflow and its dependencies can be installed using the Conda package manager and the Bioconda channel for bioinformatics software. To install Conda, follow the instructions in Step 1.1. To update an existing Conda installation, go to Step 1.2. To add the Bioconda channel, go to Step 1.3.
This creates a new environment named titan-cli in which to install the titan-gc software. This Conda environment can be activated and deactivated as needed. When active, the environment name appears in parentheses to the left of the prompt. When deactivated, the normal prompt returns. For example:
Miniconda3 is implemented in Python 3 and the default versions of Python 3 installed in new environments depends on the installer. However, both Python 2.x and Python 3.x environments can still be installed with explicit parameter settings.
The input file is created using the included titan-gc-prepare.py script in the Conda environment:
Command
(titan-cli) [me@PC ~]$ titan-gc-prepare.py
usage: titan-gc-prepare [-h] [-f STR] [--fastq_separator STR]
[--fastq_pattern STR] [--pe1_pattern STR]
[--pe2_pattern STR] [-r] [--prefix STR] [--tsv]
[--pangolin_docker STR] [--params STR]
FASTQ_PATH WORKFLOW PRIMER
titan-gc-prepare - Read a directory and prepare a JSON for input to Titan GC
optional arguments:
-h, --help show this help message and exit
Titan-GC Prepare Parameters:
FASTQ_PATH Directory where FASTQ files are stored
WORKFLOW The Titan-GC workflow to use for analysis. Options:
clearlabs, illumina_pe, illumina_se, ont
PRIMER A file containing primers (bed format) used during
sequencing.
-f STR, --fastq_ext STR
Extension of the FASTQs. Default: .fastq.gz
--fastq_separator STR
Split FASTQ name on the last occurrence of the
separator. Default: _
--fastq_pattern STR Glob pattern to match FASTQs. Default: *.fastq.gz
--pe1_pattern STR Designates difference first set of paired-end reads.
Default: ([Aa]|[Rr]1|1) (R1, r1, 1, A, a)
--pe2_pattern STR Designates difference second set of paired-end reads.
Default: ([Bb]|[Rr]2|2) (R2, r2, 2, AB b)
-r, --recursive Directories will be traversed recursively
--prefix STR Replace the absolute path with a given string.
Default: Use absolute path
--tsv Output FOFN as a TSV (Default JSON)
Optional Titan-GC Workflow Parameters:
--pangolin_docker STR
Docker image used to run Pangolin (takes priority over
--params)
--params STR A JSON file containing parameter values
The script takes three required positional inputs (in this specific order): fastq path, workflow, and primer bed file. The script infers sample name from the sequencing read fastq filenames.
FASTQ_PATH: relative or absolute path to directory containing fastq files
Optional parameters to modify how the script looks for and processes fastq files:
-f, --fastq_ext: .fastq.gz file extension is removed to infer sample name
--fastq_separator: underscore is used to infer where the sample name is. i.e. _1 or _2 is removed
--fastq_pattern: .fastq.gz read files are looked for. If you want to change it to look for fastqs, you would put '*.fastq'
(do not forget the single quotes or the wildcard will expand and substitute the files names)
--pe[1/2]_pattern: specifies how to differentiate between left and right reads. Accepts R1/2, r1/2, 1/2, A/B, a/b
-r, --recursive: look not only in current folder for reads files but all folders in this folder
WORKFLOW: What workflow to run
Options: clearlabs, illumina_pe, illumina_se, ont
PRIMER: relative or absolute paths accepted for location of primer bed file
Optional parameters:
--pangolin_docker: specify docker tag for version of pangolin you want to run; not compatible with --tsv
--params: relative or absolute paths accepted for location of params.json; not compatible with --tsv
The titan-gc pipeline can be run two different ways:
Directly (Step 3.1)
With Nextflow (Step 3.2)
Differences between the two implementations are primarily related to call-caching (how samples are submitted to the pipeline), job submission, and STDOUT.
Call-caching: The Nextflow wrapper offers the benefits of call-caching, essentially allowing failed runs to be resumed without rerunning finished analyses. This saves time and compute resources. Running the pipeline directly (Step 3.1) uses cromwell run, and call-caching is only available from Cromwell's server mode, which will retry failed processes 10 times before the workflow is aborted.
Job submission: The Nextflow wrapper individually submits each sample to the Titan workflow. If one sample fails, downstream samples are not affected. Alternatively, when running the pipeline directly (Step 3.1), failed samples can cause the entire workflow to abort before remaining subsequently queued samples can be run to completion.
STDOUT: The Nextflow wrapper outputs little information, but provide a summary message to report the number of samples finished before merging results. Running the pipeline directly can output all of Cromwell's very extensive run logs to STDOUT.
Note
Note: the first time you run the pipeline, it will pull all the containers, a process that can take up to 20-30 minutes.
For Singularity, the default cache directory is "~/.singularity/cache".
You can change this with command:
"export SINGULARITY_CACHEDIR=/path/to/newcache"
(this must be added to your "~/.bashrc" file to make the change permanent)
For Docker, the default cache directory is "/var/lib/docker". Changing the cache directory for Docker is more complicated, requiring edits to a config file, and will not be discussed in this protocol.
Running pipeline directly:
Command
(titan-cli) [me@PC ~]$ titan-gc --help
usage: titan-gc [-h] [-i STR] [--inputs STR] [-o STR] [--outdir STR] [--options STR] [--quiet]
titan-gc-cli - Run Titan GC on a set of samples.
required arguments:
-i STR, --inputs STR The JSON file to be used with Cromwell for inputs.
-o STR, --outdir STR Output directory to store the final results in.
optional arguments:
-h, --help Show this help message and exit
--options STR JSON file containing Cromwell options
--profile STR The backend profile to use [options: docker, singularity]
--config STR Custom backend profile to use
--cromwell_jar STR Path to cromwell.jar (Default use conda install)
--quiet Silence all STDOUT from Cromwell and titan-gc-organize
The command takes two required inputs: input JSON file and output directory
-i: relative or absolute path to input JSON file
-o: relative or absolute path to output directory
Optional parameters:
--profile: what container software to use.
options: docker(default), singularity.
--options: provide JSON file to modify cromwell run attributes
The running pipeline should produce extensive text output of Cromwall's run information.
If a sample fails, because Cromwell provides no call-caching, there are no options except to rerun the sample. If the input.json file included multiple samples some may have successfully completed before the workflow aborted and a new input.json should be prepared for rerunning only the failed samples.
DISCLAIMER: The titan-nf Nextflow wrapper is not officially supported by the Titan development team and is a separate activity supported by Robert Petit in Wyoming. This workflow is under active development and frequent updates are expected.
First, pull the Nextflow wrapper, which is not included in the Conda installation. Nexflow itself should already be installed by default. This step only needs to be completed once or whenever the Nextflow wrapper is updated. The workflow will be stored at "$HOME/.nextflow/assets/steamboat/titan-nf".
Note: execute while your Conda environment is active unless your system has Nextflow.
Command
nextflow pull -hub gitlab -r main steamboat/titan-nf
Running the Nextflow wrapper:
Command
(titan-cli) [me@PC ~]$ nextflow run -hub gitlab -r main steamboat/titan-nf --help
N E X T F L O W ~ version 21.04.0
Launching `steamboat/titan-nf` [berserk_ekeblad] - revision: e36cccd145 [main]
Required Parameters:
--samples FILE File produced by titan-gc-prepare.py (--tsv used)
Optional Parameters:
--titan_opts FILE A JSON file with Titan GC parameters to use.
Default: Use Titan-GC's defaults
--pangolin_docker STR The pangolin docker image to use
Default: Use Titan-GC's defaults
--profile STR Profile for Titan GC to use
Options: docker or singularity
Default: singularity
--outdir DIR Directory to write results to
Default: ./
--run_name STR Name to use for the results folder
Default: titan-gc
The command takes one required input: input TSV file
--samples: relative or absolute paths taken for input TSV file (created using --tsv option of titan-gc-prepare.py)
Optional parameters:
--titan_opts: relative or absolute paths accepted for location of JSON parameter file
--pangolin_docker: specify docker tag for version of pangolin you want to run
--profile: what container software to use.
options: docker, singularity(default).
--outdir: directory in which output folder is created. Default is the current working directory.
--run_name: name of output folder. Default is titan-gc.
Example command:
Command
nextflow run -hub gitlab -r main steamboat/titan-nf --samples input.txt
Example of running nextflow wrapper:
Command
(titan-cli) [me@PC test_seq]$ nextflow run -hub gitlab -r main steamboat/titan-nf --samples input.txt
N E X T F L O W ~ version 21.04.0
Launching `steamboat/titan-nf` [clever_mercator] - revision: 3af325405e [main]
executor > local (2)
[6e/1c5eb5] process > titan_gc (SRR13422799) [100%] 1 of 1 ✔
[41/fccd35] process > merge_results [100%] 1 of 1 ✔
Completed at: 16-Jul-2021 12:33:51
Duration : 3m 59s
CPU hours : 0.3
Succeeded : 2
If a sample fails, use the -resume option to retry the sample beyond the initial number of Nextflow attempts (default: 5 times). Some failures are not reproducible, and simply rerunning may solve the issue. Completed samples will not be rerun, thanks to call-caching, so the same input file can be reused without excluding the completed sampled.
(titan-cli) [me@PC test_seq]$ nextflow run -hub gitlab -r main steamboat/titan-nf --samples input.txt -resume
N E X T F L O W ~ version 21.04.0
Launching `steamboat/titan-nf` [ecstatic_bhabha] - revision: 3af325405e [main]
[6e/1c5eb5] process > titan_gc (SRR13422799) [100%] 1 of 1, cached: 1 ✔
[41/fccd35] process > merge_results [100%] 1 of 1, cached: 1 ✔
Output
Output
The output directory will include a structure similar to that shown below. Running the titan-gc pipeline using the Nextflow wrapper produces an additional "nf-info" folder. Read more about the Nextflow outputs at: https://www.nextflow.io/docs/latest/tracing.html
There are two places to investigate when troubleshooting:
Cromwell STDOUT: the STDOUT messages can indicate which step failed and where in the working directory to look
Cromwell working directory: this holds the STDOUT and STDERR log files for each task run
Cromwell STDOUT:
Running pipeline directly: cromwell-stdout.txt will be in your specified output directory
Running Nextflow wrapper: cromwell-stdout.txt will be in the titan-gc folder in the unique hash identifier folders in the work folder (which is by default in the directory you ran the pipeline from), e.g.:
Running pipeline directly: There will be a cromwell-executions folder (which is by default in the directory you ran the pipeline from) which contains a titan-gc folder which contains unique hash identifier folders, e.g.:
Running nextflow wrapper: In the nextflow work folder under the unique hash identifier, there will be a cromwell-executions folder which contains a titan-gc folder which contains unique hash identifier folders, e.g.: