
Sequence Quality Control

  • Hurwitz Lab
  • Metafunc Course 2017
Protocol Citation: James E Thornton Jr 2017. Sequence Quality Control. protocols.io https://dx.doi.org/10.17504/protocols.io.j2icqce
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
Created: September 25, 2017
Last Modified: January 01, 2018
Protocol Integer ID: 7978
Abstract
This protocol introduces a workflow for quality control and pre-processing of metagenomic sequence reads, using FastQC to visualize read quality and the FastX Toolkit to edit the FASTQ files.
Log in to the HPC.
Command
$ ssh hpc
$ ocelote
Make sure you have downloaded the FASTQ files for your project. This is detailed in a separate protocol.
Move into the directory containing your FASTQ files. 
Command
Replace username with YOUR username.
$ cd /rsgrps/bh_class/username/fastq
If your FASTQ files are still compressed (.gz extension), make sure to uncompress them.
Command
Remember, you must be in the directory containing your fastq files for this to work.
$ gunzip *.gz
Load FastQC:
Note
The FastQC tool will provide visualization of the sequence read quality for each of your samples.
Command
This command can be executed anywhere on the HPC to load the FastQC tool.
$ module load fastqc
Make a directory for FastQC output. 
Command
Make sure to replace username with the name of YOUR directory containing YOUR work.
$ mkdir /rsgrps/bh_class/username/quality_control
Run FastQC on all of your fastq files and store the output in the directory you made in the previous step. 
Command
Make sure to move into YOUR directory containing YOUR fastq files. The -o option tells fastqc which directory to write its output into.
$ fastqc *.fastq -o /rsgrps/bh_class/username/quality_control
Expected result
Started analysis of SRR1647144.fastq
Approx 5% complete for SRR1647144.fastq
Approx 10% complete for SRR1647144.fastq
Approx 15% complete for SRR1647144.fastq
Approx 20% complete for SRR1647144.fastq
Approx 25% complete for SRR1647144.fastq
Approx 30% complete for SRR1647144.fastq
Approx 35% complete for SRR1647144.fastq
Approx 40% complete for SRR1647144.fastq
Approx 45% complete for SRR1647144.fastq
Approx 50% complete for SRR1647144.fastq
Approx 55% complete for SRR1647144.fastq
Approx 60% complete for SRR1647144.fastq
Approx 65% complete for SRR1647144.fastq
Approx 70% complete for SRR1647144.fastq
Approx 75% complete for SRR1647144.fastq
Approx 80% complete for SRR1647144.fastq
Approx 85% complete for SRR1647144.fastq
Approx 90% complete for SRR1647144.fastq
Approx 95% complete for SRR1647144.fastq
Analysis complete for SRR1647144.fastq
Move into the quality_control directory that now contains the output of FastQC. Delete the .zip files that were created (we only need the .html files).
Command
CAUTION: rm -rf permanently deletes files; there is no undo and no trash. Files removed are gone FOREVER. Make sure you are in the directory containing the FastQC files and execute the command exactly as shown.
$ cd /rsgrps/bh_class/username/quality_control
$ rm -rf ./*.zip
In order to view the HTML summary files you must secure-copy (scp) them to your local machine. Open a new terminal (do not log into the HPC). Decide where you want to store the files on your local machine and move into that directory.
Note
Windows users using Cygwin: your files will be stored in C:/cygwin64/home/USER. Just open a new terminal window and proceed to the next step.
Execute the following command to scp the html files to your local machine:
Note
Keep in mind that any time FastQC is run again and a new .html summary file is generated, you must scp it to your local machine in order to view it.
Command
Replace jamesthornton with your own NetID. Also notice the period after *.html, which tells scp to put the files in the current directory: all files with the .html extension will be copied to the current directory on the local machine.
$ scp jamesthornton@login.hpc.arizona.edu:/rsgrps/bh_class/jetjr/quality_control/*.html .
Alternative method to obtain the .html files (skip this step if scp was successful for you).
If you are having trouble using scp, you can "push" the .html files to your abe487 GitHub repository. This requires copying them into your abe487 directory (used for the computational homework), then running git add, git commit, and git push.
Command
Make sure you are in the quality_control directory that contains your .html files. Once complete, you can go to your GitHub repository to download the report files for viewing.
$ pwd
/rsgrps/bh_class/username/quality_control
$ mkdir ~/abe487/fastqc
$ cp *.html ~/abe487/fastqc
$ cd !$
$ git add *.html
$ git commit -m 'adding qc reports'
$ git push
Now you can view the FastQC results from the .html files. 
Note
Again, Windows users using Cygwin will have to go to the C:/cygwin64/home/USER directory on your Windows machine to find the files. In the Windows 10 interface, open File Explorer, click "This PC", double-click Windows (C:), then cygwin64, then home, then USER. Your files should be there.
Command
This will open all .html files in the current directory in your default browser (the open command is macOS-specific; on Linux, try xdg-open). Remember, this runs in a local terminal (not connected to the HPC). Double-clicking the files will work as well.
$ open ./*.html
Examine the HTML summary for each sample to determine the quality-control steps and parameters needed to improve the reads. Keep in mind that each sample (file) will likely need different steps and parameters.
Load FastX Toolkit:
Command
The FastX Toolkit will allow us to perform quality-control steps on our sequences.
$ module load fastx
A summary of the tools available in the FastX Toolkit can be viewed at the link below.
Command-line usage for these tools: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
Note
We are showing you a limited view into this total software package. You can look here for more options.
The next few steps will introduce some FastX tools and how they can be used one at a time on a file. It's also possible to chain multiple FastX tools together, which is demonstrated in the final step.
The fastx_trimmer can be used if you see a decrease in quality at a specific base position:
Note
It is possible that only some or even none of your samples will be trimmed. Look at the FastQC output to determine this.
Note
Base quality usually decreases with read length for most NGS technologies.
Note
FastQC can tell you where the drop in quality occurs for the sequences in a given file; reference the FastQC results to decide the base position to trim from.
Command
where -f refers to the first base position to keep and -l to the last. NOTE: this is just an example of how to use the trimmer; the actual parameters depend on your samples.
$ fastx_trimmer -f 10 -l 200 -i [Infile] -o [outfile]
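The trimming rule is simple enough to sketch directly. The snippet below is illustrative only (not part of the protocol) and mimics what fastx_trimmer's -f/-l options do to a single record: positions are 1-based and inclusive, per the FastX Toolkit convention, and the quality string must be cut identically to the sequence. The example sequences are made up.

```python
def trim_record(seq, qual, first, last):
    """Keep bases from position `first` through `last` (1-based, inclusive),
    cutting the sequence and its quality string in lockstep."""
    return seq[first - 1:last], qual[first - 1:last]

# Drop the leading N (position 1) and keep through position 9:
seq  = "NACGTACGT"
qual = "!IIIIIIII"
trimmed_seq, trimmed_qual = trim_record(seq, qual, 2, 9)
print(trimmed_seq)   # ACGTACGT
```

Note that the same -f/-l positions are applied to every read in the file, which is why you should pick them from the FastQC per-base quality plot.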
The fastq_quality_filter can be used to filter out reads that fail to reach a specific quality score:
Note
It's a good idea to run the quality filter on all of your samples, even if the reads appear to have good quality already. The parameters used in the command below will act as a filter on your samples.
Note
Some reads are just bad, with poor quality throughout. We want to remove these reads; if they remain in the dataset, you will have issues downstream with assembly. Garbage in = garbage out.
Command
-q sets the minimum quality score to keep and -p the minimum percentage of bases that must have at least that quality.
$ fastq_quality_filter -q 20 -p 80 -i [infile] -o [outfile]
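To make the -q/-p rule concrete, here is an illustrative sketch (not part of the protocol) of the decision fastq_quality_filter -q 20 -p 80 makes per read: decode each quality character and keep the read only if at least 80% of its bases score Phred 20 or better. This assumes the common Phred+33 quality encoding; check your own data's encoding in the FastQC report.

```python
def passes_filter(qual, min_q=20, min_pct=80):
    """Return True if at least min_pct percent of bases have
    Phred quality >= min_q (quality string assumed Phred+33)."""
    scores = [ord(c) - 33 for c in qual]            # decode Phred+33
    good = sum(1 for s in scores if s >= min_q)     # bases at/above threshold
    return 100.0 * good / len(scores) >= min_pct

print(passes_filter("IIIIIIIIII"))    # 'I' decodes to Phred 40 -> True
print(passes_filter('""""""""""'))    # '"' decodes to Phred 1  -> False
```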
The fastx_clipper can remove reads below a certain minimum length. Remove reads that are less than 70 base pairs long by executing the following command:
Note
After you trim your reads, some may be very short. These reads are usually not long enough to contribute to downstream analyses such as assembly and taxonomic or functional annotation (we will go through these analysis steps later in the semester).
Command
-l sets the minimum read length; fastx_clipper will remove any reads shorter than 70 bp.
$ fastx_clipper -l 70 -i [infile] -o [outfile]
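The length cutoff is just a comparison per read. This illustrative sketch (not part of the protocol; the read strings are made up) shows the rule the -l 70 option enforces: reads of 70 bp or more are kept, anything shorter is discarded.

```python
def long_enough(seq, min_len=70):
    """Keep a read only if it is at least min_len bases long."""
    return len(seq) >= min_len

reads = ["A" * 100, "A" * 69, "A" * 70]
kept = [r for r in reads if long_enough(r)]
print(len(kept))   # 2 (the 69 bp read is discarded)
```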
Finally, fastx_collapser will collapse identical sequences into a single one. The collapser should always come last in the workflow because its output is in FASTA format instead of FASTQ.
Note
Here we are removing duplicates produced by the sequencing technology. These data can bias your final results, so they need to be removed.
Command
output will be in fasta format.
$ fastx_collapser -i [infile] -o [outfile]
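Conceptually, collapsing is just counting duplicates and emitting one FASTA record per unique sequence. The sketch below is illustrative (not part of the protocol); the ">rank-count" header pattern is an assumption modeled on typical fastx_collapser output, so check the headers in your own results.

```python
from collections import Counter

def collapse(seqs):
    """Merge identical sequences into single FASTA records,
    recording the duplicate count in each header."""
    counts = Counter(seqs)
    fasta = []
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        fasta.append(f">{rank}-{n}")   # header: rank, then copy count
        fasta.append(seq)
    return "\n".join(fasta)

print(collapse(["ACGT", "ACGT", "TTTT"]))
```

Note that the per-base quality information is gone after this step, which is exactly why quality trimming and filtering must happen first.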
You can also pipe together multiple commands: 
Note
Makefiles are also a great way to do this. See Ken's GitBook and your homework assignment for how to do this.
Note
You can also do this on the fly with a for loop. Note that the names of the input files are listed in a file called "list".
% for file in `cat list`; do
>   cat $file.fq | fastx_trimmer -f 12 -l 300 | fastq_quality_filter -q 20 -p 80 | fastx_clipper -l 60 | fastx_collapser > $file.fasta
> done
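To tie the whole workflow together, here is an illustrative end-to-end sketch (not part of the protocol) of what the piped loop above does to a list of (sequence, quality) records, using the same parameters: trim to positions 12-300, keep reads with at least 80% of bases at Phred >= 20, drop reads shorter than 60 bp, then collapse duplicates. Phred+33 encoding and the toy records are assumptions.

```python
def qc_pipeline(records, first=12, last=300, min_q=20, min_pct=80, min_len=60):
    """Trim, quality-filter, length-filter, and collapse a list of
    (seq, qual) tuples; returns the unique surviving sequences."""
    out = []
    for seq, qual in records:
        seq, qual = seq[first - 1:last], qual[first - 1:last]   # trim (1-based, inclusive)
        scores = [ord(c) - 33 for c in qual]                    # decode Phred+33
        if not scores:
            continue                                            # nothing left after trimming
        if 100.0 * sum(s >= min_q for s in scores) / len(scores) < min_pct:
            continue                                            # quality filter
        if len(seq) < min_len:
            continue                                            # length filter
        out.append(seq)
    return sorted(set(out))                                     # collapse duplicates

records = [
    ("ACGT" * 30, "I" * 120),   # good 120 bp read
    ("ACGT" * 30, "I" * 120),   # exact duplicate, will be collapsed
    ("ACGT" * 10, "I" * 40),    # only 29 bp after trimming -> dropped
]
print(len(qc_pipeline(records)))   # 1
```

As in the shell pipeline, order matters: trimming changes which bases the quality filter sees, and collapsing must come last because it discards quality information.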