
Sequence Quality Control

  • Hurwitz Lab
  • Metafunc Course 2017
Protocol Citation: James E Thornton Jr 2017. Sequence Quality Control. protocols.io https://dx.doi.org/10.17504/protocols.io.j2icqce
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
Created: September 25, 2017
Last Modified: January 01, 2018
Protocol Integer ID: 7978
Abstract
This protocol introduces a workflow for quality control and pre-processing of metagenomic sequence reads, using FastQC to visualize read quality and the FastX Toolkit to edit the FASTQ files.
Log in to the HPC.
Command
$ ssh hpc
$ ocelote
Make sure you have downloaded the FASTQ files for your project. This is detailed in a separate protocol.
Move into the directory containing your FASTQ files. 
Command
Replace username with YOUR username.
$ cd /rsgrps/bh_class/username/fastq
If your FASTQ files are still compressed (.gz extension), make sure to uncompress them.
Command
Remember, you must be in the directory containing your fastq files for this to work.
$ gunzip *.gz
Load FastQC:
Note
The FastQC tool will provide visualization of the sequence read quality for each of your samples.
Command
This command can be executed anywhere on the HPC to load the FastQC tool.
$ module load fastqc
Make a directory for FastQC output. 
Command
Make sure to replace username with the name of YOUR directory containing YOUR work.
$ mkdir /rsgrps/bh_class/username/quality_control
Run FastQC on all of your fastq files and store the output in the directory you made in the previous step. 
Command
Make sure to move into YOUR directory containing YOUR fastq files. The -o option tells fastqc which directory to write its output into.
$ fastqc *.fastq -o /rsgrps/bh_class/username/quality_control
Expected result
Started analysis of SRR1647144.fastq
Approx 5% complete for SRR1647144.fastq
Approx 10% complete for SRR1647144.fastq
Approx 15% complete for SRR1647144.fastq
Approx 20% complete for SRR1647144.fastq
Approx 25% complete for SRR1647144.fastq
Approx 30% complete for SRR1647144.fastq
Approx 35% complete for SRR1647144.fastq
Approx 40% complete for SRR1647144.fastq
Approx 45% complete for SRR1647144.fastq
Approx 50% complete for SRR1647144.fastq
Approx 55% complete for SRR1647144.fastq
Approx 60% complete for SRR1647144.fastq
Approx 65% complete for SRR1647144.fastq
Approx 70% complete for SRR1647144.fastq
Approx 75% complete for SRR1647144.fastq
Approx 80% complete for SRR1647144.fastq
Approx 85% complete for SRR1647144.fastq
Approx 90% complete for SRR1647144.fastq
Approx 95% complete for SRR1647144.fastq
Analysis complete for SRR1647144.fastq
Move into the quality_control directory that now contains the output of FastQC. Delete the .zip files that were created (we only need the .html files).
Command
CAUTION: rm -rf permanently deletes files; there is no undo and no trash. Files removed are gone FOREVER. Make sure you are in the directory containing the FastQC files and execute the command exactly as shown.
$ cd /rsgrps/bh_class/username/quality_control
$ rm -rf ./*.zip
In order to view the HTML summary files you must secure-copy (scp) them to your local machine. Open a new terminal (do not log into the HPC). Decide where you want to store the files on your local machine and move into that directory.
Note
Windows users using Cygwin: your files will be stored in C:/cygwin64/home/USER. Just open a new terminal window and proceed to the next step.
Execute the following command to scp the html files to your local machine:
Note
Keep in mind that any time FastQC is run again and a new .html summary file is generated, you must scp it to your local machine in order to view it.
Command
Replace jamesthornton with your own NetID. Also notice the period after *.html, which tells scp to put the files in the current directory: all files with the .html extension will be copied to the current directory on the local machine.
$ scp jamesthornton@login.hpc.arizona.edu:/rsgrps/bh_class/jetjr/quality_control/*.html .
Alternative method to obtain the .html files (skip this step if scp was successful for you).
If you are having trouble using scp, you can "push" the .html files to your abe487 GitHub repository. This requires copying them into your abe487 directory (used for the computational homework), then running git add, git commit, and git push.
Command
Make sure you are in the quality_control directory that contains your .html files. Once complete, you can go to your GitHub repository to download the report files for viewing.
$ pwd
/rsgrps/bh_class/username/quality_control
$ mkdir ~/abe487/fastqc
$ cp *.html ~/abe487/fastqc
$ cd !$
$ git add *.html
$ git commit -m 'adding qc reports'
$ git push
Now you can view the FastQC results from the .html files. 
Note
Again, Windows users using Cygwin will have to go to the C:/cygwin64/home/USER directory on your Windows machine to find the files. In the Windows 10 interface, open File Explorer, click "This PC", double-click Windows (C:), then cygwin64, then home, then USER. Your files should be there.
Command
This will open all .html files in the current directory in your default browser (the open command is macOS-specific; on Linux, try xdg-open). Remember, this runs in a local terminal (not connected to the HPC). Double-clicking the files will work as well.
$ open ./*.html
Examine the HTML summary for each sample to determine the quality-control steps and parameters needed to improve the reads. Keep in mind that each sample (file) will likely need different steps and parameters.
Load FastX Toolkit:
Command
The FastX Toolkit will allow us to perform quality-control steps on our sequences.
$ module load fastx
A summary of the tools available in the FastX Toolkit can be viewed at the link below.
Command-line usage for these tools: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
Note
We are showing you a limited view into this total software package. You can look here for more options.
The next few steps will introduce some FastX tools and how they can be used one at a time on a file. It's also possible to chain multiple FastX tools together, which is demonstrated in the final step.
The fastx_trimmer can be used if you see a decrease in quality at a specific base position:
Note
It is possible that only some or even none of your samples will be trimmed. Look at the FastQC output to determine this.
Note
Base quality usually decreases with read length for most NGS technologies.
Note
FastQC can tell you where the drop in quality occurs for the sequences in a given file; reference the FastQC results to decide the base position to trim from.
Command
where -f refers to the first base position to keep and -l to the last. NOTE: this is just an example of how to use the trimmer; the actual parameters depend on your samples.
$ fastx_trimmer -f 10 -l 200 -i [Infile] -o [outfile]
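The trimming rule is simple enough to sketch directly. The snippet below is illustrative only (not part of the protocol) and mimics what fastx_trimmer's -f/-l options do to a single record: positions are 1-based and inclusive, per the FastX Toolkit convention, and the quality string must be cut identically to the sequence. The example sequences are made up.

```python
def trim_record(seq, qual, first, last):
    """Keep bases from position `first` through `last` (1-based, inclusive),
    cutting the sequence and its quality string in lockstep."""
    return seq[first - 1:last], qual[first - 1:last]

# Drop the leading N (position 1) and keep through position 9:
seq  = "NACGTACGT"
qual = "!IIIIIIII"
trimmed_seq, trimmed_qual = trim_record(seq, qual, 2, 9)
print(trimmed_seq)   # ACGTACGT
```

Note that the same -f/-l positions are applied to every read in the file, which is why you should pick them from the FastQC per-base quality plot.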
The fastq_quality_filter can be used to filter out reads that fail to reach a specific quality score:
Note
It's a good idea to run the quality filter on all of your samples, even if the reads appear to have good quality already. The parameters used in the command below will act as a filter on your samples.
Note
Some reads are just bad, with poor quality throughout. We want to remove these reads; if they remain in the dataset, you will have issues downstream with assembly. Garbage in = garbage out.
Command
-q sets the minimum quality score to keep and -p the minimum percentage of bases that must have at least that quality.
$ fastq_quality_filter -q 20 -p 80 -i [infile] -o [outfile]
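To make the -q/-p rule concrete, here is an illustrative sketch (not part of the protocol) of the decision fastq_quality_filter -q 20 -p 80 makes per read: decode each quality character and keep the read only if at least 80% of its bases score Phred 20 or better. This assumes the common Phred+33 quality encoding; check your own data's encoding in the FastQC report.

```python
def passes_filter(qual, min_q=20, min_pct=80):
    """Return True if at least min_pct percent of bases have
    Phred quality >= min_q (quality string assumed Phred+33)."""
    scores = [ord(c) - 33 for c in qual]            # decode Phred+33
    good = sum(1 for s in scores if s >= min_q)     # bases at/above threshold
    return 100.0 * good / len(scores) >= min_pct

print(passes_filter("IIIIIIIIII"))    # 'I' decodes to Phred 40 -> True
print(passes_filter('""""""""""'))    # '"' decodes to Phred 1  -> False
```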
The fastx_clipper can remove reads below a certain minimum length. Remove reads that are less than 70 base pairs long by executing the following command:
Note
After you trim your reads, some may be very short. These reads are usually not long enough to contribute to downstream analyses such as assembly and taxonomic or functional annotation (we will go through these analysis steps later in the semester).
Command
-l sets the minimum read length; fastx_clipper will remove any reads shorter than 70 bp.
$ fastx_clipper -l 70 -i [infile] -o [outfile]
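The length cutoff is just a comparison per read. This illustrative sketch (not part of the protocol; the read strings are made up) shows the rule the -l 70 option enforces: reads of 70 bp or more are kept, anything shorter is discarded.

```python
def long_enough(seq, min_len=70):
    """Keep a read only if it is at least min_len bases long."""
    return len(seq) >= min_len

reads = ["A" * 100, "A" * 69, "A" * 70]
kept = [r for r in reads if long_enough(r)]
print(len(kept))   # 2 (the 69 bp read is discarded)
```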
Finally, fastx_collapser will collapse identical sequences into a single one. The collapser should always come last in the workflow because its output is in FASTA format instead of FASTQ.
Note
Here we are removing duplicates produced by the sequencing technology. These data can bias your final results, so they need to be removed.
Command
output will be in fasta format.
$ fastx_collapser -i [infile] -o [outfile]
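Conceptually, collapsing is just counting duplicates and emitting one FASTA record per unique sequence. The sketch below is illustrative (not part of the protocol); the ">rank-count" header pattern is an assumption modeled on typical fastx_collapser output, so check the headers in your own results.

```python
from collections import Counter

def collapse(seqs):
    """Merge identical sequences into single FASTA records,
    recording the duplicate count in each header."""
    counts = Counter(seqs)
    fasta = []
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        fasta.append(f">{rank}-{n}")   # header: rank, then copy count
        fasta.append(seq)
    return "\n".join(fasta)

print(collapse(["ACGT", "ACGT", "TTTT"]))
```

Note that the per-base quality information is gone after this step, which is exactly why quality trimming and filtering must happen first.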
You can also pipe together multiple commands: 
Note
Makefiles are also a great way to do this. See Ken's GitBook and your homework assignment for how to do this.
Note
You can also do this on the fly with a for loop. Note that the names of the input files are listed in a file called "list".
% for file in `cat list`; do
>   cat $file.fq | fastx_trimmer -f 12 -l 300 | fastq_quality_filter -q 20 -p 80 | fastx_clipper -l 60 | fastx_collapser > $file.fasta
> done
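To tie the whole workflow together, here is an illustrative end-to-end sketch (not part of the protocol) of what the piped loop above does to a list of (sequence, quality) records, using the same parameters: trim to positions 12-300, keep reads with at least 80% of bases at Phred >= 20, drop reads shorter than 60 bp, then collapse duplicates. Phred+33 encoding and the toy records are assumptions.

```python
def qc_pipeline(records, first=12, last=300, min_q=20, min_pct=80, min_len=60):
    """Trim, quality-filter, length-filter, and collapse a list of
    (seq, qual) tuples; returns the unique surviving sequences."""
    out = []
    for seq, qual in records:
        seq, qual = seq[first - 1:last], qual[first - 1:last]   # trim (1-based, inclusive)
        scores = [ord(c) - 33 for c in qual]                    # decode Phred+33
        if not scores:
            continue                                            # nothing left after trimming
        if 100.0 * sum(s >= min_q for s in scores) / len(scores) < min_pct:
            continue                                            # quality filter
        if len(seq) < min_len:
            continue                                            # length filter
        out.append(seq)
    return sorted(set(out))                                     # collapse duplicates

records = [
    ("ACGT" * 30, "I" * 120),   # good 120 bp read
    ("ACGT" * 30, "I" * 120),   # exact duplicate, will be collapsed
    ("ACGT" * 10, "I" * 40),    # only 29 bp after trimming -> dropped
]
print(len(qc_pipeline(records)))   # 1
```

As in the shell pipeline, order matters: trimming changes which bases the quality filter sees, and collapsing must come last because it discards quality information.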