Quality control for metagenomics data

Qi Wang

Apr 16, 2020

GigaByte

Peer-reviewed method

DOI

dx.doi.org/10.17504/protocols.io.be68jhhw

Qi Wang¹

¹BGI

GigaScience Press
BGI

External link: https://doi.org/10.46471/gigabyte.12

Protocol Citation: Qi Wang 2020. Quality control for metagenomics data. protocols.io https://dx.doi.org/10.17504/protocols.io.be68jhhw

Manuscript citation:

Qi Wang, Qiang Sun, Xiaoping Li, Zhefeng Wang, Haotian Zheng, Yanmei Ju, Ruijin Guo, Songlin Peng, Huijue Jia, Linking gut microbiome to bone mineral density: a shotgun metagenomic dataset from 361 elderly women, Gigabyte, 2021 https://doi.org/10.46471/gigabyte.12

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: April 16, 2020

Last Modified: April 16, 2020

Protocol Integer ID: 35776

Keywords: Quality control, Metagenomics data, remove low quality reads, remove host contamination reads

Abstract

Quality control for metagenomics data, including removal of low-quality reads and host contamination reads.

Guidelines

Quality control for metagenomics data, including removal of low-quality reads and host contamination reads.

Safety warnings

No

Before start

The user should provide single-end or paired-end metagenomics data.

Step 1: remove low-quality reads:
We first calculate the accuracy probability of each base using the following equations:
1) Q = -10 log10(E)
2) P = 1 - E
where Q is the Phred quality score of each base, E is the error probability of each base, and P is the accuracy probability of each base.
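
The two equations above can be sketched in a few lines of Python. This is a minimal illustration, not part of the OAFilter scripts: the function names are hypothetical, and the Phred+33 quality encoding used by `decode_quality` is an assumption, since the protocol does not state which encoding the input FASTQ files use.

```python
def phred_to_error(q):
    """Invert Q = -10 * log10(E) to get the error probability E."""
    return 10.0 ** (-q / 10.0)


def phred_to_accuracy(q):
    """Accuracy probability of a base: P = 1 - E."""
    return 1.0 - phred_to_error(q)


def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    The offset of 33 (Phred+33) is an assumption, not stated in the protocol.
    """
    return [ord(char) - offset for char in qual_string]
```

For instance, a base with Q = 20 has E = 0.01 and therefore P = 0.99, while a base with Q = 10 has P = 0.9.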

Then we calculate the overall accuracy probability (OA) of each read using the following equation:

For each read, an initial 30-mer seed sequence is selected at the 5' end of the read and its overall accuracy, defined as OAseed, is calculated. To ensure high data quality, the OAseed threshold is set to 0.9 with m equal to 0, i.e. zero low-quality bases are allowed in the seed. Once the seed position of the read has been defined, the seed is extended to keep the longest contiguous read fragment in which the OA, defined as OAfragment, is above a defined accuracy threshold. In this study, we require OAfragment to be equal to or greater than 0.8 with m equal to 1.
The Perl scripts for the overall-accuracy-based QC pipeline are freely available for download and reuse from GitHub (https://github.com/Scelta/OAFilter).
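
The seed-and-extend logic described above can be sketched as follows. This is a hedged illustration only, not the OAFilter implementation. Two points are assumptions made for this sketch, because the text does not fix them: OA is taken to be the product of the per-base accuracies P after ignoring the m lowest values, and a read whose first 30-mer fails the OAseed threshold is discarded outright. All function and parameter names are hypothetical.

```python
import math


def accuracies(quals):
    """Per-base accuracy P = 1 - 10^(-Q/10) for a list of Phred scores."""
    return [1.0 - 10.0 ** (-q / 10.0) for q in quals]


def overall_accuracy(probs, m=0):
    """ASSUMED form of OA: product of P, ignoring the m lowest values."""
    return math.prod(sorted(probs)[m:])


def trim_read(quals, seed_len=30, oa_seed=0.9, oa_fragment=0.8,
              m_seed=0, m_fragment=1):
    """Return the length of the 5' fragment to keep, or 0 to drop the read.

    Assumption: the seed is always the first seed_len bases, and the read
    is dropped if that seed fails the OAseed threshold.
    """
    probs = accuracies(quals)
    if len(probs) < seed_len:
        return 0
    if overall_accuracy(probs[:seed_len], m_seed) < oa_seed:
        return 0
    end = seed_len
    # Extend one base at a time while the fragment still meets OAfragment.
    while (end < len(probs)
           and overall_accuracy(probs[:end + 1], m_fragment) >= oa_fragment):
        end += 1
    return end
```

Under these assumptions, a 100-base read with Q = 40 throughout is kept in full, whereas a read of 50 bases at Q = 40 followed by 50 bases at Q = 2 is cut to 51 bases, because m equal to 1 tolerates exactly one low-quality base in the fragment. Recomputing the product for each extension keeps the sketch short; a production script would update it incrementally.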

Step 2: remove host contamination reads with a single command (the output directory 02.rmhost must exist before the command is run):
'bowtie2 --very-sensitive -p $thread -x $host_bowtie2_index -1 $sample_r1 -2 $sample_r2 2> 02.rmhost/bowtie2.log | samtools view -h - | samtools sort -n - | samtools fastq -N -c 5 -f 12 -1 02.rmhost/$name.rmhost.1.fq.gz -2 02.rmhost/$name.rmhost.2.fq.gz -'
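
In the command above, the option -f 12 is what separates host from non-host reads: 12 is the sum of the SAM flag bits 4 (read unmapped) and 8 (mate unmapped), so only pairs in which neither mate aligned to the host index are written out. The short Python sketch below mimics that test on a few flag values; the function name is hypothetical and the code only illustrates the bit logic, it does not call samtools.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate did not align


def passes_required_flags(flag, required=READ_UNMAPPED | MATE_UNMAPPED):
    """Mimic a '-f' style filter: keep a record only if every required bit is set."""
    return (flag & required) == required


# Flags 77 and 141 are the two mates of a fully unmapped pair (kept).
# Flag 69 is an unmapped read whose mate mapped to the host (discarded).
# Flag 99 is a properly paired, mapped read (discarded).
for example_flag in (77, 141, 69, 99):
    print(example_flag, passes_required_flags(example_flag))
```

Requiring both bits is the stricter choice: a pair is discarded as host contamination as soon as one of its mates aligns to the host genome.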
