Sequencing and data quality control

Adriana Alberti; Julie Poulain; Stefan Engelen; Karine Labadie; Sarah Romac; Isabel Ferrera; Guillaume Albini; Jean-Marc Aury; Caroline Belser; Alexis Bertrand; Corinne Cruaud; Corinne Da Silva; Carole Dossat; Frédéric Gavory; Shahinaz Gas; Julie Guy; Maud Haquelle; E'krame Jacoby; Olivier Jaillon; Arnaud Lemainque; Eric Pelletier; Gaëlle Samson; Marc Wessner; Genoscope Technical Team; Silvia G. Acinas; Marta Royo-Llonch; Francisco M. Cornejo-Castillo; Ramiro Logares; Beatriz Fernández-Gómez; Chris Bowler; Guy Cochrane; Clara Amid; Petra Ten Hoopen; Colomban De Vargas; Nigel Grimsley; Elodie Desgranges; Stefanie Kandels-Lewis; Hiroyuki Ogata; Nicole Poulton; Michael E. Sieracki; Ramunas Stepanauskas; Matthew B. Sullivan; Jennifer R. Brum; Melissa B. Duhaime; Bonnie T. Poulos; Bonnie L. Hurwitz; Stéphane Pesant; Eric Karsenti; Patrick Wincker

Jan 09, 2020

DOI

https://dx.doi.org/10.17504/protocols.io.qwjdxcn

Adriana Alberti¹,
Julie Poulain²,
Stefan Engelen²,
Karine Labadie²,
Sarah Romac^3,4,
Isabel Ferrera⁵,
Guillaume Albini²,
Jean-Marc Aury²,
Caroline Belser²,
Alexis Bertrand²,
Corinne Cruaud²,
Corinne Da Silva²,
Carole Dossat²,
Frédéric Gavory²,
Shahinaz Gas²,
Julie Guy²,
Maud Haquelle²,
E'krame Jacoby²,
Olivier Jaillon^2,6,7,
Arnaud Lemainque²,
Eric Pelletier²,
Gaëlle Samson²,
Marc Wessner²,
Genoscope Technical Team²,
Silvia G. Acinas⁵,
Marta Royo-Llonch⁵,
Francisco M. Cornejo-Castillo⁵,
Ramiro Logares⁵,
Beatriz Fernández-Gómez^5,8,9,
Chris Bowler¹⁰,
Guy Cochrane¹¹,
Clara Amid¹¹,
Petra Ten Hoopen¹¹,
Colomban De Vargas^3,4,
Nigel Grimsley^12,13,
Elodie Desgranges^12,13,
Stefanie Kandels-Lewis^14,15,
Hiroyuki Ogata¹⁶,
Nicole Poulton¹⁷,
Michael E. Sieracki^17,18,
Ramunas Stepanauskas¹⁷,
Matthew B. Sullivan^19,20,
Jennifer R. Brum^20,21,
Melissa B. Duhaime²²,
Bonnie T. Poulos²³,
Bonnie L. Hurwitz²⁴,
Stéphane Pesant^25,26,
Eric Karsenti^10,14,27,
Patrick Wincker^2,6,7

¹CEA, Institut de Biologie Intégrative de la Cellule;
²CEA - Institut de Biologie François Jacob, Genoscope, Evry, France;
³CNRS, UMR 7144, Station Biologique de Roscoff, France;
⁴Sorbonne Universités, UPMC Univ Paris 06, UMR 7144, Station Biologique de Roscoff, France;
⁵Departament de Biologia Marina i Oceanografia, Institute of Marine Sciences (ICM), CSIC, Barcelona, Spain;
⁶CNRS, UMR 8030, Evry, France;
⁷Université d'Evry, UMR 8030, Evry, France;
⁸FONDAP Center for Genome Regulation, Santiago, Chile;
⁹Laboratorio de Bioinformática y Expresión Génica, Instituto de Nutrición y Tecnología de los Alimentos (INTA), Universidad de Chile, El Libano Macul, Santiago, Chile;
¹⁰Ecole Normale Supérieure, PSL Research University, Institut de Biologie de l’Ecole Normale Supérieure (IBENS), CNRS UMR 8197, INSERM U1024, Paris, France;
¹¹European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK;
¹²CNRS UMR 7232, BIOM, Banyuls-sur-Mer, France;
¹³Sorbonne Universités Paris 06, OOB UPMC, Banyuls-sur-Mer, France;
¹⁴Directors’ Research European Molecular Biology Laboratory, Heidelberg, Germany;
¹⁵Structural and Computational Biology, European Molecular Biology Laboratory, Heidelberg, Germany;
¹⁶Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan;
¹⁷Bigelow Laboratory for Ocean Sciences, East Boothbay, Maine, USA;
¹⁸National Science Foundation, Arlington, Virginia, USA;
¹⁹Departments of Microbiology and Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, Ohio, USA;
²⁰Department of Microbiology, The Ohio State University, Columbus, Ohio, USA;
²¹Present address: Department of Oceanography and Coastal Sciences, Louisiana State University, Baton Rouge, Louisiana, USA;
²²Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA;
²³University of Arizona, Tucson, Arizona, USA;
²⁴Department of Agricultural and Biosystems Engineering, University of Arizona, Tucson, Arizona, USA;
²⁵MARUM, Center for Marine Environmental Sciences, University of Bremen, Germany;
²⁶PANGAEA, Data Publisher for Earth and Environmental Science, University of Bremen, Germany;
²⁷Sorbonne Universités, UPMC Université Paris 06, CNRS, Laboratoire d’oceanographie de Villefranche (LOV), Observatoire Océanologique, Villefranche-sur-mer, France

Tara Oceans

Adriana Alberti

CEA, Institut de Biologie Intégrative de la Cellule

External link: https://www.nature.com/articles/sdata201793#methods

Protocol Citation: Adriana Alberti, Julie Poulain, Stefan Engelen, Karine Labadie, Sarah Romac, Isabel Ferrera, Guillaume Albini, Jean-Marc Aury, Caroline Belser, Alexis Bertrand, Corinne Cruaud, Corinne Da Silva, Carole Dossat, Frédéric Gavory, Shahinaz Gas, Julie Guy, Maud Haquelle, E'krame Jacoby, Olivier Jaillon, Arnaud Lemainque, Eric Pelletier, Gaëlle Samson, Marc Wessner, Genoscope Technical Team, Silvia G. Acinas, Marta Royo-Llonch, Francisco M. Cornejo-Castillo, Ramiro Logares, Beatriz Fernández-Gómez, Chris Bowler, Guy Cochrane, Clara Amid, Petra Ten Hoopen, Colomban De Vargas, Nigel Grimsley, Elodie Desgranges, Stefanie Kandels-Lewis, Hiroyuki Ogata, Nicole Poulton, Michael E. Sieracki, Ramunas Stepanauskas, Matthew B. Sullivan, Jennifer R. Brum, Melissa B. Duhaime, Bonnie T. Poulos, Bonnie L. Hurwitz, Stéphane Pesant, Eric Karsenti, Patrick Wincker 2020. Sequencing and data quality control. protocols.io https://dx.doi.org/10.17504/protocols.io.qwjdxcn

Manuscript citation:

Alberti, A. et al. (2017). Viral to metazoan marine plankton nucleotide sequences from the Tara Oceans expedition. Scientific Data 4, 170093.
doi: 10.1038/sdata.2017.93

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: June 11, 2018

Last Modified: January 09, 2020

Protocol Integer ID: 12971

Keywords: viral to metazoan marine plankton nucleotide sequence, metazoan marine plankton nucleotide sequence, tara oceans expedition, data quality control this protocol, nucleic acids to sequence, data quality control, oceans expedition, nucleic acid, protocol describe, sequence, viral, data, tara, protocol, overview of experimental pipeline

Abstract

This protocol describes the sequencing and data quality control for the Tara Oceans expedition and is part of "Viral to metazoan marine plankton nucleotide sequences from the Tara Oceans expedition".
 
Figure 3: Overview of experimental pipeline from nucleic acids to sequences. (Red crosses highlight QC steps where experiments can be stopped.)

Attachments

Viral to metazoan ma...

1.9MB

technote_Q-Scores.pd...

232KB

Guidelines

1. Sequencing library quality control
All libraries were quantified first by Qubit dsDNA HS Assay measurement and then by qPCR with the KAPA Library Quantification Kit for Illumina Libraries (Kapa Biosystems) on an MXPro instrument (Agilent Technologies). Library profiles were assessed using the DNA High Sensitivity LabChip kit on an Agilent Bioanalyzer. Later on, this quality control step was implemented with PicoGreen quantification on 96-well plates and a high-throughput microfluidic capillary electrophoresis system (LabChip GX, Perkin Elmer, Waltham, MA) for library profile analysis.
 
2. Sequencing procedures
Library concentrations were normalized to 10 nM by adding 10 mM Tris-Cl, pH 8.5, and the libraries were then submitted to cluster generation according to the Illumina cBot User Guide (Part # 15006165). Libraries were sequenced on Genome Analyzer IIx, HiSeq2000 or HiSeq2500 instruments (Illumina) in paired-end mode. Read lengths were chosen to suit the needs of the downstream bioinformatics analyses (Table 2).
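The normalization to 10 nM is a plain C1 x V1 = C2 x V2 dilution. The sketch below is only an illustration of that arithmetic, assuming the library's molar concentration is already known (for example from the qPCR quantification); the function name, the argument names and the example volumes are hypothetical and not part of the protocol.

```python
def diluent_volume_ul(stock_nm: float, stock_ul: float, target_nm: float = 10.0) -> float:
    """Volume of 10 mM Tris-Cl (pH 8.5) to add so the library reaches target_nm.

    Uses C1 * V1 = C2 * V2. Names and defaults are illustrative assumptions.
    """
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    final_ul = stock_nm * stock_ul / target_nm  # total volume after dilution
    return final_ul - stock_ul                  # buffer to add


# Hypothetical example: 5 uL of a 25 nM library.
print(diluent_volume_ul(25.0, 5.0))
```

A library measured below 10 nM cannot be normalized by dilution, which is why the sketch raises an error rather than returning a negative volume.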
 
Table 2: Summary of libraries generated from Tara Oceans DNA and RNA samples and sequencing experiments performed on each type of library.

Metabarcoding and metatranscriptomic libraries had low-diversity sequences at the beginning of the reads, owing respectively to the primer sequences used to amplify the 18S and 16S tags and to the low-complexity polynucleotides added during cDNA synthesis. Low-diversity libraries can interfere with correct cluster identification, resulting in a drastic loss of data output. Therefore, the loading concentration of these libraries (8–9 pM instead of 12–14 pM for standard libraries) and the PhiX DNA spike-in (10% instead of 1%) were adapted to minimize the impact on run quality.
 
Sequencing was performed according to the Genome Analyzer IIx User Guide (Part # 15018814), HiSeq2000 System User Guide (Part # 15011190) and HiSeq2500 System User Guide (Part # 15035786).
 
Data quality control and filtering
The first step in the data quality control process was the primary analysis performed during the sequencing run by the Illumina Real Time Analysis (RTA) software (Code availability 1). This tool analyses images and cluster intensities and filters them to remove low-quality data. It also performs base calling and calculates the Phred quality score (Q score), which indicates the probability that a given base is called incorrectly. The Q score is the most common metric used to assess the accuracy of a sequencing experiment (http://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf). After the raw BCL files generated by RTA were converted to demultiplexed FASTQ data by the Illumina bcl2fastq Conversion software (Code availability 2), in-house filtering and quality control treatments developed at Genoscope were applied to the reads that passed the Illumina quality filters (named raw reads). The parameters of these controls are indicated in Fig. 2.
 
 Figure 2: Data processing flowchart.
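For reference, the Phred Q score relates to the base-calling error probability P by Q = -10 log10(P), so Q20 corresponds to a 1% error rate. The following sketch shows the conversion in both directions and decodes a FASTQ quality string. The ASCII offset of 33 is an assumption (older Illumina pipelines used an offset of 64), and the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string; offset 33 is assumed, not confirmed by the protocol."""
    return [ord(c) - offset for c in qual]


print(phred_to_error_prob(20))     # error probability at the Q20 trimming threshold
print(decode_fastq_quality("I5"))  # two example quality characters
```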
 
This processing yields high-quality data and improves subsequent analyses.
 
Filtering steps were applied on whole raw reads as shown in Steps 1-3. 
 
Data quality control was performed on random subsets of 20,000 reads before (raw reads) and/or after filtering steps (clean reads) as shown in Steps 4-8. 
 
Code availability

1. Real Time Analysis software, http://support.illumina.com/sequencing/sequencing_software/real-time_analysis_rta/downloads.html
 
2. bcl2fastq Conversion software, http://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html
 
3. fastx_clean software, http://www.genoscope.cns.fr/fastxtend
 
4. FASTX-Toolkit, http://hannonlab.cshl.edu/fastx_toolkit/index.html
 
5. fastx_estimate_duplicate software, http://www.genoscope.cns.fr/fastxtend
 
6. fastx_mergepairs software, http://www.genoscope.cns.fr/fastxtend

Filtering

Remove the sequences of the Illumina adapters and primers used during library construction from the whole reads. Remove low-quality nucleotides (quality value < 20) from both ends. Keep the longest sequence without adapters and low-quality bases. Trim the sequence from the second unknown nucleotide (N) to the end of the read. Discard reads shorter than 30 nucleotides after trimming.
Note
These trimming steps are performed with fastx_clean (Code availability 3), software based on the FASTX library (Code availability 4).
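The trimming rules of this step can be illustrated with a short sketch. This is not the fastx_clean implementation: it omits adapter and primer removal, assumes quality scores are already decoded to integers, and applies the rules in one plausible order (end trimming, then truncation at the second N, then the length filter). The function name is hypothetical.

```python
from typing import List, Optional, Tuple


def trim_read(seq: str, quals: List[int], min_q: int = 20,
              min_len: int = 30) -> Optional[Tuple[str, List[int]]]:
    """Apply the quality, N and length rules to one read; return None if discarded."""
    # Remove bases with quality < min_q from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # Cut the read from the second unknown nucleotide (N) to its end.
    first_n = seq.find("N")
    if first_n != -1:
        second_n = seq.find("N", first_n + 1)
        if second_n != -1:
            seq, quals = seq[:second_n], quals[:second_n]

    # Discard reads shorter than min_len after trimming.
    if len(seq) < min_len:
        return None
    return seq, quals
```

A read that comes back as None would be dropped from the cleaned set; how its mate is handled is not specified here and is left out of the sketch.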

Remove the reads and their mates that map onto run quality control sequences (Enterobacteria phage PhiX174 genome, Data Citation 1: GenBank NC_001422.1) using the SOAP aligner.
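Removing "the reads and their mates" means a pair is dropped as soon as either mate aligns to the control genome. The sketch below shows that pair-level logic only; it does not run or parse SOAP. It assumes the names of the mapped reads have already been extracted from the aligner output and that mates are named with "/1" and "/2" suffixes, both of which are assumptions for illustration.

```python
def pair_id(read_name: str) -> str:
    """Strip an assumed '/1' or '/2' mate suffix to get the pair identifier."""
    for suffix in ("/1", "/2"):
        if read_name.endswith(suffix):
            return read_name[: -len(suffix)]
    return read_name


def drop_control_pairs(pairs, mapped_read_names):
    """Keep only pairs in which neither mate mapped to the control sequence.

    pairs: iterable of (read1, read2), each read being (name, sequence, quality).
    mapped_read_names: names of reads reported as mapped (hypothetical input).
    """
    contaminated = {pair_id(name) for name in mapped_read_names}
    return [(r1, r2) for r1, r2 in pairs if pair_id(r1[0]) not in contaminated]
```

The same pair-level rule applies to the ribosomal filter used for metatranscriptomic data, except that flagged pairs are written to a separate file instead of being discarded.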

For data generated by sequencing metatranscriptomic libraries, apply a specific filter to remove ribosomal reads. In the Tara Oceans project, reads and their mates that map onto a database of ribosomal sequences are filtered using SortMeRNA v 1.0 (ref.), a biological sequence analysis tool for filtering, mapping and OTU-picking NGS reads. It contains different rRNA databases, and we use it to split the data into two files: rRNA reads in one file (ribo_clean) and all other reads in another (noribo_clean).

Data quality control

Estimate duplicate sequence rates from single and paired sequences on raw reads and on cleaned reads (after the filtering steps) using fastx_estimate_duplicate (Code availability 5), software based on the FASTX library.
Note
The following steps (4, 5 and 6) are performed on a random subset of 20,000 reads.
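As an illustration of the two ideas in this step, the sketch below draws a random subset of 20,000 reads in a single pass (reservoir sampling) and estimates a duplicate rate by exact sequence matching. How fastx_estimate_duplicate actually samples reads and defines a duplicate is not described here, so both choices are assumptions, as are the function names and the fixed seed.

```python
import random
from collections import Counter


def reservoir_sample(items, k=20_000, seed=0):
    """Uniform random subset of up to k items from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = item
    return sample


def duplicate_rate(keys):
    """Fraction of entries that repeat an earlier, identical key.

    Use sequences as keys for single reads, or (read1, read2) tuples for pairs.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total
```

For paired data, passing `(seq1, seq2)` tuples counts a pair as duplicated only when both mates are identical to those of another pair.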

Perform taxonomic assignment by aligning a subset of 20,000 reads against the nt database (http://www.ncbi.nlm.nih.gov/nucleotide) with Mega BLAST (Blast 2.2.15 suite) and processing the alignments with Megan software (version 3.9).

Merge read pairs with fastx_mergepairs (Code availability 6), software based on the FASTX library. Extract the first 36 nucleotides of read2 and align this seed to read1. Launch merging if the alignment is at least 15 nucleotides long, with fewer than 4 mismatches and at least 90% identity. For each overlapping position, retain the nucleotide with the higher quality.
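The acceptance thresholds and the per-position rule can be written down compactly. The sketch below covers only those two pieces and leaves out the seed alignment itself and any strand handling, which depend on fastx_mergepairs internals not described here. It assumes an ungapped alignment, with identity computed as matches over aligned length, and resolves quality ties in favour of read1; these are illustrative assumptions.

```python
def should_merge(aligned_len: int, mismatches: int) -> bool:
    """Apply the thresholds: length >= 15, mismatches < 4, identity >= 90%."""
    if aligned_len < 15:
        return False
    identity = 100.0 * (aligned_len - mismatches) / aligned_len
    return mismatches < 4 and identity >= 90.0


def consensus(base1: str, q1: int, base2: str, q2: int):
    """Keep the base with the higher quality (ties go to read1, an assumption)."""
    return (base1, q1) if q1 >= q2 else (base2, q2)


def merge_overlap(seq1, quals1, seq2, quals2):
    """Merge two already-aligned, equal-length overlap segments position by position."""
    merged = [consensus(b1, q1, b2, q2)
              for b1, q1, b2, q2 in zip(seq1, quals1, seq2, quals2)]
    return "".join(b for b, _ in merged), [q for _, q in merged]
```

Note that the three thresholds interact: a 15-nucleotide alignment with 2 mismatches passes the mismatch limit but fails the 90% identity requirement.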

Final data quality report

Calculate read size, quality values, N positions and base composition, and detect known adapter sequences, before (raw reads) and after (cleaned reads) filtering. Evaluate each dataset using the specific toolboxes generated from this pipeline (see the Technical validation paragraph in the paper).
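A minimal sketch of the kind of summary this report contains is given below. It computes read count, mean length, mean quality, base composition and the positions at which N occurs, and it skips adapter detection. It is not the Genoscope reporting code; the function name and the output layout are assumptions for illustration, and qualities are assumed to be decoded integers.

```python
from collections import Counter
from statistics import mean


def read_stats(reads):
    """Summarize a list of (sequence, quality_scores) tuples."""
    lengths = [len(seq) for seq, _ in reads]
    composition = Counter()   # base -> count over all reads
    n_positions = Counter()   # 0-based read position -> number of Ns seen there
    all_quals = []
    for seq, quals in reads:
        composition.update(seq)
        for i, base in enumerate(seq):
            if base == "N":
                n_positions[i] += 1
        all_quals.extend(quals)
    total = sum(composition.values())
    return {
        "reads": len(reads),
        "mean_length": mean(lengths) if lengths else 0.0,
        "mean_quality": mean(all_quals) if all_quals else 0.0,
        "base_fraction": {b: c / total for b, c in composition.items()},
        "n_positions": dict(n_positions),
    }
```

Running the same function on the raw and on the cleaned subset gives the before and after comparison the step asks for.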
