Jan 09, 2020


Sequencing and data quality control
  • Adriana Alberti1,
  • Julie Poulain2,
  • Stefan Engelen2,
  • Karine Labadie2,
  • Sarah Romac3,4,
  • Isabel Ferrera5,
  • Guillaume Albini2,
  • Jean-Marc Aury2,
  • Caroline Belser2,
  • Alexis Bertrand2,
  • Corinne Cruaud2,
  • Corinne Da Silva2,
  • Carole Dossat2,
  • Frédéric Gavory2,
  • Shahinaz Gas2,
  • Julie Guy2,
  • Maud Haquelle2,
  • E'krame Jacoby2,
  • Olivier Jaillon2,6,7,
  • Arnaud Lemainque2,
  • Eric Pelletier2,
  • Gaëlle Samson2,
  • Marc Wessner2,
  • Genoscope Technical Team2,
  • Silvia G. Acinas5,
  • Marta Royo-Llonch5,
  • Francisco M. Cornejo-Castillo5,
  • Ramiro Logares5,
  • Beatriz Fernández-Gómez5,8,9,
  • Chris Bowler10,
  • Guy Cochrane11,
  • Clara Amid11,
  • Petra Ten Hoopen11,
  • Colomban De Vargas3,4,
  • Nigel Grimsley12,13,
  • Elodie Desgranges12,13,
  • Stefanie Kandels-Lewis14,15,
  • Hiroyuki Ogata16,
  • Nicole Poulton17,
  • Michael E. Sieracki17,18,
  • Ramunas Stepanauskas17,
  • Matthew B. Sullivan19,20,
  • Jennifer R. Brum20,21,
  • Melissa B. Duhaime22,
  • Bonnie T. Poulos23,
  • Bonnie L. Hurwitz24,
  • Stéphane Pesant25,26,
  • Eric Karsenti10,14,27,
  • Patrick Wincker2,6,7
  • 1CEA, Institut de Biologie Intégrative de la Cellule;
  • 2CEA - Institut de Biologie François Jacob, Genoscope, Evry, France;
  • 3CNRS, UMR 7144, Station Biologique de Roscoff, France;
  • 4Sorbonne Universités, UPMC Univ Paris 06, UMR 7144, Station Biologique de Roscoff, France;
  • 5Departament de Biologia Marina i Oceanografia, Institute of Marine Sciences (ICM), CSIC, Barcelona, Spain;
  • 6CNRS, UMR 8030, Evry , France;
  • 7Université d'Evry, UMR 8030, Evry, France;
  • 8FONDAP Center for Genome Regulation, Santiago, Chile;
  • 9Laboratorio de Bioinformática y Expresión Génica, Instituto de Nutrición y Tecnología de los Alimentos (INTA), Universidad de Chile, El Libano Macul, Santiago, Chile;
  • 10Ecole Normale Supérieure, PSL Research University, Institut de Biologie de l’Ecole Normale Supérieure (IBENS), CNRS UMR 8197, INSERM U1024, Paris, France;
  • 11European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genomes Campus, Hinxton, Cambridge , UK;
  • 12CNRS UMR 7232, BIOM, Banyuls-sur-Mer, France;
  • 13Sorbonne Universités Paris 06, OOB UPMC, Banyuls-sur-Mer , France;
  • 14Directors’ Research European Molecular Biology Laboratory, Heidelberg, Germany;
  • 15Structural and Computational Biology, European Molecular Biology Laboratory, Heidelberg, Germany;
  • 16Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan;
  • 17Bigelow Laboratory for Ocean Sciences, East Boothbay, Maine, USA;
  • 18National Science Foundation, Arlington, Virginia, USA;
  • 19Departments of Microbiology and Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, Ohio, USA;
  • 20Department of Microbiology, The Ohio State University, Columbus, Ohio, USA;
  • 21Present address: Department of Oceanography and Coastal Sciences, Louisiana State University, Baton Rouge, Louisiana, USA;
  • 22Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA;
  • 23University of Arizona, Tucson, Arizona, USA;
  • 24Department of Agricultural and Biosystems Engineering, University of Arizona, Tucson, Arizona, USA;
  • 25MARUM, Center for Marine Environmental Sciences, University of Bremen, Germany;
  • 26PANGAEA, Data Publisher for Earth and Environmental Science, University of Bremen, Germany;
  • 27Sorbonne Universités, UPMC Université Paris 06, CNRS, Laboratoire d’oceanographie de Villefranche (LOV), Observatoire Océanologique, Villefranche-sur-mer, France
  • Tara Oceans
Protocol Citation: Adriana Alberti, Julie Poulain, Stefan Engelen, Karine Labadie, Sarah Romac, Isabel Ferrera, Guillaume Albini, Jean-Marc Aury, Caroline Belser, Alexis Bertrand, Corinne Cruaud, Corinne Da Silva, Carole Dossat, Frédéric Gavory, Shahinaz Gas, Julie Guy, Maud Haquelle, E'krame Jacoby, Olivier Jaillon, Arnaud Lemainque, Eric Pelletier, Gaëlle Samson, Marc Wessner, Genoscope Technical Team, Silvia G. Acinas, Marta Royo-Llonch, Francisco M. Cornejo-Castillo, Ramiro Logares, Beatriz Fernández-Gómez, Chris Bowler, Guy Cochrane, Clara Amid, Petra Ten Hoopen, Colomban De Vargas, Nigel Grimsley, Elodie Desgranges, Stefanie Kandels-Lewis, Hiroyuki Ogata, Nicole Poulton, Michael E. Sieracki, Ramunas Stepanauskas, Matthew B. Sullivan, Jennifer R. Brum, Melissa B. Duhaime, Bonnie T. Poulos, Bonnie L. Hurwitz, Stéphane Pesant, Eric Karsenti, Patrick Wincker 2020. Sequencing and data quality control. protocols.io https://dx.doi.org/10.17504/protocols.io.qwjdxcn
Manuscript citation:
Alberti, A. (2017). Viral to metazoan marine plankton nucleotide sequences from the Tara Oceans expedition. Scientific Data 4, 170093. doi: 10.1038/sdata.2017.93
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: June 11, 2018
Last Modified: January 09, 2020
Protocol Integer ID: 12971
Keywords: viral to metazoan marine plankton nucleotide sequence, metazoan marine plankton nucleotide sequence, tara oceans expedition, data quality control this protocol, nucleic acids to sequence, data quality control, oceans expedition, nucleic acid, protocol describe, sequence, viral, data, tara, protocol, overview of experimental pipeline
Abstract
This protocol describes the sequencing and data quality control procedures used for the Tara Oceans expedition. It is part of "Viral to metazoan marine plankton nucleotide sequences from the Tara Oceans expedition".


Figure 3: Overview of experimental pipeline from nucleic acids to sequences. (Red crosses highlight QC steps where experiments can be stopped.)
Guidelines
1. Sequencing library quality control
All libraries were quantified first by Qubit dsDNA HS Assay measurement and then by qPCR with the KAPA Library Quantification Kit for Illumina Libraries (Kapa Biosystems) on an MXPro instrument (Agilent Technologies). Library profiles were assessed using the DNA High Sensitivity LabChip kit on an Agilent Bioanalyzer. Later, this quality control step was performed by PicoGreen quantification in 96-well plates and by library profile analysis on a high-throughput microfluidic capillary electrophoresis system (LabChip GX, Perkin Elmer, Waltham, MA).
2. Sequencing procedures
Library concentrations were normalized to 10 nM by addition of 10 mM Tris-Cl, pH 8.5, and the libraries were then submitted to cluster generation according to the Illumina cBot User Guide (Part # 15006165). Libraries were sequenced on Genome Analyzer IIx, HiSeq2000 or HiSeq2500 instruments (Illumina) in paired-end mode. Read lengths were chosen to produce data suited to the needs of the downstream bioinformatics analyses (Table 2).
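The normalization to 10 nM follows the usual dilution relation C1 × V1 = C2 × V2. The snippet below is a minimal sketch of that arithmetic; the function name and the example stock concentration and volume are illustrative assumptions, not values from the protocol.

```python
def diluent_volume_ul(stock_nm, stock_volume_ul, target_nm=10.0):
    """Volume of buffer (in µL) to add so a library reaches target_nm.

    Uses C1 * V1 = C2 * V2. The 10 nM default comes from the protocol;
    the function name and the inputs are hypothetical.
    """
    if stock_nm < target_nm:
        raise ValueError("library is already below the target concentration")
    final_volume_ul = stock_nm * stock_volume_ul / target_nm
    return final_volume_ul - stock_volume_ul


# Example: 4 µL of a 25 nM library needs 6 µL of buffer to reach 10 nM.
print(diluent_volume_ul(25.0, 4.0))
```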
Table 2: Summary of libraries generated from Tara Oceans DNA and RNA samples and sequencing experiments performed on each type of library.


Metabarcoding and metatranscriptomic libraries were characterized by low-diversity sequences at the beginning of the reads, due respectively to the primer sequences used to amplify the 18S and 16S tags and to the low-complexity polynucleotides added during cDNA synthesis. Low-diversity libraries can interfere with correct cluster identification, resulting in a drastic loss of data output. Therefore, the loading concentrations of these libraries (8–9 pM instead of 12–14 pM for standard libraries) and the PhiX DNA spike-in (10% instead of 1%) were adapted to minimize the impact on run quality.
Sequencing was performed according to the Genome Analyzer IIx User Guide (Part # 15018814), HiSeq2000 System User Guide (Part # 15011190) and HiSeq2500 System User Guide (Part # 15035786).
Data quality control and filtering
The first step of the data quality control process was the primary analysis performed during the sequencing run by the Illumina Real Time Analysis (RTA) software (Code availability 1). This tool analyses images and cluster intensities and filters them to remove low-quality data. It also performs base calling and calculates the Phred quality score (Q score), which indicates the probability that a given base is called incorrectly. The Q score is the most common metric used to assess the accuracy of a sequencing experiment (http://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf). After the raw BCL files generated by RTA were converted to demultiplexed fastq data by the Illumina bcl2fastq Conversion software (Code availability 2), in-house filtering and quality control treatments developed at Genoscope were applied to the reads that passed the Illumina quality filters (named raw reads). The parameters of these controls are indicated in Figure 2.
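The Q score is related to the base-calling error probability P by Q = -10 × log10(P), so Q20 corresponds to a 1% error rate and Q30 to 0.1%. The short sketch below illustrates the conversion and the decoding of a fastq quality string. The ASCII offset of 33 is an assumption (older Illumina pipelines used an offset of 64), and the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual, offset=33):
    """Decode a fastq quality string into Phred scores.

    offset=33 is an assumption; check the encoding of the actual files.
    """
    return [ord(char) - offset for char in qual]


print(phred_to_error_prob(20))     # error probability at the Q20 threshold
print(decode_fastq_quality("I5"))  # two bases, Phred+33 encoded
```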


Figure 2: Data processing flowchart.
This processing yields high-quality data and improves subsequent analyses.
Filtering steps were applied to all raw reads, as shown in Steps 1-3.
Data quality control was performed on random subsets of 20,000 reads before (raw reads) and/or after the filtering steps (clean reads), as shown in Steps 4-8.
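Drawing a random subset of 20,000 reads from a large fastq file can be done in a single pass with reservoir sampling. The sketch below is one possible way to do this, not the Genoscope implementation; the function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def reservoir_sample(records, k=20000, seed=None):
    """Return k records chosen uniformly at random from an iterable.

    Reads the input once and keeps at most k records in memory
    (Algorithm R). If the input has fewer than k records, all are returned.
    """
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)
        else:
            # Replace a stored record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


subset = reservoir_sample(range(1_000_000), k=20000, seed=42)
print(len(subset))
```

For paired-end data, each record would be a (read1, read2) pair so that mates stay together.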
Code availability
3. Fastx_clean software, http://www.genoscope.cns.fr/fastxtend
5. fastx_estimate_duplicate software, http://www.genoscope.cns.fr/fastxtend
6. fastx_mergepairs software, http://www.genoscope.cns.fr/fastxtend
Filtering
Remove the sequences of the Illumina adapters and primers used during library construction from the whole reads. Remove low-quality nucleotides (quality value < 20) from both ends. Keep the longest sequence without adapters and low-quality bases. Trim each read from the second unknown nucleotide (N) to the end of the read. Discard reads shorter than 30 nucleotides after trimming.
Note
These trimming steps are performed with fastx_clean (Code availability 3), a program based on the FASTX library (Code availability 4).
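The trimming rules above can be sketched in a few lines of Python. This is an illustration only, not the fastx_clean code: adapter and primer removal is left out, the order of the operations is an assumption, and the function name is hypothetical. Only the thresholds (Q < 20, second N, 30 nucleotides) come from the protocol.

```python
def trim_read(seq, quals, min_q=20, min_len=30):
    """Trim one read; return (seq, quals) or None if the read is discarded.

    seq   : nucleotide string
    quals : list of Phred scores, same length as seq
    """
    # 1. Remove low-quality bases (score < min_q) from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # 2. Cut the read from the second N to its end.
    first_n = seq.find("N")
    if first_n != -1:
        second_n = seq.find("N", first_n + 1)
        if second_n != -1:
            seq, quals = seq[:second_n], quals[:second_n]

    # 3. Discard reads that are too short after trimming.
    if len(seq) < min_len:
        return None
    return seq, quals
```

For example, a 50-nucleotide read whose first and last five bases have a quality of 10 is reduced to its 40 central bases, while a 20-nucleotide read is discarded whatever its quality.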
Remove the reads and their mates that map onto run quality control sequences (Enterobacteria phage PhiX174 genome, Data Citation 1: GenBank NC_001422.1) using the SOAP aligner.
Apply a specific filter to remove ribosomal reads from data generated by sequencing metatranscriptomic libraries. In the Tara Oceans project, the reads and their mates that map onto a ribosomal sequence database are filtered using SortMeRNA v 1.0 (ref.), a biological sequence analysis tool for filtering, mapping and OTU-picking NGS reads. It contains several rRNA databases, and we use it to split the data into two files: rRNA reads in one file (ribo_clean) and all other reads in another file (noribo_clean).
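Both the PhiX filter and the ribosomal filter act on pairs: if either mate maps, both are set aside. A minimal sketch of that pair-aware split is given below, assuming the aligner or SortMeRNA output has already been reduced to a set of matching read identifiers. The function name and the (identifier, sequence) record layout are hypothetical.

```python
def split_pairs_by_hits(pairs, hit_ids):
    """Split read pairs into (matched, unmatched) lists.

    pairs   : iterable of (read1, read2), each read being (identifier, sequence)
    hit_ids : set of identifiers of reads that mapped to the reference
              (PhiX genome or rRNA database)
    A pair is 'matched' as soon as one of its two mates is in hit_ids.
    """
    matched, unmatched = [], []
    for read1, read2 in pairs:
        if read1[0] in hit_ids or read2[0] in hit_ids:
            matched.append((read1, read2))
        else:
            unmatched.append((read1, read2))
    return matched, unmatched
```

For the ribosomal filter, `matched` would correspond to the ribo_clean file and `unmatched` to the noribo_clean file. For the PhiX filter, only `unmatched` is kept.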
Data quality control
Estimate the rates of duplicated sequences from single and paired sequences on raw reads and on cleaned reads (after the filtering steps), using fastx_estimate_duplicate (Code availability 5), a program based on the FASTX library.
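As an illustration of what a duplication rate measures, the sketch below counts exact duplicates. It is an assumption that exact matching is a reasonable approximation; the method actually used by fastx_estimate_duplicate is not described here, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_rate(keys):
    """Fraction of records that are redundant copies of an earlier record.

    keys : iterable of hashable items, e.g. read sequences for the
           single-read estimate, or (seq1, seq2) tuples for the paired one.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


# Single reads: one of the four sequences is a repeat.
print(duplicate_rate(["ACGT", "ACGT", "CCCC", "GGGG"]))
# Paired reads: a pair is a duplicate only if both mates are identical.
print(duplicate_rate([("ACGT", "TTTT"), ("ACGT", "TTTT"), ("ACGT", "GGGG")]))
```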
Note
The following steps (4, 5 and 6) are performed on a random subset of 20,000 reads.
Perform taxonomic assignment by aligning a subset of 20,000 reads against the nt database (http://www.ncbi.nlm.nih.gov/nucleotide) with Mega BLAST (Blast 2.2.15 suite) and analysing the results with the Megan software (version 3.9).
Perform the merging step with fastx_mergepairs (Code availability 6), a program based on the FASTX library. Extract the first 36 nucleotides of read2 and align that seed to read1. Merge the pair if the alignment spans at least 15 nucleotides, with fewer than 4 mismatches and at least 90% identity. For each overlapping position, retain the nucleotide with the higher quality.
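The merging criteria can be illustrated with the simplified sketch below. It is not the fastx_mergepairs implementation and rests on several assumptions: read2 is reverse-complemented first, the seed is taken as the first 36 nucleotides of that reverse complement, the alignment is ungapped, and the first acceptable offset is used. Only the thresholds (36, 15, fewer than 4 mismatches, 90%) come from the protocol.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(r1, q1, r2, q2, seed_len=36, min_overlap=15,
               max_mismatch=3, min_identity=0.90):
    """Merge a read pair; return (sequence, qualities) or None.

    r1, r2 : read sequences; q1, q2 : lists of Phred scores.
    """
    rc2, rq2 = revcomp(r2), q2[::-1]   # put read2 on the read1 strand
    seed = rc2[:seed_len]
    for start in range(len(r1) - min_overlap + 1):
        aligned = min(len(seed), len(r1) - start)
        if aligned < min_overlap:
            continue
        mismatches = sum(a != b for a, b in zip(r1[start:start + aligned], seed))
        if mismatches > max_mismatch or (aligned - mismatches) / aligned < min_identity:
            continue
        # Seed accepted: build the merged read.
        overlap = min(len(r1) - start, len(rc2))
        seq, quals = list(r1[:start]), list(q1[:start])
        for i in range(overlap):
            # Keep the base with the higher quality at each overlapping position.
            if q1[start + i] >= rq2[i]:
                seq.append(r1[start + i])
                quals.append(q1[start + i])
            else:
                seq.append(rc2[i])
                quals.append(rq2[i])
        seq.extend(rc2[overlap:])
        quals.extend(rq2[overlap:])
        return "".join(seq), quals
    return None
```

With two 40-nucleotide reads from a 60-nucleotide fragment, the 20 overlapping positions are resolved by quality and the merged read recovers the full fragment.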
Final data quality report
Calculate read sizes, quality values, N positions and base composition, and detect known adapter sequences, before (raw reads) and after filtering the reads (cleaned reads). Evaluate each dataset using specific toolboxes generated from this pipeline (see Technical validation paragraph in paper).
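Several of the metrics listed above can be computed in one pass over the reads. The sketch below is an illustration under assumed data structures, not the Genoscope toolbox. It covers read size, mean quality, N positions and base composition, and leaves out adapter detection. The function name and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def summarize_reads(reads):
    """Summarize an iterable of (sequence, list of Phred scores) records."""
    lengths = Counter()       # read length -> number of reads
    base_counts = Counter()   # nucleotide -> number of occurrences
    n_positions = Counter()   # 0-based position -> number of N calls
    qual_sum = 0
    qual_count = 0
    for seq, quals in reads:
        lengths[len(seq)] += 1
        base_counts.update(seq)
        for position, base in enumerate(seq):
            if base == "N":
                n_positions[position] += 1
        qual_sum += sum(quals)
        qual_count += len(quals)
    total_bases = sum(base_counts.values())
    return {
        "read_lengths": dict(lengths),
        "base_composition": {b: c / total_bases for b, c in base_counts.items()},
        "n_positions": dict(n_positions),
        "mean_quality": qual_sum / qual_count if qual_count else 0.0,
    }
```

Running the same function on the raw reads and on the cleaned reads gives the before and after comparison described in this step.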