Step 1: remove low-quality reads:
We first calculate the accuracy probability of each base using the following equations:
where Q is the Phred quality score of each base and E is the error probability of each base.
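For reference, under the standard Phred definition these quantities are related as follows. The symbol P for the per-base accuracy probability is an assumption introduced here for illustration.

```latex
E = 10^{-Q/10}, \qquad P = 1 - E
```

For example, Q = 20 corresponds to E = 0.01 and P = 0.99, and Q = 30 corresponds to E = 0.001 and P = 0.999.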
We then calculate the overall accuracy probability (OA) of each read using the following equation:
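A natural reading, assumed here rather than confirmed by the text, is that OA is the product of the per-base accuracy probabilities over the n bases considered. The index i and the length n are notation introduced for this sketch, and the way the parameter m enters the calculation is not covered by it.

```latex
\mathrm{OA} = \prod_{i=1}^{n} \left(1 - E_i\right) = \prod_{i=1}^{n} \left(1 - 10^{-Q_i/10}\right)
```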
For each read, an initial 30-mer seed sequence is selected at the 5’ end of the read and its overall accuracy, denoted OAseed, is calculated. To ensure high data quality, the OAseed threshold is set to 0.9 using m equal to 0, that is, with zero low-quality bases allowed. Once the seed position of the read has been defined, the seed is extended to keep the longest contiguous read fragment whose OA, denoted OAfragment, stays at or above a defined accuracy threshold. In this study, we required OAfragment to be equal to or greater than 0.8, using m equal to 1.
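The seed-and-extend procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is distributed as Perl scripts. It assumes that OA is the product of per-base accuracy probabilities, that a read is discarded when its 5’ seed fails the OAseed threshold, and it does not model the parameter m (the number of low-quality bases allowed). The function names base_accuracy, overall_accuracy, and retained_length are hypothetical.

```python
import math


def base_accuracy(q):
    """Per-base accuracy from a Phred score: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def overall_accuracy(quals):
    """ASSUMPTION: OA is the product of the per-base accuracy probabilities."""
    return math.prod(base_accuracy(q) for q in quals)


def retained_length(quals, seed_len=30, oa_seed=0.9, oa_fragment=0.8):
    """Return the number of 5' bases kept, or 0 if the read is rejected.

    ASSUMPTIONS: the seed is fixed at the 5' end, a failing seed rejects
    the whole read, and the parameter m is not modeled.
    """
    if len(quals) < seed_len:
        return 0  # read is too short to hold a seed
    oa = overall_accuracy(quals[:seed_len])
    if oa < oa_seed:
        return 0  # seed fails the OAseed threshold
    length = seed_len
    # Extend one base at a time; the product can only decrease, so the
    # first base that pushes OA below the threshold ends the fragment.
    for q in quals[seed_len:]:
        extended = oa * base_accuracy(q)
        if extended < oa_fragment:
            break
        oa = extended
        length += 1
    return length
```

Under these assumptions, a read of 30 bases at Q40 followed by 50 bases at Q20 would be trimmed to its first 51 bases, because the 52nd base brings the running product below 0.8.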
The Perl scripts for the overall-accuracy-based QC pipeline are freely available for download and reuse from GitHub (https://github.com/Scelta/OAFilter).