Volume 67 Issue 5
May 2023
YAN Weijun, ZHAO Zhengyi, XIONG Xingchuang. A Study on Quality Assessment of Human Genome Data[J]. Metrology Science and Technology, 2023, 67(5): 31-38. doi: 10.12338/j.issn.2096-9015.2023.0089

A Study on Quality Assessment of Human Genome Data

doi: 10.12338/j.issn.2096-9015.2023.0089
  • Received Date: 2023-03-27
  • Accepted Date: 2023-05-09
  • Revision Received Date: 2023-05-23
  • Available Online: 2023-08-07
  • Publish Date: 2023-05-31
  • Advances in high-throughput sequencing technology have enabled researchers to analyze and process human genome sequencing data in depth. The quality of these data directly affects the credibility of analysis results, so accurate quality assessment is essential for avoiding unnecessary losses and ensuring correct outcomes. Both academia and industry attach great importance to data quality assessment: they have proposed numerous assessment methods and developed many tools, such as FastQC and Qualimap, along with various reference materials and standard reference data that together underpin data quality assessment. However, few studies have systematically surveyed the toolsets used at each assessment stage or summarized their characteristics. Moreover, data quality assessment itself still faces many problems and challenges. To support human genome data assessment, this paper discusses potential solutions to these problems and offers several practical suggestions for reference.