Volume 67 Issue 5
May 2023
YAN Weijun, ZHAO Zhengyi, XIONG Xingchuang. A Study on Quality Assessment of Human Genome Data[J]. Metrology Science and Technology, 2023, 67(5): 31-38. doi: 10.12338/j.issn.2096-9015.2023.0089

A Study on Quality Assessment of Human Genome Data

doi: 10.12338/j.issn.2096-9015.2023.0089
  • Received Date: 2023-03-27
  • Accepted Date: 2023-05-09
  • Rev Recd Date: 2023-05-23
  • Available Online: 2023-08-07
  • Publish Date: 2023-05-31
  • Advances in high-throughput sequencing technology have enabled researchers to analyze and process human genome sequencing data in depth. The quality of these data is a key factor in the credibility of analysis results, so accurate quality assessment is essential for avoiding unnecessary losses and ensuring accurate outcomes. Both academia and industry place great emphasis on data quality assessment: they have proposed numerous assessment methods and developed many tools, such as FastQC and Qualimap, along with various reference materials and standard reference data, which together underpin data quality assessment. However, few studies have systematically investigated the toolsets used at each assessment stage or summarized the characteristics of those toolsets. Moreover, the data quality assessment process still faces numerous issues and challenges. To support human genome data assessment work, this paper explores potential solutions to these problems and offers several practical suggestions for reference.
    Figures(1)  / Tables(7)
