A Study on Quality Assessment of Human Genome Data

YAN Weijun, ZHAO Zhengyi, XIONG Xingchuang

Citation: YAN Weijun, ZHAO Zhengyi, XIONG Xingchuang. A Study on Quality Assessment of Human Genome Data[J]. Metrology Science and Technology, 2023, 67(5): 31-38. doi: 10.12338/j.issn.2096-9015.2023.0089

doi: 10.12338/j.issn.2096-9015.2023.0089
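Funding: National Key Research and Development Program of China (2021YFF0600100).
Details
    About the authors:

    YAN Weijun (b. 1998), graduate student at the National Institute of Metrology, China; research interest: quality assessment of human genome data; email: yanweijun@nim.ac.cn

    Corresponding author:

    XIONG Xingchuang (b. 1979), researcher at the National Institute of Metrology, China; research interest: quality assessment of test and measurement data; email: xiongxch@nim.ac.cn

  • CLC number: TB99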

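  • Abstract: With the development of high-throughput sequencing, researchers can now analyze and process human genome sequencing data in depth, and data quality has become a decisive factor in the credibility of the analysis results. Accurate data quality assessment is therefore a critical step, intended to avoid unnecessary losses and to ensure accurate results. Both academia and industry attach great importance to data quality assessment. They have proposed many assessment methods and developed many tools, such as FastQC and Qualimap, along with various reference materials and standard reference data, all of which strongly support data quality assessment. However, few studies systematically examine the tool sets used at each stage of quality assessment or summarize their characteristics, and the assessment process still faces many problems and challenges. To support work on assessing human genome data, this paper analyzes strategies for addressing these problems in depth and offers practical recommendations for reference.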
  • Figure 1. Comprehensive flowchart of human genome sequencing data quality assessment

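    Table 1. Comparison of evaluation metrics for various pre-alignment quality assessment tools

    Metric | FastQC | fastp | NGS QC Toolkit | HTQC | SolexaQA | SOAPnuke | BIGpre | FastQ Screen
    Total number of reads
    Read length distribution
    Base distribution
    GC content
    Quality score
    Adapter contamination
    Contamination from other species
    Sequence duplication level
    Overrepresented sequences
    k-mer analysis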
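
Several of the metrics listed in Table 1 (total reads, read length distribution, base distribution, GC content, quality score) can be computed directly from FASTQ records. The following is a minimal sketch, not the implementation used by any of the listed tools. It assumes four-line FASTQ records with Phred+33 quality encoding; the function name `fastq_metrics` and the two sample reads are hypothetical.

```python
from collections import Counter
from io import StringIO


def fastq_metrics(handle, phred_offset=33):
    """Summarise a FASTQ stream (hypothetical helper, Phred+33 assumed)."""
    total_reads = 0
    length_dist = Counter()   # read length -> number of reads
    base_counts = Counter()   # base -> number of occurrences
    quality_sum = 0
    base_total = 0
    while True:
        header = handle.readline()
        if not header:            # end of stream
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()         # '+' separator line, ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        total_reads += 1
        length_dist[len(seq)] += 1
        base_counts.update(seq.upper())
        quality_sum += sum(ord(c) - phred_offset for c in qual)
        base_total += len(seq)
    gc = base_counts["G"] + base_counts["C"]
    return {
        "total_reads": total_reads,
        "length_distribution": dict(length_dist),
        "base_counts": dict(base_counts),
        "gc_content": gc / base_total if base_total else 0.0,
        "mean_quality": quality_sum / base_total if base_total else 0.0,
    }


# Two made-up reads, for illustration only.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\n!!!!!!\n"
metrics = fastq_metrics(StringIO(example))
print(metrics)
```

Adapter contamination, cross-species contamination, and k-mer analysis need reference sequences or more elaborate counting, so they are left out of this sketch.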

    Table 2. Characteristics and download links for various pre-alignment quality assessment tools

    Tool | Characteristics | Download link
    FastQC | Relatively comprehensive set of evaluation metrics | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
    HTQC | Assessment and read trimming | https://sourceforge.net/projects/htqc/
    NGS QC Toolkit | Assessment and read trimming | https://github.com/mjain-lab/NGSQCToolkit
    fastp | Paired-end assessment and trimming | https://github.com/OpenGene/fastp
    SolexaQA | Classifies reads by quality | https://solexaqa.sourceforge.net/
    BIGpre | Detects and handles duplicate sequences | http://bigpre.sourceforge.net/
    FastQ Screen | Contamination assessment | https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
    SOAPnuke | MapReduce acceleration | https://github.com/BGI-flexlab/SOAPnuke
    RabbitQC | Makes full use of hardware acceleration | https://github.com/ZekunYin/RabbitQC

    Table 3. Popular tools for short read alignment

    Tool | Advantages | Download link
    Bowtie2 | Supports gapped, local, and paired-end alignment modes | https://github.com/BenLangmead/bowtie2
    BWA-MEM2 | Uses an FM-index with 8x compression | https://github.com/bwa-mem2/bwa-mem2
    Gapped BLAST | Supports gapped alignment | https://blast.ncbi.nlm.nih.gov/Blast.cgi
    Subread | Uses a seed-and-vote strategy | https://subread.sourceforge.net/
    HISAT2 | Uses hierarchical indexing | https://github.com/DaehwanKimLab/hisat2

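    Table 4. Evaluation metrics for various post-alignment quality assessment tools

    Metric | Picard | QPLOT | Qualimap2 | SAMstat | verifyBamID
    Mapping rate
    Insert size
    Sequence duplication level
    Base distribution
    Mapping quality
    Coverage depth
    GC content
    Mismatch rate
    Coverage depth
    Coverage breadth
    Contamination estimate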
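
Three of the Table 4 metrics (mapping rate, sequence duplication level, mapping quality) can be read off the FLAG and MAPQ fields of SAM records. The sketch below is illustrative only and is not how Picard, Qualimap2, or the other listed tools compute their reports. It assumes tab-separated SAM text and the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary); the function name `sam_metrics` and the sample records are hypothetical.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def sam_metrics(lines):
    """Mapping rate, duplicate rate, and mean MAPQ over primary records."""
    total = mapped = duplicates = 0
    mapq_sum = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if flag & UNMAPPED:
            continue
        mapped += 1
        mapq_sum += mapq
        if flag & DUPLICATE:
            duplicates += 1
    return {
        "total_reads": total,
        "mapping_rate": mapped / total if total else 0.0,
        "duplicate_rate": duplicates / mapped if mapped else 0.0,
        "mean_mapq": mapq_sum / mapped if mapped else 0.0,
    }


# Made-up records: two mapped, one unmapped, one duplicate, one secondary.
sam_text = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t16\tchr1\t200\t30\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r4\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r1\t256\tchr2\t500\t0\t4M\t*\t0\t0\tACGT\tIIII\n",
]
m = sam_metrics(sam_text)
print(m)
```

Here the duplicate rate is taken relative to mapped reads; other denominators are possible, and the choice should be stated when results are compared across tools.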

    Table 5. Characteristics and download links for various post-alignment quality assessment tools

    Tool | Characteristics | Download link
    SAMstat | Summary statistics of evaluation metrics | https://samstat.sourceforge.net/
    QPLOT | Summary statistics of evaluation metrics | https://github.com/statgen/qplot
    Qualimap2 | Multi-sample processing | http://qualimap.conesalab.org/
    Picard | User-selectable evaluation metrics | https://github.com/broadinstitute/picard
    verifyBamID | Contamination detection | https://github.com/Griffan/VerifyBamID

    Table 6. Popular tools for variant calling

    Tool | Method | Download link
    VarScan2 | Heuristic method | https://github.com/dkoboldt/varscan
    SomaticSniper | Joint genotype analysis | https://github.com/genome/somatic-sniper
    SAMtools | Joint genotype analysis | https://github.com/samtools/samtools
    Strelka | Allele frequency analysis | https://github.com/target/strelka
    MuTect | Allele frequency analysis | https://github.com/broadinstitute/mutect
    MuTect2 | Haplotype-based model | https://github.com/broadinstitute/gatk
    FreeBayes | Haplotype-based model | https://github.com/freebayes/freebayes
    Strelka2 | Tiered haplotype model | https://github.com/Illumina/strelka

    Table 7. Characteristics and download links for various variant confidence quality assessment tools

    Tool | Characteristics | Download link
    hap.py | Compares VCF files against a benchmark data set | https://github.com/Illumina/hap.py
    rtg-tools | Variant comparison at the haplotype level | https://github.com/RealTimeGenomics/rtg-tools/
    vgraph | Compares genetic variants using variant graphs | https://github.com/bioinformed/vgraph/
    VBT-TrioAnalysis | Variant comparison and Mendelian violation detection | https://github.com/sbg/VBT-TrioAnalysis
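
The idea behind the tools in Table 7, comparing called variants against a benchmark set, can be illustrated with a deliberately naive sketch. It matches variants by exact (chromosome, position, reference allele, alternate allele) tuples and reports precision, recall, and F1. This is a simplification under stated assumptions: the listed tools compare at the haplotype or graph level and handle differing representations of the same variant, which exact matching does not. The function name `compare_calls` and the variants are hypothetical.

```python
def compare_calls(truth, calls):
    """Exact-match comparison of two variant sets (naive illustration)."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the benchmark
    fp = len(calls - truth)   # called but absent from the benchmark
    fn = len(truth - calls)   # in the benchmark but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up variants: (chromosome, 1-based position, ref allele, alt allele).
benchmark = [("chr1", 1000, "A", "G"),
             ("chr1", 2000, "C", "T"),
             ("chr2", 500, "G", "GA")]
called = [("chr1", 1000, "A", "G"),
          ("chr2", 500, "G", "GA"),
          ("chr3", 42, "T", "C")]
result = compare_calls(benchmark, called)
print(result)
```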
Publication history
  • Received:  2023-03-27
  • Accepted:  2023-05-09
  • Revised:  2023-05-23
  • Published online:  2023-08-07
  • Issue date:  2023-05-31
