Study on a Full-Workflow Preprocessing Scheme for Large-Scale MALDI-TOF MS Fingerprint Datasets

    A Comprehensive Data Preprocessing Pipeline for Constructing Large-Scale MALDI-TOF MS Fingerprint Dataset

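    • Abstract: Raw spectra from matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) suffer from poor reproducibility and mass scale drift, and both commercial software and open-source algorithms show limitations in practical use. To address these problems, a full-workflow preprocessing scheme (TOFpipe) was developed to provide optimized technical support for constructing high-quality, large-scale MALDI-TOF MS fingerprint datasets. The scheme covers smoothing and denoising of Profile-mode raw data, baseline correction, mass scale calibration, conversion to Centroid mode, and detection of anomalous fingerprints. TOFpipe uses a wavelet transform-based derivative technique for peak detection and peak width estimation, and introduces a peak fitting strategy based on an exponentially modified Gaussian function with a linear baseline, which achieves efficient denoising and baseline correction while preserving peak profiles and relative intensity relationships as far as possible. A total of 1275 raw MALDI-TOF MS spectra from 12 vegetable oils were acquired and batch-processed with TOFpipe. Compared with conventional peak fitting methods, TOFpipe more effectively avoids “artifacts” and profile distortion when fitting broadened peaks and low signal-to-noise regions. Based on the fitted parameters, TOFpipe robustly calculates peak centroids and peak areas, thereby achieving high-fidelity conversion from Profile to Centroid mode. To handle signal drift, TOFpipe adopts a characteristic peak-based piecewise calibration strategy that precisely corrects shift (drift) and/or stretching of the mass scale. After calibration, the variance explained by the first principal component increased by 4.79%~38.40% for eight of the vegetable oil fingerprint subsets. In addition, TOFpipe uses cosine distance-based multidimensional scaling (MDS) to visualize the sample distribution in reduced dimensions, and successfully identified 2 “atypical” high-oleic sunflower oil fingerprints among 23 sunflower oil samples. Finally, 1200 fingerprints were selected to build the dataset; in the overall evaluation, the different vegetable oil varieties formed well-separated clusters, indicating that TOFpipe can provide a reliable technical prerequisite for constructing high-quality, large-scale MALDI-TOF MS fingerprint datasets.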

       

      Abstract: The algorithms embedded in commercial software and existing open-source toolkits show limitations in dealing with the poor reproducibility and mass scale drift of raw matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) spectra. Hence, a comprehensive data preprocessing pipeline, named TOFpipe, was developed to provide optimized technical support for constructing high-quality MALDI-TOF MS fingerprint datasets. The pipeline covered the full preprocessing of Profile-mode raw spectra: smoothing and denoising, baseline subtraction, mass scale calibration, conversion from Profile mode to Centroid mode, and rapid detection of outliers. TOFpipe employed a wavelet transform-based derivative technique for peak detection and peak width estimation, and introduced a peak fitting strategy that combines an Exponentially Modified Gaussian (EMG) function with a linear baseline. These strategies enabled efficient denoising and baseline subtraction while preserving, as far as possible, the original MS peak profiles and their relative intensity relationships. In this study, TOFpipe was applied to process 1,275 raw MALDI-TOF MS spectra from 12 different vegetable oils. Compared with conventional peak fitting strategies, TOFpipe more effectively avoided “artifacts” and profile distortion, especially when fitting broadened peaks and regions with low signal-to-noise ratios. Based on the fitted parameters, the centroid and area of each MS peak could be calculated robustly, enabling high-fidelity conversion from Profile mode to Centroid mode. Additionally, TOFpipe employed a characteristic peak-based piecewise calibration strategy to precisely correct offset (drift) and/or scaling of the mass scale. After calibration, the variance explained by the first principal component of the MALDI-TOF MS spectral subsets of 12 species of vegetable oils increased by 4.49%~38.40%. Furthermore, TOFpipe employed cosine distance-based Multidimensional Scaling (MDS) analysis to reduce dimensionality and visualize sample distributions, and successfully identified the fingerprints of 2 “atypical” high-oleic sunflower oils among 23 sunflower oil samples. Finally, 1,200 spectra were selected to construct a fingerprint dataset. In the overall evaluation, different species of vegetable oils formed well-separated clusters in the projected space, indicating that TOFpipe can provide a reliable technical prerequisite for the construction of high-quality and highly reliable MALDI-TOF MS fingerprint datasets.
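
The peak model named in the abstract, an EMG function on a linear baseline, can be illustrated with a minimal sketch. This is not TOFpipe's actual code: the function names, the area-normalized EMG parameterization (area, mu, sigma, tau) and the numerical values are assumptions chosen for illustration, and the fitting step itself (for example nonlinear least squares) is omitted. The sketch only shows why a fitted EMG yields the centroid and area directly: for this parameterization the peak mean is mu + tau and the integral equals the area parameter.

```python
import math

def emg(x, area, mu, sigma, tau):
    """Exponentially modified Gaussian, scaled so that its integral equals `area`."""
    z = (sigma / tau - (x - mu) / sigma) / math.sqrt(2.0)
    expo = sigma ** 2 / (2.0 * tau ** 2) - (x - mu) / tau
    return area / (2.0 * tau) * math.exp(expo) * math.erfc(z)

def peak_model(x, area, mu, sigma, tau, slope, intercept):
    """EMG peak on a local linear baseline (slope * x + intercept)."""
    return emg(x, area, mu, sigma, tau) + slope * x + intercept

def centroid_and_area(area, mu, sigma, tau):
    """Closed-form centroid (mean) and area of the baseline-free EMG peak."""
    return mu + tau, area

# Hypothetical fitted parameters for one peak (illustrative values only).
params = dict(area=1000.0, mu=885.0, sigma=0.05, tau=0.08)
centroid, area = centroid_and_area(**params)

# Numerical check on a fine m/z grid: integrate the baseline-free peak.
step = 0.001
xs = [884.0 + i * step for i in range(3001)]
ys = [emg(x, **params) for x in xs]
num_area = sum(ys) * step
num_centroid = sum(x * y for x, y in zip(xs, ys)) / sum(ys)

print(f"closed form: centroid={centroid:.4f}, area={area:.2f}")
print(f"numerical:   centroid={num_centroid:.4f}, area={num_area:.2f}")
```

Because the linear baseline is a separate term in the model, subtracting it leaves the EMG component alone, so the centroid and area come from the fitted peak parameters rather than from noisy raw intensities.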

       
