Neural Network Approach to Matching of Mass Spectral Data

Description:

Researchers at TGen have developed methods and algorithms to improve the yield of identifiable protein sequences from mass spectrometry data. Using a Convolutional Neural Network (CNN) approach, proteins, peptides, and other small molecules can be classified and identified from mass spectrometry data with greater confidence, at high throughput and with high sensitivity.

Mass spectrometry has become the established standard for protein profiling. However, traditional methods of processing proteomics data cannot use a majority of the data (approximately 80% on average) to characterize the molecules in the sample. De novo sequencing is limited because it determines amino acids from spectral data according to a strict list of rules, and the process cannot account for chemical variations. Comparing spectral data against existing databases typically filters most mutations and rare sequences out of the analysis, because those mutations and sequences are not represented in the databases.
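The database limitation described above can be illustrated with a minimal sketch. It is not TGen's method or any search engine's actual algorithm: it matches on precursor mass alone, and the database contents, the tolerance, and the function names (`peptide_mass`, `match_by_mass`) are assumptions chosen for illustration. Only the monoisotopic residue masses are standard values.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056  # mass of H2O added for the peptide termini


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a peptide given as one-letter codes."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def match_by_mass(observed_mass, database, tolerance=0.01):
    """Return database peptides whose mass is within `tolerance` Da."""
    return [p for p in database
            if abs(peptide_mass(p) - observed_mass) <= tolerance]


# Hypothetical reference database of known peptides.
database = ["GASP", "VTLK", "PEAK"]

# A peptide that is in the database is recovered.
known_hits = match_by_mass(peptide_mass("VTLK"), database)

# A K->R variant of the same peptide is absent from the database,
# so its spectrum finds no match and would be dropped from the analysis.
variant_hits = match_by_mass(peptide_mass("VTLR"), database)

print(known_hits)    # ['VTLK']
print(variant_hits)  # []
```

In this toy setting the variant peptide produces a perfectly valid measurement, yet it is discarded simply because no database entry corresponds to it.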

At TGen, researchers took a machine learning approach to improve mass spectral data analysis, using a CNN capable of utilizing multiple mathematical models that iteratively optimize the mapping of inputs to outputs, thus training the network to determine the presence or absence of a molecule or sub-sequence. The CNN has applications in confirming the identity of known, spectrally matched molecules; identifying molecules from an unknown sample; and identifying or further characterizing unknown or unmatched spectra, or variants not found in canonical databases, such as peptides, cyclic peptides, metabolites, amino acids, post-translational modifications, glycans, lipids, and fusion peptides. TGen’s classifier models used in the CNN were validated with at least 93% accuracy, and the CNN is designed with the built-in capability to integrate additional models targeting features such as length, diversity, and frequency, expanding its functionality to determining the complete peptide sequence from a spectrum.
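As a rough illustration of presence/absence classification from a spectrum, the sketch below bins peaks into a fixed-length vector, applies one 1D convolution, and converts the pooled response to a probability. It is an untrained forward pass only, not the patented architecture: the kernel, weight, bias, binning range, and all function names are hypothetical values picked by hand, whereas a real network would learn its parameters from labeled spectra.

```python
import math


def bin_spectrum(peaks, mz_min=0.0, mz_max=100.0, bin_width=1.0):
    """Turn (m/z, intensity) peaks into a fixed-length, max-normalized vector."""
    n_bins = int((mz_max - mz_min) / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            vec[int((mz - mz_min) / bin_width)] += intensity
    top = max(vec)
    return [v / top for v in vec] if top > 0 else vec


def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the core operation of a CNN layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def presence_probability(peaks, kernel, weight, bias):
    """Convolution -> ReLU -> global max pool -> sigmoid output."""
    features = [max(0.0, x) for x in conv1d(bin_spectrum(peaks), kernel)]
    pooled = max(features)
    return 1.0 / (1.0 + math.exp(-(weight * pooled + bias)))


# Hypothetical hand-set parameters: the kernel responds to two peaks
# spaced two bins apart, standing in for a learned fragment pattern.
kernel, weight, bias = [1.0, 0.0, 1.0], 4.0, -6.0

# Toy spectra: the first contains the peak pair, the second does not.
spectrum_with = [(10.2, 50.0), (12.4, 50.0), (40.0, 10.0)]
spectrum_without = [(10.2, 50.0), (30.0, 20.0)]

p_with = presence_probability(spectrum_with, kernel, weight, bias)
p_without = presence_probability(spectrum_without, kernel, weight, bias)
print(p_with > 0.5, p_without > 0.5)  # True False
```

Training would replace the hand-set kernel, weight, and bias with values fitted iteratively so that the predicted probabilities agree with known labels.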

Link to Issued US Patent 11,587,644