L. Nanni1,*, S. Brahnam2
1. DEI, University of Padua, viale Gradenigo 6, Padua, Italy
2. Information Technology and Cybersecurity, Missouri State University, 901 S. National, Springfield, MO 65804, USA
The last decade has witnessed an unprecedented accumulation of proteins in large online databases. This growth has created a need for automatic prediction of protein function, which is essential for the massive and timely annotation of the proteins in these databases. Protein databases, combined with functional annotations and machine learning (ML) techniques, offer many potential benefits, including much faster identification of pharmacological targets. The main objective of this study is to identify, for the problem of enzyme classification, the most powerful combinations of descriptors taken from different protein representations. To achieve this objective, four approaches for representing the Position-Specific Scoring Matrix (PSSM) and three methods for representing the Amino Acid Sequence (AAS) are evaluated in combination, with the aim of experimentally producing a powerful ensemble of descriptors for enzyme function prediction. Each protein descriptor is classified by its own Support Vector Machine (SVM), and the resulting set of SVMs is then combined by sum rule. Cross-validation experiments using these descriptors on single-functional enzymes (n=44,661) extracted from the PDB database demonstrate that the proposed ensemble achieves higher classification rates than the state-of-the-art ML techniques reported in the literature on the same dataset. Although the proposed ensemble strongly outperforms these other techniques, it is computationally much heavier, mainly because the PSSM extraction process is time consuming. However, there is a growing repository of proteins for which the PSSM has already been extracted, which makes the proposed method more practical and attractive. The MATLAB code and the dataset used in the experiments reported here are available at https://github.com/LorisNanni.
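The sum-rule combination mentioned in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the authors' MATLAB, and it is not their implementation: the function name `sum_rule`, the three score tables, and the assumption that every classifier's scores are already on a comparable scale are all hypothetical, chosen only to show the idea of adding per-class scores across descriptor-specific classifiers and picking the class with the largest total.

```python
from typing import List, Sequence


def sum_rule(score_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Fuse classifiers by summing their per-class scores.

    score_sets[k][i][c] is the score that classifier k gives to class c
    for sample i (for example, an SVM decision value). Scores are assumed
    to be on comparable scales; this sketch does no normalization.
    Returns the index of the highest-scoring class for each sample.
    """
    n_samples = len(score_sets[0])
    n_classes = len(score_sets[0][0])
    predictions = []
    for i in range(n_samples):
        # Add up the scores that all classifiers assign to each class.
        fused = [sum(scores[i][c] for scores in score_sets)
                 for c in range(n_classes)]
        # Pick the class with the largest summed score.
        predictions.append(max(range(n_classes), key=fused.__getitem__))
    return predictions


# Hypothetical scores from three descriptor-specific classifiers for
# two samples and three classes (illustrative numbers, not real data).
scores_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
scores_b = [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
scores_c = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]

fused_labels = sum_rule([scores_a, scores_b, scores_c])
print(fused_labels)
```

In this toy input the second classifier alone would assign the first sample to class 1, but the summed scores favor class 0, which shows how the fusion can override a single dissenting classifier.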
Protein classification, Enzyme classification, Support Vector Machine, Position-Specific Scoring Matrix
L. Nanni, S. Brahnam (2020). Set of Approaches Based on Position Specific Scoring Matrix and Amino Acid Sequence for Primary Category Enzyme Classification. Journal of Artificial Intelligence and Systems, 2, 38–52. https://doi.org/10.33969/AIS.2020.21004.