Enzyme Function Prediction with Interpretable Models

Syed, Umar; Yona, Golan

doi:10.1007/978-1-59745-243-4_17


Part of the book series: Methods in Molecular Biology (MIMB, volume 541)


Abstract

Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism.

Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
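The abstract does not spell out how the stochastic trees are built or combined, so the following is only a minimal sketch of one plausible reading: each tree picks its split feature at random with probability proportional to information gain, and the mixture averages the leaf class distributions with uniform weights. The two toy features, the labels "EC1" and "EC3", and all function names are hypothetical and are not taken from the chapter.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(rows, labels, f):
    """Return (gain, threshold) for the best binary split on feature f."""
    base, best = entropy(labels), (0.0, None)
    values = sorted({r[f] for r in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[f] <= t]
        right = [y for r, y in zip(rows, labels) if r[f] > t]
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - rem > best[0]:
            best = (base - rem, t)
    return best

def grow(rows, labels, rng, depth=0, max_depth=3):
    """Grow one stochastic tree (assumption: split feature sampled by gain)."""
    splits = [(f,) + best_threshold(rows, labels, f) for f in range(len(rows[0]))]
    splits = [s for s in splits if s[2] is not None]  # keep informative splits
    if depth == max_depth or not splits:
        n = len(labels)
        return {"leaf": {c: k / n for c, k in Counter(labels).items()}}
    f, _, t = rng.choices(splits, weights=[s[1] for s in splits])[0]
    go_left = [r[f] <= t for r in rows]
    left = grow([r for r, g in zip(rows, go_left) if g],
                [y for y, g in zip(labels, go_left) if g], rng, depth + 1, max_depth)
    right = grow([r for r, g in zip(rows, go_left) if not g],
                 [y for y, g in zip(labels, go_left) if not g], rng, depth + 1, max_depth)
    return {"feature": f, "threshold": t, "left": left, "right": right}

def predict_tree(tree, x):
    """Follow the splits down to a leaf and return its class distribution."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

def predict_mixture(trees, x):
    """Average the leaf distributions of all trees (uniform mixture weights)."""
    total = Counter()
    for tree in trees:
        for c, p in predict_tree(tree, x).items():
            total[c] += p / len(trees)
    return dict(total)

# Toy data: (hypothetical hydrophobic fraction, hypothetical sequence length).
rows = [(0.10, 120), (0.20, 150), (0.15, 130), (0.70, 400), (0.80, 420), (0.75, 390)]
labels = ["EC1", "EC1", "EC1", "EC3", "EC3", "EC3"]

rng = random.Random(0)
trees = [grow(rows, labels, rng) for _ in range(5)]
pred = predict_mixture(trees, (0.12, 125))
print(max(pred, key=pred.get), pred)
```

Because each tree is a readable sequence of threshold tests on named properties, inspecting which features appear near the roots gives the kind of concise family description the abstract mentions. A weighted mixture (for example, weights based on held-out accuracy) would be a natural variation on the uniform average used here.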



Acknowledgments

This work is supported by the National Science Foundation under Grant No. 0133311 to Golan Yona.

Author information

Authors and Affiliations

Department of Computer Science, Princeton University, Princeton, NJ, USA
Umar Syed
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA
Golan Yona
Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel
Golan Yona


Cite this protocol

Syed, U., Yona, G. (2009). Enzyme Function Prediction with Interpretable Models. In: Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R., McDermott, J. (eds) Computational Systems Biology. Methods in Molecular Biology, vol 541. Humana Press. https://doi.org/10.1007/978-1-59745-243-4_17


DOI: https://doi.org/10.1007/978-1-59745-243-4_17
Published: 10 March 2009
Publisher Name: Humana Press
Print ISBN: 978-1-58829-905-5
Online ISBN: 978-1-59745-243-4
eBook Packages: Springer Protocols
