
Enzyme Function Prediction with Interpretable Models

  • Protocol
Computational Systems Biology

Part of the book series: Methods in Molecular Biology ((MIMB,volume 541))

Abstract

Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism.

Here we study the direct relationship between basic properties of proteins and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number) that can complement and support other techniques based on sequence or structure information. To define this mapping, we collected a set of 453 features and properties that characterize proteins and are believed to be related to their structural and functional aspects. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
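To make the idea of a mixture of stochastic decision trees concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each tree picks its split feature at random with probability proportional to information gain, and that the mixture averages leaf class distributions with uniform weights; both choices, along with the function names, the two toy features, and the class labels "A" and "B", are illustrative assumptions that do not come from the chapter.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(rows, labels, f):
    """Return (gain, threshold) for the best binary split on feature f."""
    base, n = entropy(labels), len(labels)
    best = (0.0, None)
    # Dropping the largest value guarantees both sides of the split are non-empty.
    for t in sorted({r[f] for r in rows})[:-1]:
        left = [y for r, y in zip(rows, labels) if r[f] <= t]
        right = [y for r, y in zip(rows, labels) if r[f] > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if gain > best[0]:
            best = (gain, t)
    return best

def grow(rows, labels, rng, depth=0, max_depth=4):
    """Grow one stochastic tree (assumption: split feature sampled in proportion to gain)."""
    splits = [(f, *best_threshold(rows, labels, f)) for f in range(len(rows[0]))]
    splits = [(f, g, t) for f, g, t in splits if t is not None]
    if depth >= max_depth or not splits:
        n = len(labels)
        return {"leaf": {c: k / n for c, k in Counter(labels).items()}}
    f, _, t = rng.choices(splits, weights=[g for _, g, _ in splits])[0]
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "f": f,
        "t": t,
        "lo": grow([r for r, _ in lo], [y for _, y in lo], rng, depth + 1, max_depth),
        "hi": grow([r for r, _ in hi], [y for _, y in hi], rng, depth + 1, max_depth),
    }

def tree_predict(node, row):
    """Follow the splits down to a leaf and return its class distribution."""
    while "leaf" not in node:
        node = node["hi"] if row[node["f"]] > node["t"] else node["lo"]
    return node["leaf"]

def mixture_predict(trees, row):
    """Average leaf distributions over all trees (assumption: uniform mixture weights)."""
    total = Counter()
    for tree in trees:
        for c, p in tree_predict(tree, row).items():
            total[c] += p / len(trees)
    return dict(total)

# Hypothetical toy data: (sequence length, mean hydrophobicity) -> family label.
rows = [(120, 0.31), (135, 0.28), (410, 0.62), (390, 0.66), (150, 0.35), (430, 0.58)]
labels = ["A", "A", "B", "B", "A", "B"]

rng = random.Random(0)
trees = [grow(rows, labels, rng) for _ in range(25)]
probs = mixture_predict(trees, (400, 0.60))
print(max(probs, key=probs.get), probs)
```

Because each tree is a short sequence of threshold tests on named properties, the splits that recur across the sampled trees can be read directly, which is the sense in which such a model is interpretable.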



Acknowledgments

This work is supported by the National Science Foundation under Grant No. 0133311 to Golan Yona.

Copyright information

© 2009 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Syed, U., Yona, G. (2009). Enzyme Function Prediction with Interpretable Models. In: Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R., McDermott, J. (eds) Computational Systems Biology. Methods in Molecular Biology, vol 541. Humana Press. https://doi.org/10.1007/978-1-59745-243-4_17

  • DOI: https://doi.org/10.1007/978-1-59745-243-4_17

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-58829-905-5

  • Online ISBN: 978-1-59745-243-4

  • eBook Packages: Springer Protocols
