Skip to main content

Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree

  • Conference paper
  • First Online:
Advances in Computational Biology

Part of the book series: Advances in Experimental Medicine and Biology ((AEMB,volume 680))

Abstract

Compared to a mismatched consensus motif, a degenerate consensus motif is more suitable for modeling position-specific variations within motifs. In the literature, the state-of-art methods using degenerate consensus motifs for de novo motif finding use a naïve enumeration algorithm, which is far from efficient. In this paper, we propose an efficient algorithm to extract maximal degenerate consensus motifs from a set of sequences based on a compact suffix tree. Our algorithm achieved a time complexity about \( n \) times lower than that of a naïve enumeration, where \( n \) is the average length of source sequences. To demonstrate the efficiency and effectiveness of our proposed algorithm, we applied it to finding transcription factor binding sites. It is validated on a popular benchmark proposed by Tompa. The executable files of our algorithm can be accessed through http://hpc.cs.tsinghua.edu.cn/bioinfo.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 219.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 279.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 379.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    A tree where each edge is labeled by a single symbol.

  2. 2.

    The original statement is: \( O({n^2}N(N - k + 1)\left( \begin{array}{l} k \\g \end{array} \right)) \), since they use a specific \( {\Sigma_E} \).

  3. 3.

    Where multiple runs were conducted, each with a fixed length within the length range.

References

  1. Stormo, G.D.: ‘DNA binding sites: representation and discovery’, Bioinformatics, 2000, 16, (1), pp. 16–23

    Article  PubMed  CAS  Google Scholar 

  2. Bussemaker, H.J., Li, H., and Siggia, E.D.: ‘From the cover: building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis’, Proc Natl Acad Sci USA, 2000, 97, (18), pp. 10096–10100

    Article  PubMed  CAS  Google Scholar 

  3. Sinha, S., and Tompa, M.: ‘Discovery of novel transcription factor binding sites by statistical overrepresentation’, Nucleic Acids Res, 2002, 30, (24), pp. 5549–5560

    Article  PubMed  CAS  Google Scholar 

  4. Sagot, M-F.: ‘Spelling approximate repeated or common motifs using a suffix tree’. In ‘Proceedings of the 1998 3rd Latin American Symposium, Apr 20–24 1998’ (1998), p. 374

    Google Scholar 

  5. Marsan, L., and Sagot, M-F.: ‘Extracting structured motifs using a suffix tree – algorithms and application to promoter consensus identification’. In ‘Extracting structured motifs using a suffix tree – algorithms and application to promoter consensus identification’ (ACM, 2000), pp. 210–219

    Google Scholar 

  6. Pavesi, G., Mauri, G., and Pesole, G.: ‘An algorithm for finding signals of unknown length in DNA sequences’, Bioinformatics, 2001, 17, (suppl 1), pp. S207–S214

    Article  PubMed  Google Scholar 

  7. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C.: ‘Detecting subtle sequence signals. A Gibbs sampling strategy for multiple alignment’, Science, 1993, 262, (5131), p. 208

    Article  PubMed  CAS  Google Scholar 

  8. Bailey, T.L., and Elkan, C.: ‘Fitting a mixture model by expectation maximization to discover motifs in biopolymers’, Proc Int Conf Intell Syst Mol Biol, 1994, 2, pp. 28–36

    PubMed  CAS  Google Scholar 

  9. Hertz, G., and Stormo, G.: ‘Identifying DNA and protein patterns with statistically significant alignments of multiple sequences’, Bioinformatics, 1999, 15, (7), pp. 563–577

    Article  PubMed  CAS  Google Scholar 

  10. Vishnevsky, O.V., and Kolchanov, N.A.: ‘ARGO: a web system for the detection of degenerate motifs and large-scale recognition of eukaryotic promoters’, Nucleic Acids Res, 2005, 33, (suppl 2), pp. W417–W422

    Article  PubMed  CAS  Google Scholar 

  11. Peng, C.H., Hsu, J.T., Chung, Y.S., Lin, Y.J., Chow, W.Y., Hsu, D.F., and Tang, C.Y.: ‘Identification of degenerate motifs using position restricted selection and hybrid ranking combination’, Nucleic Acids Res, 2006, 34, (22), pp. 6379–6391

    Article  PubMed  CAS  Google Scholar 

  12. Sandve, G.K., Abul, O., Walseng, V., et al.: ‘Improved benchmarks for computational motif discovery,’ BMC Bioinformatics, 8, p. 193, 2007

    Article  PubMed  Google Scholar 

  13. Weiner, P.: ‘Linear pattern matching algorithms’. In ‘Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973), Volume 001973’, pp. 1–11

    Google Scholar 

  14. Tompa, M., Li, N., Bailey, T.L., et al.: ‘Assessing computational tools for the discovery of transcription factor binding sites’, Nat Biotechnol, 2005, 23, (1), pp. 137–144

    Article  PubMed  CAS  Google Scholar 

  15. Ukkonen, E.: ‘On-line construction of suffix trees’, Algorithmica (New York), 1995, 14, (3), p. 249

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hongshan Jiang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer Science+Business Media, LLC

About this paper

Cite this paper

Jiang, H., Zhao, Y., Chen, W., Zheng, W. (2010). Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree. In: Arabnia, H. (eds) Advances in Computational Biology. Advances in Experimental Medicine and Biology, vol 680. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-5913-3_3

Download citation

Publish with us

Policies and ethics