Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree

Jiang, Hongshan; Zhao, Ying; Chen, Wenguang; Zheng, Weimin

doi:10.1007/978-1-4419-5913-3_3

Hongshan Jiang²,
Ying Zhao,
Wenguang Chen &
…
Weimin Zheng

Part of the book series: Advances in Experimental Medicine and Biology ((AEMB,volume 680))

2491 Accesses
6 Citations

Abstract

Compared to a mismatched consensus motif, a degenerate consensus motif is more suitable for modeling position-specific variations within motifs. In the literature, the state-of-art methods using degenerate consensus motifs for de novo motif finding use a naïve enumeration algorithm, which is far from efficient. In this paper, we propose an efficient algorithm to extract maximal degenerate consensus motifs from a set of sequences based on a compact suffix tree. Our algorithm achieved a time complexity about \( n \) times lower than that of a naïve enumeration, where \( n \) is the average length of source sequences. To demonstrate the efficiency and effectiveness of our proposed algorithm, we applied it to finding transcription factor binding sites. It is validated on a popular benchmark proposed by Tompa. The executable files of our algorithm can be accessed through http://hpc.cs.tsinghua.edu.cn/bioinfo.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 219.00; Price excludes VAT (USA)

Softcover Book: USD 279.99; Price excludes VAT (USA)

Hardcover Book: USD 379.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
A tree where each edge is labeled by a single symbol.
2.
The original statement is: \( O({n^2}N(N - k + 1)\left( \begin{array}{l} k \\g \end{array} \right)) \), since they use a specific \( {\Sigma_E} \).
3.
Where multiple runs were conducted, each with a fixed length within the length range.

References

Stormo, G.D.: ‘DNA binding sites: representation and discovery’, Bioinformatics, 2000, 16, (1), pp. 16–23
Article PubMed CAS Google Scholar
Bussemaker, H.J., Li, H., and Siggia, E.D.: ‘From the cover: building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis’, Proc Natl Acad Sci USA, 2000, 97, (18), pp. 10096–10100
Article PubMed CAS Google Scholar
Sinha, S., and Tompa, M.: ‘Discovery of novel transcription factor binding sites by statistical overrepresentation’, Nucleic Acids Res, 2002, 30, (24), pp. 5549–5560
Article PubMed CAS Google Scholar
Sagot, M-F.: ‘Spelling approximate repeated or common motifs using a suffix tree’. In ‘Proceedings of the 1998 3rd Latin American Symposium, Apr 20–24 1998’ (1998), p. 374
Google Scholar
Marsan, L., and Sagot, M-F.: ‘Extracting structured motifs using a suffix tree – algorithms and application to promoter consensus identification’. In ‘Extracting structured motifs using a suffix tree – algorithms and application to promoter consensus identification’ (ACM, 2000), pp. 210–219
Google Scholar
Pavesi, G., Mauri, G., and Pesole, G.: ‘An algorithm for finding signals of unknown length in DNA sequences’, Bioinformatics, 2001, 17, (suppl 1), pp. S207–S214
Article PubMed Google Scholar
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C.: ‘Detecting subtle sequence signals. A Gibbs sampling strategy for multiple alignment’, Science, 1993, 262, (5131), p. 208
Article PubMed CAS Google Scholar
Bailey, T.L., and Elkan, C.: ‘Fitting a mixture model by expectation maximization to discover motifs in biopolymers’, Proc Int Conf Intell Syst Mol Biol, 1994, 2, pp. 28–36
PubMed CAS Google Scholar
Hertz, G., and Stormo, G.: ‘Identifying DNA and protein patterns with statistically significant alignments of multiple sequences’, Bioinformatics, 1999, 15, (7), pp. 563–577
Article PubMed CAS Google Scholar
Vishnevsky, O.V., and Kolchanov, N.A.: ‘ARGO: a web system for the detection of degenerate motifs and large-scale recognition of eukaryotic promoters’, Nucleic Acids Res, 2005, 33, (suppl 2), pp. W417–W422
Article PubMed CAS Google Scholar
Peng, C.H., Hsu, J.T., Chung, Y.S., Lin, Y.J., Chow, W.Y., Hsu, D.F., and Tang, C.Y.: ‘Identification of degenerate motifs using position restricted selection and hybrid ranking combination’, Nucleic Acids Res, 2006, 34, (22), pp. 6379–6391
Article PubMed CAS Google Scholar
Sandve, G.K., Abul, O., Walseng, V., et al.: ‘Improved benchmarks for computational motif discovery,’ BMC Bioinformatics, 8, p. 193, 2007
Article PubMed Google Scholar
Weiner, P.: ‘Linear pattern matching algorithms’. In ‘Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973), Volume 001973’, pp. 1–11
Google Scholar
Tompa, M., Li, N., Bailey, T.L., et al.: ‘Assessing computational tools for the discovery of transcription factor binding sites’, Nat Biotechnol, 2005, 23, (1), pp. 137–144
Article PubMed CAS Google Scholar
Ukkonen, E.: ‘On-line construction of suffix trees’, Algorithmica (New York), 1995, 14, (3), p. 249
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science and Technology, Institute of High Performance Computing, Tsinghua University, Beijing, 100084, China
Hongshan Jiang

Authors

Hongshan Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Ying Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Wenguang Chen
View author publications
You can also search for this author in PubMed Google Scholar
Weimin Zheng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hongshan Jiang .

Editor information

Editors and Affiliations

Dept. Computer Science, University of Georgia, Athens, 30602-7404, Georgia, USA
Hamid R. Arabnia

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jiang, H., Zhao, Y., Chen, W., Zheng, W. (2010). Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree. In: Arabnia, H. (eds) Advances in Computational Biology. Advances in Experimental Medicine and Biology, vol 680. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-5913-3_3

Download citation

DOI: https://doi.org/10.1007/978-1-4419-5913-3_3
Published: 09 August 2010
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-5912-6
Online ISBN: 978-1-4419-5913-3
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)

Publish with us

Policies and ethics