Regular Expression, Bioinformatics, and Partial Matching
Regular Expression and Free data on the Internet


Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

 Thursday, August 03, 2006
Thursday, August 03, 2006 4:59:12 PM (Taipei Standard Time, UTC+08:00) ( .NET Programming )

I have not studied partial matching in depth. Someone asked about using regular expressions for partial matching of ACGT-like patterns against strings, so I wrote a simple program to test my ideas about partial matching with regular expressions.

My first attempt was brute-force partial matching: I prepare all the possible patterns and match each of them against the target string. This works fine if you accept a tolerance of only 1 (one position that doesn't match), since the number of patterns to match is equal to the length of the original pattern. But if you want a tolerance greater than 1, the number of patterns goes up to C(pattern length, tolerance). For example, a 100-character pattern with a tolerance of 3 already needs C(100, 3) = 161,700 patterns, so it might drive you crazy if your pattern is longer than 100 characters.
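The brute-force idea can be sketched roughly as follows. This is a minimal Python illustration, not the original .NET program: the function names `pattern_count`, `wildcard_patterns`, and `brute_force_search` are hypothetical, and it assumes the pattern contains only plain letters such as A, C, G, and T, so no regex escaping is needed.

```python
import re
from itertools import combinations
from math import comb


def pattern_count(length, tolerance):
    """Number of wildcard patterns needed: C(length, tolerance)."""
    return comb(length, tolerance)


def wildcard_patterns(pattern, tolerance):
    """Yield one regex per way of choosing `tolerance` positions to ignore."""
    for positions in combinations(range(len(pattern)), tolerance):
        chars = list(pattern)
        for i in positions:
            chars[i] = "."  # '.' accepts any single character at this position
        yield "".join(chars)


def brute_force_search(pattern, target, tolerance):
    """Return sorted start offsets where `target` matches `pattern`
    with at most `tolerance` mismatching positions."""
    hits = set()
    for rx in wildcard_patterns(pattern, tolerance):
        # A zero-width lookahead lets overlapping matches be reported too.
        for m in re.finditer("(?=" + rx + ")", target):
            hits.add(m.start())
    return sorted(hits)


print(brute_force_search("ACGT", "TTACCTACGTAA", 1))  # [2, 6]
print(pattern_count(100, 3))                          # 161700
```

Offset 2 is the one-mismatch window `ACCT` and offset 6 is the exact match `ACGT`. The second call shows how quickly the number of patterns grows for a long pattern.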

I then tried another technique that I call "Check Appearance". I don't know whether it has an official or scientific name. First I take a substring of the target string with the same length as the pattern. Then I compare the frequencies of A, C, G, and T in the pattern and in that substring. Only if the differences are within the tolerance do I run the RegEx matching. I found this useful, especially when the pattern is long and the tolerance is low. I should try some other methods if I have time.
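A rough sketch of this frequency check, again in Python rather than the original .NET code, might look like the following. The names `passes_frequency_check`, `mismatches`, and `filtered_search` are hypothetical. The filter rule used here is an assumption about what "within the tolerance" means: with at most k mismatches, the count of any single letter can differ by at most k. The final verification counts mismatches directly as a stand-in for the RegEx matching step.

```python
from collections import Counter

ALPHABET = "ACGT"


def passes_frequency_check(pattern_counts, window_counts, tolerance):
    """Cheap necessary condition: k mismatches can change the count of
    any single letter by at most k (assumed reading of the filter)."""
    return all(
        abs(pattern_counts[c] - window_counts[c]) <= tolerance
        for c in ALPHABET
    )


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def filtered_search(pattern, target, tolerance):
    """Return start offsets of windows that pass the frequency filter
    and then the full comparison."""
    n = len(pattern)
    if n == 0 or n > len(target):
        return []
    pattern_counts = Counter(pattern)
    window_counts = Counter(target[:n])
    hits = []
    for start in range(len(target) - n + 1):
        if start > 0:
            # Slide the window: drop the letter that left, add the one that entered.
            window_counts[target[start - 1]] -= 1
            window_counts[target[start + n - 1]] += 1
        if not passes_frequency_check(pattern_counts, window_counts, tolerance):
            continue  # skip the expensive comparison for this window
        if mismatches(pattern, target[start:start + n]) <= tolerance:
            hits.append(start)
    return hits


print(filtered_search("ACGT", "TTACCTACGTAA", 1))  # [2, 6]
```

Because the letter counts are updated incrementally, the filter costs a constant amount of work per window. It only rejects windows that cannot match, so every surviving window still needs the full comparison.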

I have attached all of my source code here. Please understand that the code was developed for quick testing and is not a fully functional release. There are likely many bugs, and the accuracy of the results must be checked. Please let me know if you find any bugs.

Download link has been moved to JumboGuide.

 Tuesday, August 01, 2006
Tuesday, August 01, 2006 11:26:43 PM (Taipei Standard Time, UTC+08:00) ( .NET Programming )

There is a lot of free data on the net, but nobody collects and uses it. I am starting a project to collect various data from the net. Using the regular expressions provided by .NET, I think it is quite simple to collect web data. I have put related work on http://www.derivativepower.com and http://www.jumboguide.com
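As a small illustration of the idea, here is a Python sketch (not the .NET code used in the project) that pulls rows out of an HTML fragment with one regular expression. The sample markup, the class names `name` and `price`, and the function `extract_rows` are all made up for this example. A real page would need its own pattern.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
<tr><td class="name">Item A</td><td class="price">1.5</td></tr>
<tr><td class="name">Item B</td><td class="price">2.25</td></tr>
</table>
"""

# Named groups capture the two cells of each row.
ROW = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">(?P<price>[\d.]+)</td>'
)


def extract_rows(html):
    """Return (name, price) tuples for every row the pattern finds."""
    return [(m.group("name"), float(m.group("price"))) for m in ROW.finditer(html)]


print(extract_rows(SAMPLE_HTML))  # [('Item A', 1.5), ('Item B', 2.25)]
```

A pattern like this is tied to one page layout, so it would have to be adjusted whenever the site changes its markup.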