In Linux we trust!

BLAST algorithm for strings

BLAST - basic local alignment search tool. Algorithm for comparing primary biological sequence information, says Wikipedia. Basically, it's about searching of common molecular subsequences. But we will mess with strings, with searching longest common substrings of 2 strings.

Description

Input of algorithm looks in the following way:

BLAST input format

s1 and s2 - input strings. m is minimal substring length.

Output element:

Output format element

l - found substring length. s1i and s2i substrings' positions inside corresponding string.

Over all algorithm output:

BLAST output format

Let's split s1 string into all possible substrings with length m. After that get all positions of substrings inside s2. String intersection map for our example looks like this:

String intersection map

Obviously, after building string intersection map algorithm should have loop with nesting depth of three. Let's list'em all:

  1. Iterate over intersection map layer by layer
  2. Iterate over each possible position inside the layer
  3. Search maximum possible length for current position

On picture above i already denoted the longest substring, that could be found, in order to demonstrate the logic of algorithm.

Multiple strings

What if we have one more string to compare with, let's say s3. In this case the best way IMO is 3-step search process:

  1. get r12 - BLAST results for s1 and s2
  2. get r23 - BLAST results for s2 and s3
  3. Merge results of r12 and r23

While merging results we can face situation where results of s3 can be substrings of s2 (or vice versa) =)

Merge multiple results

Indices for s1 and s2 are aligned to smallest length (5). For third string to result tuple is added one more index, that is said to be substring's position for third string.

Tokenization

In case of strings represent structured homogeneous data, we can transform strings to a sequence of 8-bit tokens, that are easily compared using BLAST. And much faster.

For above example, let's extract alphanumeric values from the string:

Tokenize string

To avoid token collisions you can use CRC32 instead of CRC8 (as shown above). Also you could notice that The and the have the same value, since we lowercased the string before tokenizing. Finally, the length of string is significantly decreased from 44 to 9.

A few words about HTML strings and their tokenizing process. Taking from HTML content just alphanumeric values gives us a lot of trash.

Alphanumeric values of HTML

Empower your regular expression with look behind in such way:

/(?<!<|\/)\b\w+\b(?!=)/

Tags and attribute names are not included now:

HTML tokens


That's all for now. Next article i'll show PHP BLAST algorithm implementation and also will provide some examples to experiment with. Stay tuned!

Next Article: clck