s2 - input strings.
m is minimal substring length.
l - found substring length.
s2i substrings' positions inside corresponding string.
Over all algorithm output:
s1 string into all possible substrings with length
m. After that get all positions of substrings inside
s2. String intersection map for our example looks like this:
Obviously, after building string intersection map algorithm should have loop with nesting depth of three. Let's list'em all:
On picture above i already denoted the longest substring, that could be found, in order to demonstrate the logic of algorithm.
What if we have one more string to compare with, let's say
s3. In this case the best way IMO is 3-step search process:
r12- BLAST results for
r23- BLAST results for
While merging results we can face situation where results of
s3 can be substrings of
s2 (or vice versa) =)
s2 are aligned to smallest length (5). For third string to result tuple is added one more index, that is said to be substring's position for third string.
In case of strings represent structured homogeneous data, we can transform strings to a sequence of 8-bit tokens, that are easily compared using BLAST. And much faster.
For above example, let's extract alphanumeric values from the string:
To avoid token collisions you can use CRC32 instead of CRC8 (as shown above). Also you could notice that
the have the same value, since we lowercased the string before tokenizing. Finally, the length of string is significantly decreased from 44 to 9.
A few words about HTML strings and their tokenizing process. Taking from HTML content just alphanumeric values gives us a lot of trash.
Empower your regular expression with look behind in such way:
Tags and attribute names are not included now:
That's all for now. Next article i'll show PHP BLAST algorithm implementation and also will provide some examples to experiment with. Stay tuned!
Next Article: clck