Home All Groups Group Topic Archive Search About

text compare takes very long...

Author
11 Oct 2007 11:36 AM
Co
HI All,

I use the LevenshteinDistance function two compare two strings and
come back
with the number of differences in the strings.
For example LevenshteinDistance("Marco", "Co") will return 3.
The problem is that the code is very slow. I tried StrComp but that
will not return the number
of differences in the strings, just 1 or -1.

Is there some faster way of doing what I do?

Marco

Author
11 Oct 2007 11:41 AM
Donald Lessau
Show quote
"Co" <vonclausow***@gmail.com> schrieb im Newsbeitrag
news:1192102578.057400.109350@y42g2000hsy.googlegroups.com...
> HI All,
>
> I use the LevenshteinDistance function two compare two strings and
> come back
> with the number of differences in the strings.
> For example LevenshteinDistance("Marco", "Co") will return 3.
> The problem is that the code is very slow. I tried StrComp but that
> will not return the number
> of differences in the strings, just 1 or -1.
>
> Is there some faster way of doing what I do?
>

Post the code of your LevenshteinDistance() function, then I can tell you
whether mine is faster.

Don
Author
11 Oct 2007 12:04 PM
Co
Show quote
On 11 okt, 13:41, "Donald Lessau" <d***@oflex.com> wrote:
> "Co" <vonclausow***@gmail.com> schrieb im Newsbeitragnews:1192102578.057400.109***@y42g2000hsy.googlegroups.com...
>
> > HI All,
>
> > I use the LevenshteinDistance function two compare two strings and
> > come back
> > with the number of differences in the strings.
> > For example LevenshteinDistance("Marco", "Co") will return 3.
> > The problem is that the code is very slow. I tried StrComp but that
> > will not return the number
> > of differences in the strings, just 1 or -1.
>
> > Is there some faster way of doing what I do?
>
> Post the code of your LevenshteinDistance() function, then I can tell you
> whether mine is faster.
>
> Don

Thanks Don,

Option Explicit: DefObj A-Z

Function LevenshteinDistance(String1 As String, String2 As String) As
Integer
' This code was originally written by
' Doug Steele, MVP  AccessH***@rogers.com
' http://I.Am/DougSteele
' You are free to use it in any application
' provided the copyright notice is left unchanged.
'
' Description: Computes the Levenshtein Distance between two strings.
'              The Levenshtein Distance (LD) is a measure of the
similarity
'              between two strings. The distance is the number of
deletions,
'              insertions of substitutions required to transform the
first
'              string into the seond string. The greater the distance,
the more
'              different the strings are.
'
'              The algorithm is:
'                 1. Assuming you have two strings s1 and s2. Set n to
the length
'                    of s1 and m to the length of s2. If the length of
either string
'                    is 0, the LD is the length of the other string.
In other words,
'                    if n = 0, LD = m and if m = 0, LD = n.
'                 2. Assuming both strings have non-0 lengths,
construct an
'                    array containing 0..n rows and 0..m columns.
Initialize the
'                    first row to the values 0..m and the first column
to the values
'                    0..n
'                 3. Examine each character in string s1 (i = 1 to n)
'                 4. Examine each character in string s2 (j = 1 to m)
'                 5. If s1(i) = s2(j), the cost is 0. Otherwise, the
cost is 1.
'                 6. Set cell a(i,j) of the matrix to the minimum of:
'                      a. The cell immediately above plus 1
a(i-1,j) + 1
'                      b. The cell immediately to the left plus 1
a(i,j-1) + 1
'                      c. The cell diagonally above and to the left
a(i-1,j-1) + cost
'                         plus the cost
'                 7. After completing all iterations, the distance is
found in cell a(n,m)
'
' Input:       String1  String   String to be compared
'              String2  String   String to be compared
'
' Returns:     An integer representing the "distance" between the two
strings

On Error GoTo Err_LevenshteinDistance

Dim intCost As Integer
Dim intDistance() As Integer
Dim intLenString1 As Integer
Dim intLenString2 As Integer
Dim intLoop1 As Integer
Dim intLoop2 As Integer
Dim strCurrCharString1 As String
Dim strCurrCharString2 As String

' Step 1: If length of String1 is 0, return length of String2.
'         If length of String2 is 0, return length of String1.

    intLenString1 = Len(String1)
    intLenString2 = Len(String2)
    If intLenString1 = 0 Then
        LevenshteinDistance = intLenString2
    ElseIf intLenString2 = 0 Then
        LevenshteinDistance = intLenString1
    Else

' Step 2: Construct a matrix of the appropriate size.
'         Initialize the first row and first column.

        ReDim intDistance(0 To intLenString1, 0 To intLenString2)
        For intLoop1 = 0 To intLenString1
            intDistance(intLoop1, 0) = intLoop1
        Next intLoop1
        For intLoop2 = 0 To intLenString2
            intDistance(0, intLoop2) = intLoop2
        Next intLoop2

' Step 3: Examine each character of String1

        For intLoop1 = 1 To intLenString1
            strCurrCharString1 = Mid$(String1, intLoop1, 1)

' Step 4: Examine each character of String4

            For intLoop2 = 1 To intLenString2
                strCurrCharString2 = Mid$(String2, intLoop2, 1)

' Step 5: Calculate cost (if characters are equal, cost is 0,
otherwise it's 1)

                intCost = IIf(strCurrCharString1 = strCurrCharString2,
0, 1)

' Step 6: Set cell i,j of the matrix equal to the minimum of:
'            a) the cell immediately above (i-1,j) plus 1
'            b) the cell immediately to the left (i,j-1) plus 1
'            c) the cell diagonally above and to the left (i-1,j-1)
plus the cost

                intDistance(intLoop1, intLoop2) =
Minimum(intDistance(intLoop1 - 1, intLoop2) + 1, _
                    intDistance(intLoop1, intLoop2 - 1) + 1, _
                    intDistance(intLoop1 - 1, intLoop2 - 1) + intCost)

            Next intLoop2

        Next intLoop1

    End If

End_LevenshteinDistance:
' The Levenshein distance is the value in the cell in the bottom right-
hand corner
' of the matrix
    LevenshteinDistance = intDistance(intLenString1, intLenString2)
    Exit Function

Err_LevenshteinDistance:
    Err.Raise Err.number, "LevenshteinDistance", Err.Description
    Resume End_LevenshteinDistance

End Function

Private Function Minimum(ByVal i As Integer, ByVal J As Integer, ByVal
k As Integer) As Integer
' returns the minimum of three values
On Error GoTo End_Minimum

Dim intMin As Integer

    intMin = i
    If J < intMin Then
        intMin = J
    End If
    If k < intMin Then
        intMin = k
    End If

End_Minimum:
    Minimum = intMin
    Exit Function

Err_Minimum:
    Err.Raise Err.number, "Minimum", Err.Description
    Resume End_Minimum

End Function

Marco
Author
11 Oct 2007 6:39 PM
Donald Lessau
Show quote
"Co" <vonclausow***@gmail.com> schrieb im Newsbeitrag
news:1192104286.438756.24070@o80g2000hse.googlegroups.com...
> On 11 okt, 13:41, "Donald Lessau" <d***@oflex.com> wrote:
>> "Co" <vonclausow***@gmail.com> schrieb im
>> Newsbeitragnews:1192102578.057400.109***@y42g2000hsy.googlegroups.com...
>>
>> > HI All,
>>
>> > I use the LevenshteinDistance function two compare two strings and
>> > come back
>> > with the number of differences in the strings.
>> > For example LevenshteinDistance("Marco", "Co") will return 3.
>> > The problem is that the code is very slow. I tried StrComp but that
>> > will not return the number
>> > of differences in the strings, just 1 or -1.
>>
>> > Is there some faster way of doing what I do?
>>

Okay, this is not super-optimized but I timed it at 7 to 14 times (depending
on input) faster than your function. (BTW, LevDist02("Marco", "Co") returns
4, not 3 -- also in your function :) )

cheers,
don

Public Function LevDist02(String1 As String, String2 As String) As Long
' by Donald, 20041129
  Dim D() As Long   ' matrix
  Dim m As Long     ' length of String2
  Dim n As Long     ' length of String1
  Dim i As Long     ' iterates through String1
  Dim j As Long     ' iterates through String2
  Dim lCost As Long ' lCost
  Dim b1() As Byte
  Dim b2() As Byte

  ' Step 1
  n = Len(String1)
  m = Len(String2)
  If n = 0 Then
    LevDist02 = m
    Exit Function
  End If
  If m = 0 Then
    LevDist02 = n
    Exit Function
  End If

  ' Step 2: fill matrix
  ReDim D(0 To n, 0 To m) As Long
  For i = 0 To n
    D(i, 0) = i
  Next
  For j = 0 To m
    D(0, j) = j
  Next

  ' Step 3
  b1 = StrConv(String1, vbFromUnicode)
  b2 = StrConv(String2, vbFromUnicode)

  For i = 1 To n
    For j = 1 To m
      If b1(i - 1) = b2(j - 1) Then
        lCost = 0
      Else
        lCost = 1
      End If
      D(i, j) = MinThree01(D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1)
+ lCost)
    Next
  Next
  LevDist02 = D(n, m)

End Function

Private Function MinThree01(ByVal l1&, ByVal l2&, ByVal l3&) As Long
' by Donald, 20011116
  If l1 < l2 Then
    If l3 < l1 Then MinThree01 = l3 Else MinThree01 = l1
  Else
    If l2 < l3 Then MinThree01 = l2 Else MinThree01 = l3
  End If
End Function
Author
11 Oct 2007 9:40 PM
Mike Williams
"Donald Lessau" <d**@oflex.com> wrote in message
news:felqks$dja$1@newsreader2.netcologne.de...

> Okay, this is not super-optimized but I timed it at 7 to 14 times
> (depending on input) faster than your [the OP's] function.

I was about to post a response advising the OP to get rid of his IIf
statement, which in all cases is dreadfully slow compared to the equivalent
If Then Else block, especially in a compiled exe, and also advising him to
deal with the byte data of the strings rather than dealing with the
characters themselves, but your own response seems to have covered both of
those bases.

> (BTW, LevDist02("Marco", "Co") returns 4, not 3 -- also in your function
> :) )

That's because your own version is comparing the string byte data without
regard to the "case" of the characters whereas the OP's version will be
comparing them in accordance with the current setting of Option Compare
statement. That's fine of course if that is what the OP requires, but
otherwise you can easily modify your own code to deal with characters
without regard to their case. There are many ways to do that, but perhaps
the easiest way (which does not add very much time to the overall execution
time) is to use:

    b1 = StrConv(LCase(String1), vbFromUnicode)
    b2 = StrConv(LCase(String2), vbFromUnicode)

.. . . instead of your existing:

  b1 = StrConv(String1, vbFromUnicode)
  b2 = StrConv(String2, vbFromUnicode)

The only other suggestion I could make that would improve the speed would be
to deal with the String data exactly where it already "lives" in memory
without actually converting it to a Byte array (perhaps by assigning the
String to an array of Bytes or Integers using a SAFEARRAY structure) but in
this specific case that wouldn't make a lot of difference to the overall
speed unless one or more of the two strings was extraordinarily large.

Mike
Author
12 Oct 2007 10:59 AM
Co
Show quote
On 11 okt, 23:40, "Mike Williams" <mi***@whiskyandCoke.com> wrote:
> "Donald Lessau" <d***@oflex.com> wrote in message
>
> news:felqks$dja$1@newsreader2.netcologne.de...
>
> > Okay, this is not super-optimized but I timed it at 7 to 14 times
> > (depending on input) faster than your [the OP's] function.
>
> I was about to post a response advising the OP to get rid of his IIf
> statement, which in all cases is dreadfully slow compared to the equivalent
> If Then Else block, especially in a compiled exe, and also advising him to
> deal with the byte data of the strings rather than dealing with the
> characters themselves, but your own response seems to have covered both of
> those bases.
>
> > (BTW, LevDist02("Marco", "Co") returns 4, not 3 -- also in your function
> > :) )
>
> That's because your own version is comparing the string byte data without
> regard to the "case" of the characters whereas the OP's version will be
> comparing them in accordance with the current setting of Option Compare
> statement. That's fine of course if that is what the OP requires, but
> otherwise you can easily modify your own code to deal with characters
> without regard to their case. There are many ways to do that, but perhaps
> the easiest way (which does not add very much time to the overall execution
> time) is to use:
>
>     b1 = StrConv(LCase(String1), vbFromUnicode)
>     b2 = StrConv(LCase(String2), vbFromUnicode)
>
> . . . instead of your existing:
>
>   b1 = StrConv(String1, vbFromUnicode)
>   b2 = StrConv(String2, vbFromUnicode)
>
> The only other suggestion I could make that would improve the speed would be
> to deal with the String data exactly where it already "lives" in memory
> without actually converting it to a Byte array (perhaps by assigning the
> String to an array of Bytes or Integers using a SAFEARRAY structure) but in
> this specific case that wouldn't make a lot of difference to the overall
> speed unless one or more of the two strings was extraordinarily large.
>
> Mike

Mike,

The last part of your post you have to explain maybe with an example:
"The only other suggestion I could make that would improve the
speed...."

Marco
Author
12 Oct 2007 1:27 PM
Mike Williams
"Co" <vonclausow***@gmail.com> wrote in message
news:1192186751.729156.36420@v23g2000prn.googlegroups.com...

> The last part of your post you have to explain maybe with an
> example: "The only other suggestion I could make that would
> improve the speed...."

Actually I wouldn't get too excited about it because Donald's code has
already accomplished the most important task of allowing your code to deal
with integers (Byte arrays) instead of with string data, so Donald has
already speeded your stuff up almost as much as it will go. My own suggested
modification would be to save just a little bit more time by pointing the
arrays directly at the existing string data in memory rather than using
StrConv to copy the data. In some kinds of code this technique can make a
lot of difference, but in code of this nature there won't be much of an
improvement. There are various slightly different ways of doing this. The
method I've used in the following example uses Integer arrays instead of
Byte arrays and it points the string data (which of course contains the
standard VB two bytes per character) at the Integer arrays. This particular
example is the equivalent of Donald's original code which deals with the
data in a case sensitive way.

Mike

Option Explicit
Private Declare Sub BindArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
  Optional ByVal cb As Long = 4)
Private Declare Sub ReleaseArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, _
  Optional pSrc As Long = 0, _
  Optional ByVal cb As Long = 4)
Private Type SAFEARRAY1D
  cDims As Integer
  fFeatures As Integer
  cbElements As Long
  cLocks As Long
  pvData As Long
  cElements As Long
  lLbound As Long
End Type

Public Function LevDist02(String1 As String, _
  string2 As String) As Long
' by Donald Lessau (VB newsgroup), 20041129
' this is a VB implementation of the Levenstein
' Distance function which ranks words by their
' similarity
  Dim D() As Long   ' matrix
  Dim m As Long     ' length of String2
  Dim n As Long     ' length of String1
  Dim i As Long     ' iterates through String1
  Dim j As Long     ' iterates through String2
  Dim lCost As Long ' lCost
  Dim b1() As Integer ' previously Bytes
  Dim b2() As Integer ' previously Bytes
  Dim sa1 As SAFEARRAY1D
  Dim sa2 As SAFEARRAY1D
  ' Step 1
  n = Len(String1)
  m = Len(string2)
  If n = 0 Then
    LevDist02 = m
    Exit Function
  End If
  If m = 0 Then
    LevDist02 = n
    Exit Function
  End If
  ' Step 2: fill matrix
  ReDim D(0 To n, 0 To m) As Long
  For i = 0 To n
    D(i, 0) = i
  Next
  For j = 0 To m
    D(0, j) = j
  Next
  ' Step 3
  ' The SAFEARRAY stuff is Mike Williams'
  ' modification which points the arrays
  ' directly at the existing string data in
  ' memory instead of copying the string
  ' data into the arrays as was done in
  ' the following two commented out lines.
  ' b1 = StrConv(String1, vbFromUnicode)
  ' b2 = StrConv(string2, vbFromUnicode)
  sa1.cDims = 1
  sa1.cbElements = 2 ' 2 Bytes per Element (Integer)
  sa1.pvData = StrPtr(String1)
  sa1.cElements = Len(String1) ' number of elements
  ' sa1.lLbound is already zero
  sa2.cDims = 1
  sa2.cbElements = 2 ' 2 Bytes per Element (Integer)
  sa2.pvData = StrPtr(string2)
  sa2.cElements = Len(string2) ' number of elements
  ' sa1.lLbound is already zero
  '
  ' point both arrays at the String data
  ' Thanks to Olaf Schmidt from whom I stole the idea
  ' of modifying the RtlMoveMemory parameters (the
  ' BindArray and ReleaseArray functions) which
  ' makes it slightly easier than using the standard
  ' Copymemory version of RtlMoveMemory for this
  ' purpose ;-)
  BindArray b1, VarPtr(sa1)
  BindArray b2, VarPtr(sa2)
  '
  For i = 1 To n
    For j = 1 To m
      If b1(i - 1) = b2(j - 1) Then
        lCost = 0
      Else
        lCost = 1
      End If
      D(i, j) = MinThree01(D(i - 1, j) + 1, _
        D(i, j - 1) + 1, D(i - 1, j - 1) + lCost)
    Next
  Next
  LevDist02 = D(n, m)
' Set the arrays back to their initial state
ReleaseArray b1
ReleaseArray b2
End Function

Private Function MinThree01(ByVal l1&, ByVal l2&, _
  ByVal l3&) As Long
' by Donald, 20011116
  If l1 < l2 Then
    If l3 < l1 Then MinThree01 = l3 Else MinThree01 = l1
  Else
    If l2 < l3 Then MinThree01 = l2 Else MinThree01 = l3
  End If
End Function

Private Sub Command1_Click()
Print LevDist02("Marco", "Co")
End Sub
Author
12 Oct 2007 1:50 PM
Donald Lessau
Show quote
"Mike Williams" <m***@WhiskyAndCoke.com> schrieb im Newsbeitrag
news:%23lGq$ONDIHA.1168@TK2MSFTNGP02.phx.gbl...
> "Co" <vonclausow***@gmail.com> wrote in message
> news:1192186751.729156.36420@v23g2000prn.googlegroups.com...
>
>> The last part of your post you have to explain maybe with an
>> example: "The only other suggestion I could make that would
>> improve the speed...."
>
> Actually I wouldn't get too excited about it because Donald's code has
> already accomplished the most important task of allowing your code to deal
> with integers (Byte arrays) instead of with string data, so Donald has
> already speeded your stuff up almost as much as it will go. My own
> suggested modification would be to save just a little bit more time by
> pointing the arrays directly at the existing string data in memory rather
> than using StrConv to copy the data. In some kinds of code this technique
> can make a lot of difference, but in code of this nature there won't be
> much of an improvement. ...

Mike, the difference is notable: 20% faster, you win! :)

Levenshtein Distance/Call 4: Brandenburg Concerto vs Brandenbrug Concerto
VB6/SP4-COMPILED/AllCompileOpt ON: 10 x 10000 iterations, 12.10.2007
15:49:01
System: Athlon 64 X2 4000+, 960 MB RAM, Win XP Pro SP2, IE6.0
Rank - Rel.Time - Abs.Time - Routine              - Author
   3 -     7.53 - 100.447µ - LevDist01            - Donald
   2 -     1.20 - 15.986µs - LevDist02            - Donald
   4 -    16.63 - 221.728µ - LevDist03            - Doug Steele
   1 -     1.00 - 13.331µs - LevDist04            - Mike Williams

In words: I called LevDist**("Brandenburg Concerto", "Brandenbrug Concerto")
for 10,000 times and repeated this 10 times.

Don
Author
12 Oct 2007 6:03 PM
Mike Williams
"Donald Lessau" <d**@oflex.com> wrote in message
news:fenu2c$iki$1@newsreader2.netcologne.de...

> Mike, the difference is notable: 20% faster, you win! :)

Well actually it wasn't a contest Don, as I'm sure you know, and I certainly
wouldn't have been able to come up with the algorithm myself(!) but it's
nice to know that you're getting roughly the same test results as I am here.
With code such as this the speed difference produced by the modification is
much larger on small test words than it is on large test words, so whereas
it can be about twice as fast on small words it is down to about 20% faster
on words of the length you have used in your own tests.

The OP hasn't yet responded to your offer of providing him with your new
algorithm (the Ratcliff / Obershelp / Levenshtein method) but I wouldn't
mind looking at it myself if you don't mind. I have an idea that a more
efficient algorithm would be open to much greater percentage speed
improvements if modified in a similar way. Or perhaps not. Could be very
interesting, either way.

Mike
Author
12 Oct 2007 7:11 PM
Donald Lessau
Show quote
"Mike Williams" <m***@WhiskyAndCoke.com> schrieb im Newsbeitrag
news:uaT63oPDIHA.2324@TK2MSFTNGP03.phx.gbl...
> "Donald Lessau" <d**@oflex.com> wrote in message
> news:fenu2c$iki$1@newsreader2.netcologne.de...
>
>> Mike, the difference is notable: 20% faster, you win! :)
>
> Well actually it wasn't a contest Don, as I'm sure you know, and I
> certainly wouldn't have been able to come up with the algorithm myself(!)
> but it's nice to know that you're getting roughly the same test results as
> I am here. With code such as this the speed difference produced by the
> modification is much larger on small test words than it is on large test
> words, so whereas it can be about twice as fast on small words it is down
> to about 20% faster on words of the length you have used in your own
> tests.
>
> The OP hasn't yet responded to your offer of providing him with your new
> algorithm (the Ratcliff / Obershelp / Levenshtein method) but I wouldn't
> mind looking at it myself if you don't mind. I have an idea that a more
> efficient algorithm would be open to much greater percentage speed
> improvements if modified in a similar way. Or perhaps not. Could be very
> interesting, either way.
>

Sure it's no contest. But computers were created for speed in the first
place, and I still like this idea.

Okay, here's the Ratcliff/Obershelp/Levenshtein method: it returns a
fraction between 0 (total difference) and 1 (total similarity).

Public Function Similarity04(String1 As String, String2 As String) As Double
' by Donald, 20041128
' based on Similarity01 by Atul Brad Buono, 6/28/2000
' Ratcliff/Obershelp/Levenshtein method
  Dim l1 As Long
  Dim l2 As Long
  Dim b1() As Byte
  Dim b2() As Byte

  ' make it case sensitive!
  If String1 = String2 Then
    Similarity04 = 1
  Else
    l1 = Len(String1)
    l2 = Len(String2)

    If l1 = 0 Or l2 = 0 Then
      Similarity04 = 0
    Else
      b1 = StrConv(String1, vbFromUnicode)
      b2 = StrConv(String2, vbFromUnicode)
      Similarity04 = Similarity04helper(0, l1 - 1, 0, l2 - 1, b1, b2) * (2 /
(l1 + l2))
    End If
  End If
End Function
Private Function Similarity04helper(st1 As Long, end1 As Long, st2 As Long,
end2 As Long, b1() As Byte, b2() As Byte) As Long
' r e c u r s i v e
  Dim c1 As Long
  Dim c2 As Long
  Dim ns1 As Long
  Dim ns2 As Long
  Dim i As Long
  Dim Max As Long

  If st1 > end1 Or st2 > end2 Or st1 < 0 Or st2 < 0 Then Exit Function

  ns1 = -1
  ns2 = -1

  For c1 = st1 To end1
    ' added this for speed up: max cannot be beaten anymore
    If end1 - c1 + 1 <= Max Then Exit For
    For c2 = st2 To end2
      i = 0
      Do Until b1(c1 + i) <> b2(c2 + i)
        i = i + 1
        If i > Max Then
          ns1 = c1
          ns2 = c2
          Max = i
        End If
        If c1 + i > end1 Or c2 + i > end2 Then Exit Do
      Loop
    Next
  Next
  Max = Max + Similarity04helper(ns1 + Max, end1, ns2 + Max, end2, b1, b2)
  Max = Max + Similarity04helper(st1, ns1 - 1, st2, ns2 - 1, b1, b2)

  Similarity04helper = Max
End Function
Author
13 Oct 2007 12:06 AM
Mike Williams
"Donald Lessau" <d**@oflex.com> wrote in message
news:feogqp$nn$1@newsreader2.netcologne.de...

> Okay, here's the Ratcliff/Obershelp/Levenshtein method:

.. . . Oops. Bit of a problem in the last block of code I posted. I forgot to
include the two lines that release the arrays in the new function. This will
cause problems with VB's housekeeping and will also slow down the routine
because of the erroneous memory handling that results. Here is the modified
version. I've decided to post the complete amended block of code rather than
just the two missing lines to ensure that there is no confusion as to where
they should go. I've also included the OP's original code for completeness.
Paste the following code into a VB Form containing two Text Boxes and five
Command Buttons, with all controls placed on the right side of the Form to
allow the printed results to be seen. Compile to optimised native code.
You'll probably find that the latest version is even faster now that I've
corrected my initial mistake. I'm afraid there's a lot of code, most of it
like a pile of spaghetti, but then it is only just test bed code at the
moment :-)

Mike

Option Explicit
Private Type SAFEARRAY1D
  cDims As Integer
  fFeatures As Integer
  cbElements As Long
  cLocks As Long
  pvData As Long
  cElements As Long
  lLbound As Long
End Type
Private Declare Sub BindArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
  Optional ByVal cb As Long = 4)
Private Declare Sub ReleaseArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, _
  Optional pSrc As Long = 0, _
  Optional ByVal cb As Long = 4)
Private Declare Function timeBeginPeriod _
  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeEndPeriod _
  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeGetTime _
  Lib "winmm.dll" () As Long
Private s1 As String, s2 As String ' test strings

Public Function LevenshteinDistance(String1 As String, _
  String2 As String) As Integer
' The OP's original code. Marco (Co)
' This code was originally written by
' Doug Steele, MVP  AccessH***@rogers.com
' http://I.Am/DougSteele
' You are free to use it in any application
' provided the copyright notice is left unchanged.
On Error GoTo Err_LevenshteinDistance
Dim intCost As Integer
Dim intDistance() As Integer
Dim intLenString1 As Integer
Dim intLenString2 As Integer
Dim intLoop1 As Integer
Dim intLoop2 As Integer
Dim strCurrCharString1 As String
Dim strCurrCharString2 As String
' Step 1: If length of String1 is 0, return length
' of String2.
'         If length of String2 is 0, return length
' of String1.
    intLenString1 = Len(String1)
    intLenString2 = Len(String2)
    If intLenString1 = 0 Then
        LevenshteinDistance = intLenString2
    ElseIf intLenString2 = 0 Then
        LevenshteinDistance = intLenString1
    Else
' Step 2: Construct a matrix of the appropriate size.
'         Initialize the first row and first column.
ReDim intDistance(0 To intLenString1, 0 To intLenString2)
For intLoop1 = 0 To intLenString1
intDistance(intLoop1, 0) = intLoop1
Next intLoop1
For intLoop2 = 0 To intLenString2
            intDistance(0, intLoop2) = intLoop2
        Next intLoop2
' Step 3: Examine each character of String1
  For intLoop1 = 1 To intLenString1
strCurrCharString1 = Mid$(String1, intLoop1, 1)
' Step 4: Examine each character of String4
    For intLoop2 = 1 To intLenString2
    strCurrCharString2 = Mid$(String2, intLoop2, 1)
' Step 5: Calculate cost (if characters are equal,
' cost is 0, otherwise it's 1)
    intCost = IIf(strCurrCharString1 = strCurrCharString2, 0, 1)
' Step 6: Set cell i,j of the matrix equal to the
' minimum of:
'    a) the cell immediately above (i-1,j) plus 1
'    b) the cell immediately to the left (i,j-1) plus 1
'    c) the cell diagonally above and to the left
'    (i-1,j-1)plus the cost
intDistance(intLoop1, intLoop2) = _
Minimum(intDistance(intLoop1 - 1, intLoop2) + 1, _
    intDistance(intLoop1, intLoop2 - 1) + 1, _
    intDistance(intLoop1 - 1, intLoop2 - 1) + intCost)
    Next intLoop2
        Next intLoop1
    End If
End_LevenshteinDistance:
' The Levenshein distance is the value in the cell in the
' bottom right-hand corner of the matrix
LevenshteinDistance = intDistance(intLenString1, _
  intLenString2)
    Exit Function
Err_LevenshteinDistance:
    Err.Raise Err.Number, "LevenshteinDistance", _
    Err.Description
    Resume End_LevenshteinDistance
End Function

Private Function Minimum(ByVal i As Integer, ByVal J As Integer, _
  ByVal k As Integer) As Integer
' returns the minimum of three values
On Error GoTo End_Minimum
Dim intMin As Integer
    intMin = i
    If J < intMin Then
        intMin = J
    End If
    If k < intMin Then
        intMin = k
    End If
End_Minimum:
    Minimum = intMin
    Exit Function
Err_Minimum:
    Err.Raise Err.Number, "Minimum", Err.Description
    Resume End_Minimum
End Function

Public Function LevDist03A(String1 As String, _
  String2 As String) As Long
' by Donald Lessau (VB newsgroup), 20041129
' (as modified by mike Williams)
' this is a VB implementation of the Levenstein
' Distance function which ranks words by their
' similarity
  Dim D() As Long   ' matrix
  Dim m As Long     ' length of String2
  Dim n As Long     ' length of String1
  Dim i As Long     ' iterates through String1
  Dim J As Long     ' iterates through String2
  Dim lCost As Long ' lCost
  Dim b1() As Integer
  Dim b2() As Integer
  Dim sa1 As SAFEARRAY1D
  Dim sa2 As SAFEARRAY1D
  ' Step 1
  n = Len(String1)
  m = Len(String2)
  If n = 0 Then
    LevDist03A = m
    Exit Function
  End If
  If m = 0 Then
    LevDist03A = n
    Exit Function
  End If
  ' Step 2: fill matrix
  ReDim D(0 To n, 0 To m) As Long
  For i = 0 To n
    D(i, 0) = i
  Next
  For J = 0 To m
    D(0, J) = J
  Next
  ' Step 3
  ' (Mike's note: The SAFEARRAY stuff is my own
  ' modification to speed it up over Donald's
  ' original code which used Byte arrays filled
  ' from the Strings by theStrConv function.
  sa1.cDims = 1
  sa1.cbElements = 2 ' 2 Bytes per Element
  sa1.pvData = StrPtr(String1)
  sa1.cElements = Len(String1) ' number of elements
  ' sa1.lLbound is already zero
  sa2.cDims = 1
  sa2.cbElements = 2 ' 2 Bytes per Element
  sa2.pvData = StrPtr(String2)
  sa2.cElements = Len(String2) ' number of elements
  ' sa1.lLbound is already zero
  ' point both arrays at the String data
  BindArray b1, VarPtr(sa1)
  BindArray b2, VarPtr(sa2)
  '
  For i = 1 To n
    For J = 1 To m
      If b1(i - 1) = b2(J - 1) Then
        lCost = 0
      Else
        lCost = 1
      End If
      D(i, J) = MinThree01(D(i - 1, J) + 1, _
      D(i, J - 1) + 1, D(i - 1, J - 1) + lCost)
    Next
  Next
  LevDist03A = D(n, m)
' Set the arrays back to their initial state
ReleaseArray b1
ReleaseArray b2
End Function

Public Function LevDist03(String1 As String, _
  String2 As String) As Long
' by Donald Lessau (VB newsgroup), 20041129
'
' this is a VB implementation of the Levenstein
' Distance function which ranks words by their
' similarity
  Dim D() As Long   ' matrix
  Dim m As Long     ' length of String2
  Dim n As Long     ' length of String1
  Dim i As Long     ' iterates through String1
  Dim J As Long     ' iterates through String2
  Dim lCost As Long ' lCost
  Dim b1() As Byte
  Dim b2() As Byte

  ' Step 1
  n = Len(String1)
  m = Len(String2)
  If n = 0 Then
    LevDist03 = m
    Exit Function
  End If
  If m = 0 Then
    LevDist03 = n
    Exit Function
  End If
  ' Step 2: fill matrix
  ReDim D(0 To n, 0 To m) As Long
  For i = 0 To n
    D(i, 0) = i
  Next
  For J = 0 To m
    D(0, J) = J
  Next
  ' Step 3
  b1 = StrConv(String1, vbFromUnicode)
  b2 = StrConv(String2, vbFromUnicode)

  For i = 1 To n
    For J = 1 To m
      If b1(i - 1) = b2(J - 1) Then
        lCost = 0
      Else
        lCost = 1
      End If
      D(i, J) = MinThree01(D(i - 1, J) + 1, _
      D(i, J - 1) + 1, D(i - 1, J - 1) + lCost)
    Next
  Next
  LevDist03 = D(n, m)
End Function

Private Function MinThree01(ByVal l1&, ByVal l2&, _
  ByVal l3&) As Long
' by Donald, 20011116
  If l1 < l2 Then
    If l3 < l1 Then MinThree01 = l3 Else MinThree01 = l1
  Else
    If l2 < l3 Then MinThree01 = l2 Else MinThree01 = l3
  End If
End Function

Private Sub Form_Load()
timeBeginPeriod 1
Text1.Text = "Prescriptions"
Text2.Text = "Description"
Command1.Caption = "OP's original code"
Command2.Caption = "Don's first code"
Command3.Caption = "Don's first code modified"
Command4.Caption = "Don's second code"
Command5.Caption = "Don's second code modified"
Me.AutoRedraw = True
End Sub

Public Function Similarity04(String1 As String, _
  String2 As String) As Double
' This is a completely different algorithm supplied
' later by Don
' by Donald, 20041128
' based on Similarity01 by Atul Brad Buono, 6/28/2000
' Ratcliff/Obershelp/Levenshtein method
  Dim l1 As Long
  Dim l2 As Long
  Dim b1() As Byte
  Dim b2() As Byte
  ' make it case sensitive!
  If String1 = String2 Then
    Similarity04 = 1
  Else
    l1 = Len(String1)
    l2 = Len(String2)
    If l1 = 0 Or l2 = 0 Then
      Similarity04 = 0
    Else
      b1 = StrConv(String1, vbFromUnicode)
      b2 = StrConv(String2, vbFromUnicode)
      Similarity04 = Similarity04helper _
      (0, l1 - 1, 0, l2 - 1, b1, b2) * (2 / (l1 + l2))
    End If
  End If
End Function

Private Function Similarity04helper(st1 As Long, _
  end1 As Long, st2 As Long, end2 As Long, _
  b1() As Byte, b2() As Byte) As Long
' r e c u r s i v e
  Dim c1 As Long
  Dim c2 As Long
  Dim ns1 As Long
  Dim ns2 As Long
  Dim i As Long
  Dim Max As Long

  If st1 > end1 Or st2 > end2 Or st1 < 0 Or st2 < 0 _
  Then Exit Function

  ns1 = -1
  ns2 = -1
  For c1 = st1 To end1
    ' added for speed up: max cannot be beaten anymore
    If end1 - c1 + 1 <= Max Then Exit For
    For c2 = st2 To end2
      i = 0
      Do Until b1(c1 + i) <> b2(c2 + i)
        i = i + 1
        If i > Max Then
          ns1 = c1
          ns2 = c2
          Max = i
        End If
        If c1 + i > end1 Or c2 + i > end2 Then Exit Do
      Loop
    Next
  Next
  Max = Max + Similarity04helper(ns1 + Max, end1, _
  ns2 + Max, end2, b1, b2)
  Max = Max + Similarity04helper(st1, ns1 - 1, _
  st2, ns2 - 1, b1, b2)
  Similarity04helper = Max
End Function

Public Function Similarity05(String1 As String, _
  String2 As String) As Double
' This is a completely different algorithm supplied
' later by Don
' by Donald, 20041128
' based on Similarity01 by Atul Brad Buono, 6/28/2000
' Ratcliff/Obershelp/Levenshtein method
  Dim l1 As Long
  Dim l2 As Long
  Dim b1() As Integer
  Dim b2() As Integer
  Dim sa1 As SAFEARRAY1D
  Dim sa2 As SAFEARRAY1D
  ' make it case sensitive!
  If String1 = String2 Then
    Similarity05 = 1
  Else
    l1 = Len(String1)
    l2 = Len(String2)
    If l1 = 0 Or l2 = 0 Then
      Similarity05 = 0
    Else
      ' (Mike's note: The SAFEARRAY stuff is my own
      ' modification to speed it up over Donald's
      ' original code which used Byte arrays filled
      ' from the Strings by theStrConv function.
      sa1.cDims = 1
      sa1.cbElements = 2 ' 2 Bytes per Element
      sa1.pvData = StrPtr(String1)
      sa1.cElements = Len(String1) ' number of elements
      ' sa1.lLbound is already zero
      sa2.cDims = 1
      sa2.cbElements = 2 ' 2 Bytes per Element
      sa2.pvData = StrPtr(String2)
      sa2.cElements = Len(String2) ' number of elements
      ' sa1.lLbound is already zero
      ' point both arrays at the String data
      BindArray b1, VarPtr(sa1)
      BindArray b2, VarPtr(sa2)
      'b1 = StrConv(String1, vbFromUnicode)
      'b2 = StrConv(String2, vbFromUnicode)
      Similarity05 = Similarity05helper _
      (0, l1 - 1, 0, l2 - 1, b1, b2) * (2 / (l1 + l2))
    End If
  End If
' Set the arrays back to their initial state
ReleaseArray b1
ReleaseArray b2
End Function

Private Function Similarity05helper(st1 As Long, _
  end1 As Long, st2 As Long, end2 As Long, _
  b1() As Integer, b2() As Integer) As Long
' r e c u r s i v e
  Dim c1 As Long
  Dim c2 As Long
  Dim ns1 As Long
  Dim ns2 As Long
  Dim i As Long
  Dim Max As Long

  If st1 > end1 Or st2 > end2 Or st1 < 0 Or st2 < 0 _
  Then Exit Function
  ns1 = -1
  ns2 = -1
  For c1 = st1 To end1
    ' added for speed up: max cannot be beaten anymore
    If end1 - c1 + 1 <= Max Then Exit For
    For c2 = st2 To end2
      i = 0
      Do Until b1(c1 + i) <> b2(c2 + i)
        i = i + 1
        If i > Max Then
          ns1 = c1
          ns2 = c2
          Max = i
        End If
        If c1 + i > end1 Or c2 + i > end2 Then Exit Do
      Loop
    Next
  Next
  Max = Max + Similarity05helper(ns1 + Max, end1, _
  ns2 + Max, end2, b1, b2)
  Max = Max + Similarity05helper(st1, ns1 - 1, _
  st2, ns2 - 1, b1, b2)
  Similarity05helper = Max
End Function

Private Sub Command1_Click()
Command1.Enabled = False
Dim n As Long
Dim p As Long, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
t1 = timeGetTime
For n = 1 To 10000
  ' The OP's original code. Marco (Co)
  p = LevenshteinDistance(s1, s2)
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command1.Enabled = True
End Sub

Private Sub Command2_Click()
Command2.Enabled = False
Dim n As Long
Dim p As Long, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
t1 = timeGetTime
For n = 1 To 10000
  ' Donald's version
  p = LevDist03(s1, s2)
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command2.Enabled = True
End Sub

Private Sub Command3_Click()
Command3.Enabled = False
Dim n As Long
Dim p As Long, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
t1 = timeGetTime
For n = 1 To 10000
  ' Mike's modified version
  p = LevDist03A(s1, s2)
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command3.Enabled = True
End Sub

Private Sub Command4_Click()
Command4.Enabled = False
Dim n As Long
Dim p As Single, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
t1 = timeGetTime
For n = 1 To 10000
  ' Don's second version (returns a fp value 0 to 1)
  p = Similarity04(s1, s2)
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command4.Enabled = True
End Sub

Private Sub Command5_Click()
Command5.Enabled = False
Dim n As Long
Dim p As Single, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
t1 = timeGetTime
For n = 1 To 10000
  ' Mike's modified second version (returns value 0 to 1)
  p = Similarity05(s1, s2) ' still in work
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command5.Enabled = True
End Sub

Private Sub Form_Unload(Cancel As Integer)
timeEndPeriod 1
End Sub
Author
13 Oct 2007 10:20 AM
Co
Show quote
On 13 okt, 02:06, "Mike Williams" <mi***@whiskyandCoke.com> wrote:
> "Donald Lessau" <d***@oflex.com> wrote in message
>
> news:feogqp$nn$1@newsreader2.netcologne.de...
>
> > Okay, here's the Ratcliff/Obershelp/Levenshtein method:
>
> . . . Oops. Bit of a problem in the last block of code I posted. I forgot to
> include the two lines that release the arrays in the new function. This will
> cause problems with VB's housekeeping and will also slow down the routine
> because of the erroneous memory handling that results. Here is the modified
> version. I've decided to post the complete amended block of code rather than
> just the two missing lines to ensure that there is no confusion as to where
> they should go. I've also included the OP's original code for completeness.
> Paste the following code into a VB Form containing two Text Boxes and five
> Command Buttons, with all controls placed on the right side of the Form to
> allow the printed results to be seen. Compile to optimised native code.
> You'll probably find that the latest version is even faster now that I've
> corrected my initial mistake. I'm afraid there's a lot of code, most of it
> like a pile of spaghetti, but then it is only just test bed code at the
> moment :-)
>
> Mike
>
> Option Explicit
> Private Type SAFEARRAY1D
>   cDims As Integer
>   fFeatures As Integer
>   cbElements As Long
>   cLocks As Long
>   pvData As Long
>   cElements As Long
>   lLbound As Long
> End Type
> Private Declare Sub BindArray Lib "kernel32" _
>   Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
>   Optional ByVal cb As Long = 4)
> Private Declare Sub ReleaseArray Lib "kernel32" _
>   Alias "RtlMoveMemory" (pArr() As Any, _
>   Optional pSrc As Long = 0, _
>   Optional ByVal cb As Long = 4)
> Private Declare Function timeBeginPeriod _
>   Lib "winmm.dll" (ByVal uPeriod As Long) As Long
> Private Declare Function timeEndPeriod _
>   Lib "winmm.dll" (ByVal uPeriod As Long) As Long
> Private Declare Function timeGetTime _
>   Lib "winmm.dll" () As Long
> Private s1 As String, s2 As String ' test strings
>
> Public Function LevenshteinDistance(String1 As String, _
>   String2 As String) As Integer
> ' The OP's original code. Marco (Co)
> ' This code was originally written by
> ' Doug Steele, MVP  AccessH***@rogers.com
> 'http://I.Am/DougSteele
> ' You are free to use it in any application
> ' provided the copyright notice is left unchanged.
> On Error GoTo Err_LevenshteinDistance
> Dim intCost As Integer
> Dim intDistance() As Integer
> Dim intLenString1 As Integer
> Dim intLenString2 As Integer
> Dim intLoop1 As Integer
> Dim intLoop2 As Integer
> Dim strCurrCharString1 As String
> Dim strCurrCharString2 As String
> ' Step 1: If length of String1 is 0, return length
> ' of String2.
> '         If length of String2 is 0, return length
> ' of String1.
>     intLenString1 = Len(String1)
>     intLenString2 = Len(String2)
>     If intLenString1 = 0 Then
>         LevenshteinDistance = intLenString2
>     ElseIf intLenString2 = 0 Then
>         LevenshteinDistance = intLenString1
>     Else
> ' Step 2: Construct a matrix of the appropriate size.
> '         Initialize the first row and first column.
> ReDim intDistance(0 To intLenString1, 0 To intLenString2)
> For intLoop1 = 0 To intLenString1
> intDistance(intLoop1, 0) = intLoop1
> Next intLoop1
> For intLoop2 = 0 To intLenString2
>             intDistance(0, intLoop2) = intLoop2
>         Next intLoop2
> ' Step 3: Examine each character of String1
>   For intLoop1 = 1 To intLenString1
> strCurrCharString1 = Mid$(String1, intLoop1, 1)
> ' Step 4: Examine each character of String4
>     For intLoop2 = 1 To intLenString2
>     strCurrCharString2 = Mid$(String2, intLoop2, 1)
> ' Step 5: Calculate cost (if characters are equal,
> ' cost is 0, otherwise it's 1)
>     intCost = IIf(strCurrCharString1 = strCurrCharString2, 0, 1)
> ' Step 6: Set cell i,j of the matrix equal to the
> ' minimum of:
> '    a) the cell immediately above (i-1,j) plus 1
> '    b) the cell immediately to the left (i,j-1) plus 1
> '    c) the cell diagonally above and to the left
> '    (i-1,j-1)plus the cost
> intDistance(intLoop1, intLoop2) = _
> Minimum(intDistance(intLoop1 - 1, intLoop2) + 1, _
>     intDistance(intLoop1, intLoop2 - 1) + 1, _
>     intDistance(intLoop1 - 1, intLoop2 - 1) + intCost)
>     Next intLoop2
>         Next intLoop1
>     End If
> End_LevenshteinDistance:
> ' The Levenshein distance is the value in the cell in the
> ' bottom right-hand corner of the matrix
> LevenshteinDistance = intDistance(intLenString1, _
>   intLenString2)
>     Exit Function
> Err_LevenshteinDistance:
>     Err.Raise Err.Number, "LevenshteinDistance", _
>     Err.Description
>     Resume End_LevenshteinDistance
> End Function
>
> Private Function Minimum(ByVal i As Integer, ByVal J As Integer, _
>   ByVal k As Integer) As Integer
> ' returns the minimum of three values
> On Error GoTo End_Minimum
> Dim intMin As Integer
>     intMin = i
>     If J < intMin Then
>         intMin = J
>     End If
>     If k < intMin Then
>         intMin = k
>     End If
> End_Minimum:
>     Minimum = intMin
>     Exit Function
> Err_Minimum:
>     Err.Raise Err.Number, "Minimum", Err.Description
>     Resume End_Minimum
> End Function
>
> Public Function LevDist03A(String1 As String, _
>   String2 As String) As Long
> ' by Donald Lessau (VB newsgroup), 20041129
> ' (as modified by mike Williams)
> ' this is a VB implementation of the Levenstein
> ' Distance function which ranks words by their
> ' similarity
>   Dim D() As Long   ' matrix
>   Dim m As Long     ' length of String2
>   Dim n As Long     ' length of String1
>   Dim i As Long     ' iterates through String1
>   Dim J As Long     ' iterates through String2
>   Dim lCost As Long ' lCost
>   Dim b1() As Integer
>   Dim b2() As Integer
>   Dim sa1 As SAFEARRAY1D
>   Dim sa2 As SAFEARRAY1D
>   ' Step 1
>   n = Len(String1)
>   m = Len(String2)
>   If n = 0 Then
>     LevDist03A = m
>     Exit Function
>   End If
>   If m = 0 Then
>     LevDist03A = n
>     Exit Function
>   End If
>   ' Step 2: fill matrix
>   ReDim D(0 To n, 0 To m) As Long
>   For i = 0 To n
>     D(i, 0) = i
>   Next
>   For J = 0 To m
>     D(0, J) = J
>   Next
>   ' Step 3
>   ' (Mike's note: The SAFEARRAY stuff is my own
>   ' modification to speed it up over Donald's
>   ' original code which used Byte arrays filled
>   ' from the Strings by theStrConv function.
>   sa1.cDims = 1
>   sa1.cbElements = 2 ' 2 Bytes per Element
>   sa1.pvData = StrPtr(String1)
>   sa1.cElements = Len(String1) ' number of elements
>   ' sa1.lLbound is already zero
>   sa2.cDims = 1
>   sa2.cbElements = 2 ' 2 Bytes per Element
>   sa2.pvData = StrPtr(String2)
>   sa2.cElements = Len(String2) ' number of elements
>   ' sa1.lLbound is already zero
>   ' point both arrays at the String data
>   BindArray b1, VarPtr(sa1)
>   BindArray b2, VarPtr(sa2)
>   '
>   For i = 1 To n
>     For J = 1 To m
>       If b1(i - 1) = b2(J - 1) Then
>         lCost = 0
>       Else
>         lCost = 1
>       End If
>       D(i, J) = MinThree01(D(i - 1, J) + 1, _
>       D(i, J - 1) + 1, D(i - 1, J - 1) + lCost)
>     Next
>   Next
>   LevDist03A = D(n, m)
> ' Set the arrays back to their initial state
> ReleaseArray b1
> ReleaseArray b2
> End Function
>
> Public Function LevDist03(String1 As String, _
>   String2 As String) As Long
> ' by Donald Lessau (VB newsgroup), 20041129
> '
> ' this is a VB implementation of the Levenstein
> ' Distance function which ranks words by their
> ' similarity
>   Dim D() As Long   ' matrix
>   Dim m As Long     ' length of String2
>   Dim n As Long     ' length of String1
>   Dim i As Long     ' iterates through String1
>   Dim J As Long     ' iterates through String2
>   Dim lCost As Long ' lCost
>   Dim b1() As Byte
>   Dim b2() As Byte
>
>   ' Step 1
>   n = Len(String1)
>   m = Len(String2)
>   If n = 0 Then
>     LevDist03 = m
>     Exit Function
>   End If
>   If m = 0 Then
>     LevDist03 = n
>     Exit Function
>   End If
>   ' Step 2: fill matrix
>   ReDim D(0 To n, 0 To m) As Long
>   For i = 0 To n
>     D(i, 0) = i
>   Next
>   For J = 0 To m
>     D(0, J) = J
>   Next
>   ' Step 3
>   b1 = StrConv(String1, vbFromUnicode)
>   b2 = StrConv(String2, vbFromUnicode)
>
>   For i = 1 To n
>     For J = 1 To m
>       If b1(i - 1) = b2(J - 1) Then
>         lCost = 0
>       Else
>         lCost = 1
>       End If
>       D(i, J) = MinThree01(D(i - 1, J) + 1, _
>       D(i, J - 1) + 1, D(i - 1, J - 1) + lCost)
>     Next
>   Next
>   LevDist03 = D(n, m)
> End Function
>
> Private Function MinThree01(ByVal l1&, ByVal l2&, _
>   ByVal l3&) As Long
> ' by Donald, 20011116
>   If l1 < l2 Then
>     If l3 < l1 Then MinThree01 = l3 Else MinThree01 = l1
>   Else
>     If l2 < l3 Then MinThree01 = l2 Else MinThree01 = l3
>   End If
> End Function
>
> Private Sub Form_Load()
> timeBeginPeriod 1
> Text1.Text = "Prescriptions"
> Text2.Text = "Description"
> Command1.Caption = "OP's original code"
> Command2.Caption = "Don's first code"
> Command3.Caption = "Don's first code modified"
> Command4.Caption = "Don's second code"
> Command5.Caption = "Don's second code modified"
> Me.AutoRedraw = True
> End Sub
>
> Public Function Similarity04(String1 As String, _
>   String2 As String) As Double
> ' This is a completely different algorithm supplied
> ' later by Don
> ' by Donald, 20041128
> ' based on Similarity01 by Atul Brad Buono, 6/28/2000
> ' Ratcliff/Obershelp/Levenshtein method
>   Dim l1 As Long
>   Dim l2 As Long
>   Dim b1() As Byte
>   Dim b2() As Byte
>   ' make it case sensitive!
>   If String1 = String2 Then
>     Similarity04 = 1
>   Else
>     l1 = Len(String1)
>     l2 = Len(String2)
>     If l1 = 0 Or l2 = 0 Then
>       Similarity04 = 0
>     Else
>       b1 = StrConv(String1, vbFromUnicode)
>       b2 = StrConv(String2, vbFromUnicode)
>       Similarity04 = Similarity04helper _
>       (0, l1 - 1, 0, l2 - 1, b1, b2) * (2 / (l1 + l2))
>     End If
>   End If
> End Function
>
> Private Function Similarity04helper(st1 As Long, _
>   end1 As Long, st2 As Long, end2 As Long, _
>   b1() As Byte, b2() As Byte) As Long
> ' r e c u r s i v e
>   Dim c1 As Long
>   Dim c2 As Long
>   Dim ns1 As Long
>   Dim ns2 As Long
>   Dim i As Long
>   Dim Max As Long
>
>   If st1 > end1 Or st2 > end2 Or st1 < 0 Or st2 < 0 _
>   Then Exit Function
>
>   ns1 = -1
>   ns2 = -1
>   For c1 = st1 To end1
>     ' added for speed up: max cannot be beaten anymore
>     If end1 - c1 + 1 <= Max Then Exit For
>     For c2 = st2 To end2
>       i = 0
>       Do Until b1(c1 + i) <> b2(c2 + i)
>         i = i + 1
>         If i > Max Then
>           ns1 = c1
>           ns2
> ...
>
> meer lezen »

Mike,

Great stuff, just one remark.
Could you add another boolean so the user can choose if he wants to
run it case sensitive or not?

Marco
Author
13 Oct 2007 12:47 PM
Mike Williams
"Co" <vonclausow***@gmail.com> wrote in message
news:1192270828.787010.50550@t8g2000prg.googlegroups.com...

> Mike, Great stuff, just one remark. Could you add
> another boolean so the user can choose if he wants
> to run it case sensitive or not?

I've been wondering how long it would be before someone asked that question
;-)  It is of course not quite so simple as "adding another Boolean", as I'm
sure you are aware, but it can certainly be done. In the original code (the
code that does not include the SAFEARRAY stuff) you could very easily make
the comparison case insensitive by simply changing the following:

      b1 = StrConv(String1, vbFromUnicode)
      b2 = StrConv(String2, vbFromUnicode)

to:

      b1 = StrConv(LCase(String1), vbFromUnicode)
      b2 = StrConv(LCase(String2), vbFromUnicode)

However, in the modified faster code it is not quite so easy. The simplest
way would be to perform an OR of the data words with the value 32, but
before doing that it would be wise to investigate which is the best place to
actually do that. In the earlier versions (where the loops in the main
algorithm were equal to the product of the two string lengths) it would be
best to do it two the two Integer arrays themselves, making sure that you
revert the contents of those arrays back to their original condition at the
end of the routine to avoid altering the String data in the calling
procedure (because the original strings will respond to any changes you make
to the Integer data). However, in the latest and fastest routine (where the
loops in the new algorithm are much less than the product of the two string
lengths) it might be best to instead do it in the algorithm itself. I think
before I wrote any amendments I might be inclined to place a loop counter
inside the new algorithm and test it on various different input strings to
see how many loops occur and how many ORs are required in each loop, and
then make the decision regarding the placement of the amendment on the basis
of whether or not the number of required OR's in the main algorithm (on
average) exceeded twice the sum of the two string lengths.

I haven't actually got time to look into any of that stuff at the moment
(Saturday and the weekend and all that, and of course the all important
England versus France rugby on TV here in the UK tonight) but perhaps others
here might have a go at it. It would certainly be interesting to investigate
both methods I have suggested, and perhaps other alternative methods, so
that we end up with a flexible and yet still fast routine. It's quite
interesting stuff, this :-)

Mike
Author
14 Oct 2007 8:48 AM
Co
Show quote
On 13 okt, 14:47, "Mike Williams" <m***@WhiskyAndCoke.com> wrote:
> "Co" <vonclausow***@gmail.com> wrote in message
>
> news:1192270828.787010.50550@t8g2000prg.googlegroups.com...
>
> > Mike, Great stuff, just one remark. Could you add
> > another boolean so the user can choose if he wants
> > to run it case sensitive or not?
>
> I've been wondering how long it would be before someone asked that question
> ;-)  It is of course not quite so simple as "adding another Boolean", as I'm
> sure you are aware, but it can certainly be done. In the original code (the
> code that does not include the SAFEARRAY stuff) you could very easily make
> the comparison case insensitive by simply changing the following:
>
>       b1 = StrConv(String1, vbFromUnicode)
>       b2 = StrConv(String2, vbFromUnicode)
>
> to:
>
>       b1 = StrConv(LCase(String1), vbFromUnicode)
>       b2 = StrConv(LCase(String2), vbFromUnicode)
>
> However, in the modified faster code it is not quite so easy. The simplest
> way would be to perform an OR of the data words with the value 32, but
> before doing that it would be wise to investigate which is the best place to
> actually do that. In the earlier versions (where the loops in the main
> algorithm were equal to the product of the two string lengths) it would be
> best to do it two the two Integer arrays themselves, making sure that you
> revert the contents of those arrays back to their original condition at the
> end of the routine to avoid altering the String data in the calling
> procedure (because the original strings will respond to any changes you make
> to the Integer data). However, in the latest and fastest routine (where the
> loops in the new algorithm are much less than the product of the two string
> lengths) it might be best to instead do it in the algorithm itself. I think
> before I wrote any amendments I might be inclined to place a loop counter
> inside the new algorithm and test it on various different input strings to
> see how many loops occur and how many ORs are required in each loop, and
> then make the decision regarding the placement of the amendment on the basis
> of whether or not the number of required OR's in the main algorithm (on
> average) exceeded twice the sum of the two string lengths.
>
> I haven't actually got time to look into any of that stuff at the moment
> (Saturday and the weekend and all that, and of course the all important
> England versus France rugby on TV here in the UK tonight) but perhaps others
> here might have a go at it. It would certainly be interesting to investigate
> both methods I have suggested, and perhaps other alternative methods, so
> that we end up with a flexible and yet still fast routine. It's quite
> interesting stuff, this :-)
>
> Mike

So did you win?

Marco
Author
14 Oct 2007 10:39 AM
Mike Williams
"Co" <vonclausow***@gmail.com> wrote in message
news:1192351715.965996.316540@i38g2000prf.googlegroups.com...

> So did you win?

We certainly did, Marco. I would have thought that you might have watched it
yourself, living in Holland which is almost next door to France :-)  I'm not
a great fan of rugby really, in fact I know nothing at all about it. I
watched it only because it was a World Cup semi final and, more importantly,
because it was against our old enemy, France [:-)], but it was nevertheless
a quite enjoyable game.

Anyway, back to business :-)  Here is my latest code for the string
comparison routine. In fact I'll stick my neck out and say that it is the
final finished code (?). The following example contains only the latest
version, and not all the previous versions. It includes the extra
functionality you asked for, the choice of either Binary (case sensitive) or
Text (case insensitive) comparison. It seems to work quite well. Paste the
code into a VB Form containing two Text Boxes and one Command Button and two
Option Buttons. For speed comparison tests compile to native code using the
advanced optimizations "Remove Array Bounds Checks" and "Remove Integer
Overflow Checks", which I think is what Donald used in the test results he
previously posted. Check it out and let me know how it goes.

Mike

Option Explicit
Private Type SAFEARRAY1D
  cDims As Integer
  fFeatures As Integer
  cbElements As Long
  cLocks As Long
  pvData As Long
  cElements As Long
  lLbound As Long
End Type
Private Declare Sub BindArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
  Optional ByVal cb As Long = 4)
Private Declare Sub ReleaseArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, _
  Optional pSrc As Long = 0, _
  Optional ByVal cb As Long = 4)
Private Declare Function timeBeginPeriod _
  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeEndPeriod _
  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
Private Declare Function timeGetTime _
  Lib "winmm.dll" () As Long
Private s1 As String, s2 As String ' test strings

Private Sub Command1_Click()
Command1.Enabled = False
Dim n As Long
Dim p As Single, t1 As Long, t2 As Long
s1 = Text1.Text
s2 = Text2.Text
Dim cmpmode As Long
If Option1.Value = True Then
  cmpmode = vbBinaryCompare
Else
  cmpmode = vbTextCompare
End If
t1 = timeGetTime
For n = 1 To 10000
  p = Similarity07(s1, s2, cmpmode)
Next n
t2 = timeGetTime
Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
DoEvents
Command1.Enabled = True
End Sub

Private Sub Form_Load()
timeBeginPeriod 1
Option1.Value = True
Option1.Caption = "Binary Compare"
Option2.Caption = "Text Compare"
Text1.Text = "Brandenburg Concerto"
Text2.Text = "Brandenbrug Concerto"
Me.AutoRedraw = True
End Sub

Public Function Similarity07(String1 As String, _
  String2 As String, comparemode As Long) As Double
' This is Mike Williams's modification to Don Lessau's
' code which was based on Similarity01 by Atul Brad
' Buono using the Ratcliff/Obershelp/Levenshtein
' comparison algorithm.
' Mike Williams's modification includes the SAFEARRAY
' structure to avoid actually copying the string data
' to Byte Arrays and also an added extra parameter to
' enable the calling routine to specify either Binary
' or Text comparison.
' (14 October 2007).
' For testing purposes compile to native code using
' Advanced Optimizations to remove Array Bounds and
' Integer Overflow error checks.
  Dim l1 As Long
  Dim l2 As Long
  Dim b1() As Integer
  Dim b2() As Integer
  Dim sa1 As SAFEARRAY1D
  Dim sa2 As SAFEARRAY1D
      'The following comparison commented out because it
      'gives negligible improvement (and sometimes no
      'improvement)in speed for strings which are equal
      'whilst slowing down the code for strings which
      'are not equal (which will be the usual case
      'in the tasks for which this code will generally
      'be used).
    'If StrComp(String1, String2, comparemode) = 0 Then
    '  Similarity07 = 1
    '  Exit Function
    'End If
    l1 = Len(String1)
    l2 = Len(String2)
    If l1 = 0 Or l2 = 0 Then
      Similarity07 = 0
    Else
      sa1.cDims = 1
      sa1.cbElements = 2 ' 2 Bytes per Element (Integer)
      sa1.pvData = StrPtr(String1)
      sa1.cElements = Len(String1) ' number of elements
      ' sa1.lLbound is already zero
      sa2.cDims = 1
      sa2.cbElements = 2 ' 2 Bytes per Element (Integer)
      sa2.pvData = StrPtr(String2)
      sa2.cElements = Len(String2) ' number of elements
      ' sa1.lLbound is already zero
      ' point both arrays at the String data
      BindArray b1, VarPtr(sa1)
      BindArray b2, VarPtr(sa2)
      Similarity07 = Similarity07helper _
      (0, l1 - 1, 0, l2 - 1, b1, b2, comparemode) _
      * (2 / (l1 + l2))
    End If
  ' Set the arrays back to their initial state
  ReleaseArray b1
  ReleaseArray b2
End Function

Private Function Similarity07helper(st1 As Long, _
  end1 As Long, st2 As Long, end2 As Long, _
  b1() As Integer, b2() As Integer, cmp As Long) As Long
  Dim c1 As Long
  Dim c2 As Long
  Dim ns1 As Long
  Dim ns2 As Long
  Dim i As Long
  Dim Max As Long
  Dim z As Long
  If st1 > end1 Or st2 > end2 Or _
           st1 < 0 Or st2 < 0 Then
    Exit Function
  End If
  ns1 = -1
  ns2 = -1
  If cmp = vbBinaryCompare Then
    z = 0
  Else
    z = 32 ' OR each "character" with 32 for lower case
  End If
  For c1 = st1 To end1
    ' added for speed up: max cannot be beaten anymore
    If end1 - c1 + 1 <= Max Then
      Exit For
    End If
    For c2 = st2 To end2
      i = 0
      Do Until (b1(c1 + i) Or z) <> (b2(c2 + i) Or z)
        i = i + 1
        If i > Max Then
          ns1 = c1
          ns2 = c2
          Max = i
        End If
        If c1 + i > end1 Or c2 + i > end2 Then
          Exit Do
        End If
      Loop
    Next c2
  Next c1
  Max = Max + Similarity07helper(ns1 + Max, end1, _
  ns2 + Max, end2, b1, b2, cmp)
  Max = Max + Similarity07helper(st1, ns1 - 1, _
  st2, ns2 - 1, b1, b2, cmp)
  Similarity07helper = Max
End Function

Private Sub Form_Unload(Cancel As Integer)
timeEndPeriod 1
End Sub
Author
27 Nov 2007 7:49 PM
Robert Morley
Hi guys,

Sorry to resurrect an old thread, but I just had a chance to try the Similarity07 code, and I'm noticing a bug...just wondering if
my copy of the code got mangled during the copy & paste from the NG post in some minute way that I'm not seeing, or if it really is
a bug.

Debug.Print Similarity07("abc", "123", vbBinaryCompare) 'same results with vbTextCompare, though.
1
Debug.Print Similarity07("abc", "abc", vbBinaryCompare)
1


Rob

Show quote
"Mike Williams" <mi***@whiskyandCoke.com> wrote in message news:u9nTc6kDIHA.3332@TK2MSFTNGP04.phx.gbl...
> "Co" <vonclausow***@gmail.com> wrote in message news:1192351715.965996.316540@i38g2000prf.googlegroups.com...
>
> We certainly did, Marco. I would have thought that you might have watched it yourself, living in Holland which is almost next door
> to France :-)  I'm not a great fan of rugby really, in fact I know nothing at all about it. I watched it only because it was a
> World Cup semi final and, more importantly, because it was against our old enemy, France [:-)], but it was nevertheless a quite
> enjoyable game.
>
> Anyway, back to business :-)  Here is my latest code for the string comparison routine. In fact I'll stick my neck out and say
> that it is the final finished code (?). The following example contains only the latest version, and not all the previous versions.
> It includes the extra functionality you asked for, the choice of either Binary (case sensitive) or Text (case insensitive)
> comparison. It seems to work quite well. Paste the code into a VB Form containing two Text Boxes and one Command Button and two
> Option Buttons. For speed comparison tests compile to native code using the advanced optimizations "Remove Array Bounds Checks"
> and "Remove Integer Overflow Checks", which I think is what Donald used in the test results he previously posted. Check it out and
> let me know how it goes.
>
> Mike
>
> Option Explicit
> Private Type SAFEARRAY1D
>  cDims As Integer
>  fFeatures As Integer
>  cbElements As Long
>  cLocks As Long
>  pvData As Long
>  cElements As Long
>  lLbound As Long
> End Type
> Private Declare Sub BindArray Lib "kernel32" _
>  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
>  Optional ByVal cb As Long = 4)
> Private Declare Sub ReleaseArray Lib "kernel32" _
>  Alias "RtlMoveMemory" (pArr() As Any, _
>  Optional pSrc As Long = 0, _
>  Optional ByVal cb As Long = 4)
> Private Declare Function timeBeginPeriod _
>  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
> Private Declare Function timeEndPeriod _
>  Lib "winmm.dll" (ByVal uPeriod As Long) As Long
> Private Declare Function timeGetTime _
>  Lib "winmm.dll" () As Long
> Private s1 As String, s2 As String ' test strings
>
> Private Sub Command1_Click()
> Command1.Enabled = False
> Dim n As Long
> Dim p As Single, t1 As Long, t2 As Long
> s1 = Text1.Text
> s2 = Text2.Text
> Dim cmpmode As Long
> If Option1.Value = True Then
>  cmpmode = vbBinaryCompare
> Else
>  cmpmode = vbTextCompare
> End If
> t1 = timeGetTime
> For n = 1 To 10000
>  p = Similarity07(s1, s2, cmpmode)
> Next n
> t2 = timeGetTime
> Print p, (t2 - t1) & " milliseconds for 10,000 Calls."
> DoEvents
> Command1.Enabled = True
> End Sub
>
> Private Sub Form_Load()
> timeBeginPeriod 1
> Option1.Value = True
> Option1.Caption = "Binary Compare"
> Option2.Caption = "Text Compare"
> Text1.Text = "Brandenburg Concerto"
> Text2.Text = "Brandenbrug Concerto"
> Me.AutoRedraw = True
> End Sub
>
> Public Function Similarity07(String1 As String, _
>  String2 As String, comparemode As Long) As Double
> ' This is Mike Williams's modification to Don Lessau's
> ' code which was based on Similarity01 by Atul Brad
> ' Buono using the Ratcliff/Obershelp/Levenshtein
> ' comparison algorithm.
> ' Mike Williams's modification includes the SAFEARRAY
> ' structure to avoid actually copying the string data
> ' to Byte Arrays and also an added extra parameter to
> ' enable the calling routine to specify either Binary
> ' or Text comparison.
> ' (14 October 2007).
> ' For testing purposes compile to native code using
> ' Advanced Optimizations to remove Array Bounds and
> ' Integer Overflow error checks.
>  Dim l1 As Long
>  Dim l2 As Long
>  Dim b1() As Integer
>  Dim b2() As Integer
>  Dim sa1 As SAFEARRAY1D
>  Dim sa2 As SAFEARRAY1D
>      'The following comparison commented out because it
>      'gives negligible improvement (and sometimes no
>      'improvement)in speed for strings which are equal
>      'whilst slowing down the code for strings which
>      'are not equal (which will be the usual case
>      'in the tasks for which this code will generally
>      'be used).
>    'If StrComp(String1, String2, comparemode) = 0 Then
>    '  Similarity07 = 1
>    '  Exit Function
>    'End If
>    l1 = Len(String1)
>    l2 = Len(String2)
>    If l1 = 0 Or l2 = 0 Then
>      Similarity07 = 0
>    Else
>      sa1.cDims = 1
>      sa1.cbElements = 2 ' 2 Bytes per Element (Integer)
>      sa1.pvData = StrPtr(String1)
>      sa1.cElements = Len(String1) ' number of elements
>      ' sa1.lLbound is already zero
>      sa2.cDims = 1
>      sa2.cbElements = 2 ' 2 Bytes per Element (Integer)
>      sa2.pvData = StrPtr(String2)
>      sa2.cElements = Len(String2) ' number of elements
>      ' sa1.lLbound is already zero
>      ' point both arrays at the String data
>      BindArray b1, VarPtr(sa1)
>      BindArray b2, VarPtr(sa2)
>      Similarity07 = Similarity07helper _
>      (0, l1 - 1, 0, l2 - 1, b1, b2, comparemode) _
>      * (2 / (l1 + l2))
>    End If
>  ' Set the arrays back to their initial state
>  ReleaseArray b1
>  ReleaseArray b2
> End Function
>
> Private Function Similarity07helper(st1 As Long, _
>  end1 As Long, st2 As Long, end2 As Long, _
>  b1() As Integer, b2() As Integer, cmp As Long) As Long
>  Dim c1 As Long
>  Dim c2 As Long
>  Dim ns1 As Long
>  Dim ns2 As Long
>  Dim i As Long
>  Dim Max As Long
>  Dim z As Long
>  If st1 > end1 Or st2 > end2 Or _
>           st1 < 0 Or st2 < 0 Then
>    Exit Function
>  End If
>  ns1 = -1
>  ns2 = -1
>  If cmp = vbBinaryCompare Then
>    z = 0
>  Else
>    z = 32 ' OR each "character" with 32 for lower case
>  End If
>  For c1 = st1 To end1
>    ' added for speed up: max cannot be beaten anymore
>    If end1 - c1 + 1 <= Max Then
>      Exit For
>    End If
>    For c2 = st2 To end2
>      i = 0
>      Do Until (b1(c1 + i) Or z) <> (b2(c2 + i) Or z)
>        i = i + 1
>        If i > Max Then
>          ns1 = c1
>          ns2 = c2
>          Max = i
>        End If
>        If c1 + i > end1 Or c2 + i > end2 Then
>          Exit Do
>        End If
>      Loop
>    Next c2
>  Next c1
>  Max = Max + Similarity07helper(ns1 + Max, end1, _
>  ns2 + Max, end2, b1, b2, cmp)
>  Max = Max + Similarity07helper(st1, ns1 - 1, _
>  st2, ns2 - 1, b1, b2, cmp)
>  Similarity07helper = Max
> End Function
>
> Private Sub Form_Unload(Cancel As Integer)
> timeEndPeriod 1
> End Sub
Author
27 Nov 2007 8:21 PM
Mike Williams
"Robert Morley" <rmor***@magma.ca.N0.Freak1n.sparn> wrote in message
news:u25lnFTMIHA.5160@TK2MSFTNGP05.phx.gbl...

> Hi guys, Sorry to resurrect an old thread, but I just had a
> chance to try the Similarity07 code, and I'm noticing a bug
> ...just wondering if  my copy of the code got mangled
> during the copy & paste from the NG post in some minute
> way that I'm not seeing, or if it really is a bug.
> Debug.Print Similarity07("abc", "123", vbBinaryCompare) 1

I think there must be something wrong with the code you are using, perhaps
because of a faulty copy. The above test returns a value of 0 at this end,
which is exactly what you would expect. Here is the exact code I am using.
Paste it into a VB Form comtyaining one Command Button:

Mike

Option Explicit
Private Type SAFEARRAY1D
  cDims As Integer
  fFeatures As Integer
  cbElements As Long
  cLocks As Long
  pvData As Long
  cElements As Long
  lLbound As Long
End Type
Private Declare Sub BindArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
  Optional ByVal cb As Long = 4)
Private Declare Sub ReleaseArray Lib "kernel32" _
  Alias "RtlMoveMemory" (pArr() As Any, _
  Optional pSrc As Long = 0, _
  Optional ByVal cb As Long = 4)

Private Function Similarity07(String1 As String, _
  String2 As String, comparemode As Long) As Double
' This is Mike Williams's modification to Don Lessau's
' code which was based on Similarity01 by Atul Brad
' Buono using the Ratcliff/Obershelp/Levenshtein
' comparison algorithm.
' Mike Williams's modification includes the SAFEARRAY
' structure to avoid actually copying the string data
' to Byte Arrays and also an added extra parameter to
' enable the calling routine to specify either Binary
' or Text comparison.
' (14 October 2007).
' For testing purposes compile to native code using
' Advanced Optimizations to remove Array Bounds and
' Integer Overflow error checks.
  Dim l1 As Long
  Dim l2 As Long
  Dim b1() As Integer
  Dim b2() As Integer
  Dim sa1 As SAFEARRAY1D
  Dim sa2 As SAFEARRAY1D
      'The following comparison commented out because it
      'gives negligible improvement (and sometimes no
      'improvement) in speed for strings which are equal
      'whilst slowing down the code for strings which
      'are not equal (which will be the usual case
      'in the tasks for which this code will generally
      'be used).
    'If StrComp(String1, String2, comparemode) = 0 Then
    '  Similarity07 = 1
    '  Exit Function
    'End If
    l1 = Len(String1)
    l2 = Len(String2)
    If l1 = 0 Or l2 = 0 Then
      Similarity07 = 0
    Else
      sa1.cDims = 1
      sa1.cbElements = 2 ' 2 Bytes per Element (Integer)
      sa1.pvData = StrPtr(String1)
      sa1.cElements = Len(String1) ' number of elements
      ' sa1.lLbound is already zero
      sa2.cDims = 1
      sa2.cbElements = 2 ' 2 Bytes per Element (Integer)
      sa2.pvData = StrPtr(String2)
      sa2.cElements = Len(String2) ' number of elements
      ' sa1.lLbound is already zero
      ' point both arrays at the String data
      BindArray b1, VarPtr(sa1)
      BindArray b2, VarPtr(sa2)
      Similarity07 = Similarity07helper _
      (0, l1 - 1, 0, l2 - 1, b1, b2, comparemode) _
      * (2 / (l1 + l2))
    End If
  ' Set the arrays back to their initial state
  ReleaseArray b1
  ReleaseArray b2
End Function

Private Function Similarity07helper(st1 As Long, _
  end1 As Long, st2 As Long, end2 As Long, _
  b1() As Integer, b2() As Integer, cmp As Long) As Long
  Dim c1 As Long
  Dim c2 As Long
  Dim ns1 As Long
  Dim ns2 As Long
  Dim i As Long
  Dim Max As Long
  Dim z As Long
  If st1 > end1 Or st2 > end2 Or _
           st1 < 0 Or st2 < 0 Then
    Exit Function
  End If
  ns1 = -1
  ns2 = -1
  If cmp = vbBinaryCompare Then
    z = 0
  Else
    z = 32 ' OR each "character" with 32 for lower case
  End If
  For c1 = st1 To end1
    ' added for speed up: max cannot be beaten anymore
    If end1 - c1 + 1 <= Max Then
      Exit For
    End If
    For c2 = st2 To end2
      i = 0
      Do Until (b1(c1 + i) Or z) <> (b2(c2 + i) Or z)
        i = i + 1
        If i > Max Then
          ns1 = c1
          ns2 = c2
          Max = i
        End If
        If c1 + i > end1 Or c2 + i > end2 Then
          Exit Do
        End If
      Loop
    Next c2
  Next c1
  Max = Max + Similarity07helper(ns1 + Max, end1, _
  ns2 + Max, end2, b1, b2, cmp)
  Max = Max + Similarity07helper(st1, ns1 - 1, _
  st2, ns2 - 1, b1, b2, cmp)
  Similarity07helper = Max
End Function

Private Sub Command1_Click()
MsgBox Similarity07("abc", "123", vbBinaryCompare)
End Sub
Author
27 Nov 2007 11:31 PM
Robert Morley
Yeah, works this time.  Don't know what happened in the previous iteration...I must've goofed somewhere removing the ">" 's in the
forwarded message or something.  This time I cut & pasted into a new message rather than forwarding and everything worked fine.

Thanks, Mike.  Sorry to have bothered you. :-)


Rob

Show quote
"Mike Williams" <mi***@whiskyandCoke.com> wrote in message news:euavtMTMIHA.5208@TK2MSFTNGP04.phx.gbl...
> "Robert Morley" <rmor***@magma.ca.N0.Freak1n.sparn> wrote in message news:u25lnFTMIHA.5160@TK2MSFTNGP05.phx.gbl...
>
>> Hi guys, Sorry to resurrect an old thread, but I just had a
>> chance to try the Similarity07 code, and I'm noticing a bug
>> ...just wondering if  my copy of the code got mangled
>> during the copy & paste from the NG post in some minute
>> way that I'm not seeing, or if it really is a bug.
>> Debug.Print Similarity07("abc", "123", vbBinaryCompare) 1
>
> I think there must be something wrong with the code you are using, perhaps because of a faulty copy. The above test returns a
> value of 0 at this end, which is exactly what you would expect. Here is the exact code I am using. Paste it into a VB Form
> comtyaining one Command Button:
>
> Mike
>
> Option Explicit
> Private Type SAFEARRAY1D
>  cDims As Integer
>  fFeatures As Integer
>  cbElements As Long
>  cLocks As Long
>  pvData As Long
>  cElements As Long
>  lLbound As Long
> End Type
> Private Declare Sub BindArray Lib "kernel32" _
>  Alias "RtlMoveMemory" (pArr() As Any, pSrc As Long, _
>  Optional ByVal cb As Long = 4)
> Private Declare Sub ReleaseArray Lib "kernel32" _
>  Alias "RtlMoveMemory" (pArr() As Any, _
>  Optional pSrc As Long = 0, _
>  Optional ByVal cb As Long = 4)
>
> Private Function Similarity07(String1 As String, _
>  String2 As String, comparemode As Long) As Double
> ' This is Mike Williams's modification to Don Lessau's
> ' code which was based on Similarity01 by Atul Brad
> ' Buono using the Ratcliff/Obershelp/Levenshtein
> ' comparison algorithm.
> ' Mike Williams's modification includes the SAFEARRAY
> ' structure to avoid actually copying the string data
> ' to Byte Arrays and also an added extra parameter to
> ' enable the calling routine to specify either Binary
> ' or Text comparison.
> ' (14 October 2007).
> ' For testing purposes compile to native code using
> ' Advanced Optimizations to remove Array Bounds and
> ' Integer Overflow error checks.
>  Dim l1 As Long
>  Dim l2 As Long
>  Dim b1() As Integer
>  Dim b2() As Integer
>  Dim sa1 As SAFEARRAY1D
>  Dim sa2 As SAFEARRAY1D
>      'The following comparison commented out because it
>      'gives negligible improvement (and sometimes no
>      'improvement) in speed for strings which are equal
>      'whilst slowing down the code for strings which
>      'are not equal (which will be the usual case
>      'in the tasks for which this code will generally
>      'be used).
>    'If StrComp(String1, String2, comparemode) = 0 Then
>    '  Similarity07 = 1
>    '  Exit Function
>    'End If
>    l1 = Len(String1)
>    l2 = Len(String2)
>    If l1 = 0 Or l2 = 0 Then
>      Similarity07 = 0
>    Else
>      sa1.cDims = 1
>      sa1.cbElements = 2 ' 2 Bytes per Element (Integer)
>      sa1.pvData = StrPtr(String1)
>      sa1.cElements = Len(String1) ' number of elements
>      ' sa1.lLbound is already zero
>      sa2.cDims = 1
>      sa2.cbElements = 2 ' 2 Bytes per Element (Integer)
>      sa2.pvData = StrPtr(String2)
>      sa2.cElements = Len(String2) ' number of elements
>      ' sa1.lLbound is already zero
>      ' point both arrays at the String data
>      BindArray b1, VarPtr(sa1)
>      BindArray b2, VarPtr(sa2)
>      Similarity07 = Similarity07helper _
>      (0, l1 - 1, 0, l2 - 1, b1, b2, comparemode) _
>      * (2 / (l1 + l2))
>    End If
>  ' Set the arrays back to their initial state
>  ReleaseArray b1
>  ReleaseArray b2
> End Function
>
> Private Function Similarity07helper(st1 As Long, _
>  end1 As Long, st2 As Long, end2 As Long, _
>  b1() As Integer, b2() As Integer, cmp As Long) As Long
>  Dim c1 As Long
>  Dim c2 As Long
>  Dim ns1 As Long
>  Dim ns2 As Long
>  Dim i As Long
>  Dim Max As Long
>  Dim z As Long
>  If st1 > end1 Or st2 > end2 Or _
>           st1 < 0 Or st2 < 0 Then
>    Exit Function
>  End If
>  ns1 = -1
>  ns2 = -1
>  If cmp = vbBinaryCompare Then
>    z = 0
>  Else
>    z = 32 ' OR each "character" with 32 for lower case
>  End If
>  For c1 = st1 To end1
>    ' added for speed up: max cannot be beaten anymore
>    If end1 - c1 + 1 <= Max Then
>      Exit For
>    End If
>    For c2 = st2 To end2
>      i = 0
>      Do Until (b1(c1 + i) Or z) <> (b2(c2 + i) Or z)
>        i = i + 1
>        If i > Max Then
>          ns1 = c1
>          ns2 = c2
>          Max = i
>        End If
>        If c1 + i > end1 Or c2 + i > end2 Then
>          Exit Do
>        End If
>      Loop
>    Next c2
>  Next c1
>  Max = Max + Similarity07helper(ns1 + Max, end1, _
>  ns2 + Max, end2, b1, b2, cmp)
>  Max = Max + Similarity07helper(st1, ns1 - 1, _
>  st2, ns2 - 1, b1, b2, cmp)
>  Similarity07helper = Max
> End Function
>
> Private Sub Command1_Click()
> MsgBox Similarity07("abc", "123", vbBinaryCompare)
> End Sub
>
>
>
>
>
Author
27 Nov 2007 11:43 PM
Robert Morley
Next question:  you've got BindArray calls inside an If statement, but the ReleaseArray calls are outside...is that right?


Rob
Author
28 Nov 2007 7:51 AM
Mike Williams
"Robert Morley" <rmor***@magma.ca.N0.Freak1n.sparn> wrote in message
news:uHDyE9UMIHA.5140@TK2MSFTNGP05.phx.gbl...

> Next question:  you've got BindArray calls inside an If statement,
> but the ReleaseArray calls are outside...is that right?

Thanks for noticing that, Bob. You're quite right, the ReleaseArray calls
should be inside the If . .  End if block. In this particular case having
them outside the block (where I had inadvertently put them) doesn't actually
do any harm, because ReleaseArray merely sets the b1() and b2() arrays back
to their "empty" state so they have no elements at all, but it really should
have been done inside the If . . . End If block as you have said.

Mike
Author
28 Nov 2007 7:51 AM
Mike Williams
"Robert Morley" <rmor***@magma.ca.N0.Freak1n.sparn> wrote in message
news:uHDyE9UMIHA.5140@TK2MSFTNGP05.phx.gbl...

> Next question:  you've got BindArray calls inside an If statement,
> but the ReleaseArray calls are outside...is that right?

Thanks for noticing that, Bob. You're quite right, the ReleaseArray calls
should be inside the If . .  End if block. In this particular case having
them outside the block (where I had inadvertently put them) doesn't actually
do any harm, because ReleaseArray merely sets the b1() and b2() arrays back
to their "empty" state so they have no elements at all, but it really should
have been done inside the If . . . End If block as you have said.

Mike
Author
28 Nov 2007 8:52 AM
Donald Lessau
"Robert Morley" <rmor***@magma.ca.N0.Freak1n.sparn> schrieb im Newsbeitrag
news:uHDyE9UMIHA.5140@TK2MSFTNGP05.phx.gbl...
> Next question:  you've got BindArray calls inside an If statement, but the
> ReleaseArray calls are outside...is that right?

Thanks for noticing.

Don

AddThis Social Bookmark Button