Levenshtein distance in VBA

copycodes 2021. 1. 6. 08:32

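I have an Excel sheet with data for which I want to get the Levenshtein distance. I have already tried exporting it as text, reading it in from a script (PHP), running Levenshtein (computing the Levenshtein distance), and saving it back to Excel.

But I am looking for a way to calculate the Levenshtein distance programmatically in VBA. How would I go about doing that?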


Translated from Wikipedia:

Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function

? Levenshtein("saturday", "sunday")  ' prints 3 in the Immediate window
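
Since the question is about data sitting in an Excel sheet, here is a minimal sketch of applying the function row by row. The Sub name FillLevenshteinColumn and the column layout (strings in columns A and B, result in column C, data starting in row 1 of the active sheet) are assumptions for illustration, and a Levenshtein function like the one above is assumed to be in the same project.

```vba
' Hypothetical helper: writes Levenshtein(A, B) into column C for every row.
' Assumes the two strings are in columns A and B of the active sheet and
' that a Levenshtein function is defined elsewhere in the project.
Public Sub FillLevenshteinColumn()
    Dim ws As Worksheet
    Dim lastRow As Long
    Dim r As Long

    Set ws = ActiveSheet
    ' Last used row in column A
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    For r = 1 To lastRow
        ws.Cells(r, "C").Value = Levenshtein(CStr(ws.Cells(r, "A").Value), _
                                             CStr(ws.Cells(r, "B").Value))
    Next r
End Sub
```

If the function lives in a standard module, it should also be callable directly from a cell, for example =Levenshtein(A1, B1), which avoids the loop entirely.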


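Thanks to smirkingman for the nice code post. Here is an optimized version.

1) Use Asc(Mid$(s1, i, 1)) for the comparison instead. Numeric comparison is generally faster than text comparison.

2) Use Mid$ instead of Mid: Mid is the Variant version, and adding $ gives the String version.

3) Use the application's worksheet function for Min. (Personal preference only.)

4) Use Long instead of Integer.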

Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long

Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long

string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)

For i = 0 To string1_length
    distance(i, 0) = i
Next

For j = 0 To string2_length
    distance(0, j) = j
Next

For i = 1 To string1_length
    For j = 1 To string2_length
        If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
            distance(i, j) = distance(i - 1, j - 1)
        Else
            distance(i, j) = Application.WorksheetFunction.Min _
            (distance(i - 1, j) + 1, _
             distance(i, j - 1) + 1, _
             distance(i - 1, j - 1) + 1)
        End If
    Next
Next

Levenshtein = distance(string1_length, string2_length)

End Function

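Update:

For those who want it: I think it is safe to say that most people use the Levenshtein distance to calculate fuzzy match percentages. Here is one way to do that. I have added an optimization that lets you specify the minimum match percentage to return. The default is 70% or more; you enter percentages like "50" or "80", or "0" to run the formula regardless.

The speed boost comes from the fact that the function first compares the lengths of the two strings to check whether a match within the percentage you gave is even possible. There are some areas where this function could be optimized further, but I have kept it as it is for the sake of readability. I concatenated the distance into the result as proof that it works, but you can fiddle with that :)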

Function FuzzyMatch(ByVal string1 As String, _
                    ByVal string2 As String, _
                    Optional min_percentage As Long = 70) As String

Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long, result As Long

string1_length = Len(string1)
string2_length = Len(string2)

' Check that string1 is not too short relative to string2
If string1_length >= string2_length * (min_percentage / 100) Then
    ' Check that string1 is not too long relative to string2
    If string1_length <= string2_length * ((200 - min_percentage) / 100) Then

        ReDim distance(string1_length, string2_length)
        For i = 0 To string1_length: distance(i, 0) = i: Next
        For j = 0 To string2_length: distance(0, j) = j: Next

        For i = 1 To string1_length
            For j = 1 To string2_length
                If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
                    distance(i, j) = distance(i - 1, j - 1)
                Else
                    distance(i, j) = Application.WorksheetFunction.Min _
                    (distance(i - 1, j) + 1, _
                     distance(i, j - 1) + 1, _
                     distance(i - 1, j - 1) + 1)
                End If
            Next
        Next
        result = distance(string1_length, string2_length) 'The distance
    End If
End If

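' result = 0 means either identical strings or a failed length check,
' so identical non-empty strings are reported as a 100% match
If result <> 0 Or (string1_length > 0 And string1 = string2) Then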
    FuzzyMatch = (CLng((100 - ((result / string1_length) * 100)))) & _
                 "% (" & result & ")" 'Convert to percentage
Else
    FuzzyMatch = "Not a match"
End If

End Function
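
The length pre-check used in FuzzyMatch can be looked at in isolation. The sketch below pulls the same two comparisons out into a helper; the name LengthsWithinRange is hypothetical and not part of the original answer.

```vba
' Hypothetical helper mirroring the length pre-check in FuzzyMatch.
' Returns True when Len(string1) lies between min_percentage % and
' (200 - min_percentage) % of Len(string2).
Function LengthsWithinRange(ByVal string1 As String, _
                            ByVal string2 As String, _
                            Optional ByVal min_percentage As Long = 70) As Boolean
    Dim len1 As Long, len2 As Long

    len1 = Len(string1)
    len2 = Len(string2)

    LengthsWithinRange = (len1 >= len2 * (min_percentage / 100)) And _
                         (len1 <= len2 * ((200 - min_percentage) / 100))
End Function
```

With the default of 70 and a 10-character string2, string1 needs to be roughly 7 to 13 characters long for the distance to be computed at all. Passing 0 widens the window to anything up to twice the length of string2.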

Using a byte array for a 17x speed-up

  Option Explicit

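  ' 64-bit Office requires PtrSafe on API declarations
  #If VBA7 Then
  Public Declare PtrSafe Function GetTickCount Lib "kernel32" () As Long
  #Else
  Public Declare Function GetTickCount Lib "kernel32" () As Long
  #End If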

  Sub test()
  Dim s1 As String, s2 As String, lTime As Long, i As Long
  s1 = Space(100)
  s2 = String(100, "a")
  lTime = GetTickCount
  For i = 1 To 100
     LevenshteinStrings s1, s2  ' the original string-based fn from Wikibooks and Stack Overflow (not defined in this post)
  Next
  Debug.Print GetTickCount - lTime; " ms" '  3900  ms for all diff

  lTime = GetTickCount
  For i = 1 To 100
     Levenshtein s1, s2
  Next
  Debug.Print GetTickCount - lTime; " ms" ' 234  ms

  End Sub

  'Option Base 0 assumed

  'POB: fn with byte array is 17 times faster
  Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long

  Dim i As Long, j As Long, bs1() As Byte, bs2() As Byte
  Dim string1_length As Long
  Dim string2_length As Long
  Dim distance() As Long
  Dim min1 As Long, min2 As Long, min3 As Long

  string1_length = Len(string1)
  string2_length = Len(string2)
  ReDim distance(string1_length, string2_length)
  bs1 = string1
  bs2 = string2

  For i = 0 To string1_length
      distance(i, 0) = i
  Next

  For j = 0 To string2_length
      distance(0, j) = j
  Next

  For i = 1 To string1_length
      For j = 1 To string2_length
          'slow way: If Mid$(string1, i, 1) = Mid$(string2, j, 1) Then
          If bs1((i - 1) * 2) = bs2((j - 1) * 2) Then   ' *2 because each character takes 2 bytes; only the low byte is compared, which assumes the high byte is 0 (ASCII/Latin-1 text)
              distance(i, j) = distance(i - 1, j - 1)
          Else
              'distance(i, j) = Application.WorksheetFunction.Min _
              '(distance(i - 1, j) + 1, _
              ' distance(i, j - 1) + 1, _
              ' distance(i - 1, j - 1) + 1)
              ' spell it out, 50 times faster than worksheetfunction.min
              min1 = distance(i - 1, j) + 1
              min2 = distance(i, j - 1) + 1
              min3 = distance(i - 1, j - 1) + 1
              If min1 <= min2 And min1 <= min3 Then
                  distance(i, j) = min1
              ElseIf min2 <= min1 And min2 <= min3 Then
                  distance(i, j) = min2
              Else
                  distance(i, j) = min3
              End If

          End If
      Next
  Next

  Levenshtein = distance(string1_length, string2_length)

  End Function
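
To see why the index is multiplied by 2, it helps to look at what assigning a String to a Byte array actually produces. The helper below is a small sketch written for this purpose; the name StringBytes is hypothetical.

```vba
' Hypothetical helper: returns the raw bytes of a VBA string as text,
' e.g. for inspecting what "bs1 = string1" puts into the byte array.
Function StringBytes(ByVal s As String) As String
    Dim b() As Byte
    Dim k As Long
    Dim parts() As String

    If Len(s) = 0 Then Exit Function    ' empty string: return ""

    b = s                               ' copies the string's 2-byte characters
    ReDim parts(LBound(b) To UBound(b))
    For k = LBound(b) To UBound(b)
        parts(k) = CStr(b(k))
    Next k
    StringBytes = Join(parts, " ")
End Function
```

StringBytes("AB") gives "65 0 66 0": each character occupies two bytes, low byte first. StringBytes(ChrW(&H141)) gives "65 1", so a comparison that looks only at the low byte would treat that character as equal to "A". For text outside ASCII/Latin-1, comparing both bytes of each character would be the safer choice.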

I think it got even faster... I did not do much other than improve the previous code for speed and return the result as a percentage.

' Levenshtein3 tweaked for ULTIMATE speed and CORRECT results
' Solution based on Longs
' Intermediate arrays holding Asc() make a difference
' even fixed-length arrays have an impact on speed (a small one)
' Note: the fixed array bounds limit string1 to 60 and string2 to 50 characters
' Levenshtein version 3 will return a correct percentage
'
Function Levenshtein3(ByVal string1 As String, ByVal string2 As String) As Long

Dim i As Long, j As Long, string1_length As Long, string2_length As Long
Dim distance(0 To 60, 0 To 50) As Long, smStr1(1 To 60) As Long, smStr2(1 To 50) As Long
Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long

string1_length = Len(string1):  string2_length = Len(string2)

distance(0, 0) = 0
For i = 1 To string1_length:    distance(i, 0) = i: smStr1(i) = Asc(LCase(Mid$(string1, i, 1))): Next
For j = 1 To string2_length:    distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(string2, j, 1))): Next
For i = 1 To string1_length
    For j = 1 To string2_length
        If smStr1(i) = smStr2(j) Then
            distance(i, j) = distance(i - 1, j - 1)
        Else
            min1 = distance(i - 1, j) + 1
            min2 = distance(i, j - 1) + 1
            min3 = distance(i - 1, j - 1) + 1
            If min2 < min1 Then
                If min2 < min3 Then minmin = min2 Else minmin = min3
            Else
                If min1 < min3 Then minmin = min1 Else minmin = min3
            End If
            distance(i, j) = minmin
        End If
    Next
Next

' Levenshtein3 will properly return a percent match (100%=exact) based on similarities and Lengths etc...
MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length
Levenshtein3 = 100 - CLng((distance(string1_length, string2_length) * 100) / MaxL)

End Function
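
The percentage step at the end of Levenshtein3 can be sketched on its own. The helper name SimilarityPercent is hypothetical, and returning 100 for two empty strings is an assumption added here to avoid a division by zero; the original function does not handle that case.

```vba
' Hypothetical helper: converts an edit distance into a percent match
' using the same formula as Levenshtein3 (100 = exact match).
Function SimilarityPercent(ByVal dist As Long, _
                           ByVal len1 As Long, _
                           ByVal len2 As Long) As Long
    Dim maxLen As Long

    ' Normalise by the longer of the two strings
    maxLen = len1
    If len2 > maxLen Then maxLen = len2

    If maxLen = 0 Then
        SimilarityPercent = 100   ' assumption: two empty strings count as identical
    Else
        SimilarityPercent = 100 - CLng((dist * 100) / maxLen)
    End If
End Function
```

For "saturday" and "sunday" the distance is 3 and the longer string has 8 characters, so SimilarityPercent(3, 8, 6) evaluates 100 - CLng(37.5). CLng rounds halves to the nearest even number, which gives 38 and a result of 62.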

Reference URL: https://stackoverflow.com/questions/4243036/levenshtein-distance-in-vba
