Home All Groups Group Topic Archive Search About
Author
12 Oct 2005 8:44 AM
Ivan Debono
Hi all,

This is a followup to my previous post entitled 'Designing an Expert
System'. The purpose is to parse a line of text and determine which is the
best match from a list. In this case I need to parse ingredient lines.

The routine below performs the following tasks:
1. Get soundex code of the text to parse and clean up
2. Split the text into words
3. Try to match in the tINGS recordset, and if anything is found, give it a
high hit of 5
4. If no match found, try to match by soundex codes, and if anything found,
give it a hit of 4
5. Loop through all the individual words and:
        a) Find all matches for the word
        b) Add it to the matchlist with an initial hit of 1
        c) Determine a better hit if the word starts with the same 3 letters
6) The list of possible matches is sort by hits descending.

The routine is good but far from perfect, and there's lots of room for
improvement.

Public Type typHit
    id As Long
    name As String
    hits As Long
End Type

Private Sub MatchIngredient(byval sIngredientName as String)
        sOrig = sIngredientName
        sXOri = SoundEx(sOrig)
        sName = Replace(Replace(Replace(Replace(sOrig, "-", " "), "/", " "),
";", vbNullString), "*", vbNullString)
        aName = Split(sName, " ")

        tINGS.Filter = "name LIKE '%" & Replace(sName, "'", "''") & "%' OR "
& _
                       "plural LIKE '%" & Replace(sName, "'", "''") & "%'"

        If tINGS.RecordCount > 0 Then
            Do While Not tINGS.EOF
                With tHit
                    .id = tINGS!id
                    .name = tINGS!name
                    .hits = 5
                End With
                oCol.Add tHit, CStr(tINGS!id & tINGS!name)

                tINGS.MoveNext
            Loop
        Else
            If chkSoundex.Value = vbChecked Then
                tINGS.Filter = "soundex = '" & sXOri & "'"
                Do While Not tINGS.EOF
                    With tHit
                        .id = tINGS!id
                        .name = tINGS!name
                        .hits = 4
                    End With
                    oCol.Add tHit, CStr(tINGS!id & tINGS!name)

                    tINGS.MoveNext
                Loop
            End If
        End If

        For iName = 0 To UBound(aName)
            sName = LCase$(Trim$(aName(iName)))
            If (Len(sName) > 2) And (Not IsNumeric(sName)) Then
                tINGS.Filter = "name LIKE '%" & Replace(sName, "'", "''") &
"%' OR " & _
                               "plural LIKE '%" & Replace(sName, "'", "''")
& "%'"

                Do While Not tINGS.EOF
                    sKey = tINGS!id & tINGS!name
                    With tHit
                        .id = tINGS!id
                        .name = tINGS!name
                        .hits = 1
                    End With

                    On Error Resume Next
                        oCol.Add tHit, sKey
                        If Err.Number <> 0 Then
                            tHit = oCol(sKey)
                        End If
                    On Error GoTo trap_err

                    If (InStr(tINGS!name, " ") > 0) Then
                        If LCase$(Left$(sName, 3)) =
LCase$(Left$(tINGS!name, 3)) Then
                            If LCase$(Left$(sOrig, 3)) =
LCase$(Left$(tINGS!name, 3)) Then
                                tHit.hits = tHit.hits + 2
                            Else
                                tHit.hits = tHit.hits + 1
                            End If
                        Else
                            If chkSoundex.Value = vbChecked Then
                                sXName = SoundEx(tINGS!name)
                                If sXName <> "00000" And sXName = sXOri Then
                                    tHit.hits = tHit.hits + 2
                                Else
                                    tHit.hits = tHit.hits + 0.5
                                End If
                            Else
                                tHit.hits = tHit.hits + 0.5
                            End If
                        End If
                    Else
                        If LCase$(Left$(sName, 3)) =
LCase$(Left$(tINGS!name, 3)) Then
                            If LCase$(Left$(sOrig, 3)) =
LCase$(Left$(tINGS!name, 3)) Then
                                tHit.hits = tHit.hits + 2
                            Else
                                tHit.hits = tHit.hits + 1.5
                            End If
                        Else
                            tHit.hits = tHit.hits + 1
                        End If
                    End If

                    With oCol
                        .Remove sKey
                        .Add tHit, sKey
                    End With

                    tINGS.MoveNext
                Loop
            End If
        Next iName
End Sub

The following is a list of text it needs to match and the results with their
relevant position and hits:

EXAMPLE 1:
To match:
pepper, red, lg
Found:
1. bell pepper, roasted - 2
2. chili pepper, crushed - 2
....
11. red pepper, crushed - 2
Problem:
All possibilities ha a hit of 2 but the most logical match is at position
11. This should have a higher hit.

EXAMPLE 2:
To match:
pepperoncinis
Found:
1. Pepper Mix - 4
2. Pepperoncini - 4
3. Pepperoni Sausage - 4
Problem.
Same as example 1.

EXAMPLE 3:
roasted onions
Found:
1. Roasted Green Chiles -3
2. Roasted Pecans - 3
3. Roasted Root - 3
4. Almond, roasted - 2
....
38. Onion, roasted - 2
Problem.
Logical match is at position 38!!!

If somehow can suggest further rules or changes to the matching system...
well... I'd really appreciate it!!

Thanks,
Ivan

Author
12 Oct 2005 11:58 AM
Larry Serflaten
"Ivan Debono" <ivanm***@hotmail.com> wrote
>
> This is a followup to my previous post entitled 'Designing an Expert
> System'. The purpose is to parse a line of text and determine which is the
> best match from a list. In this case I need to parse ingredient lines.

<snipped for brievity>

> If somehow can suggest further rules or changes to the matching system...
> well... I'd really appreciate it!!

I typed in a suggestion days ago but then deleted it because there were
too many loop holes, but I am going to repeat it today, just in case it
inspires others thoughts to help resolve the problem.

What I suggested was to keep an indgrediant table with two fields, one
for named ingrediants, and another for common variations:

Ingrediant: cilantro
Variations: |Cilantro -- finely chopped|Leaves of Cilantro -- optional|Chopped cilantro

Now when you get your listing, you can ask the DB to check for a match in the
variations text and if you get a result back, that probably is your ingrediant.  If
don't get any hits, then you could go word by word looking for a match in the
ingrediant field.  Again, if you find a match, that may be your ingrediant.  You could
then add that new variation to the variation field so if you see it again, there would
be a match....

Trouble arises when the ingrediant is actully a combination of other ingrediants
like 'chocolate milk', or 'strawberry yogurt'.  As I said there were a few loopholes,
but it might be another way to attack that problem....

LFS
Author
12 Oct 2005 12:17 PM
Ivan Debono
Show quote Hide quote
"Larry Serflaten" <serfla***@usinternet.com> schrieb im Newsbeitrag
news:u6WjUNyzFHA.268@TK2MSFTNGP09.phx.gbl...
>
> "Ivan Debono" <ivanm***@hotmail.com> wrote
> >
> > This is a followup to my previous post entitled 'Designing an Expert
> > System'. The purpose is to parse a line of text and determine which is
the
> > best match from a list. In this case I need to parse ingredient lines.
>
> <snipped for brievity>
>
> > If somehow can suggest further rules or changes to the matching
system...
> > well... I'd really appreciate it!!
>
> I typed in a suggestion days ago but then deleted it because there were
> too many loop holes, but I am going to repeat it today, just in case it
> inspires others thoughts to help resolve the problem.
>
> What I suggested was to keep an indgrediant table with two fields, one
> for named ingrediants, and another for common variations:
>
> Ingrediant: cilantro
> Variations: |Cilantro -- finely chopped|Leaves of Cilantro --
optional|Chopped cilantro

This is exactly what I have (or had). The variations table included a memo
field with a concatenated string just like the one above. I restructered the
table to be a 1-to-many relationship with the main ingredients table. It
solves lots of problems in other places.

> Now when you get your listing, you can ask the DB to check for a match in
the
> variations text and if you get a result back, that probably is your
ingrediant.  If
> don't get any hits, then you could go word by word looking for a match in
the
> ingrediant field.  Again, if you find a match, that may be your
ingrediant.  You could
> then add that new variation to the variation field so if you see it again,
there would
> be a match....

When I'm parsing the text I:
1. Check the original table.
2. If nothing is found, I check the variations table
3. If nothing is found there, I use the learning routine to determine a
possible match with the highest hit and add it to the variations table with
an unlearned flag

One might wonder why is step 3 necessary. The reason is simple. The match
step 3 tried to find (based on the code in the original post), could or
could not be the correct (or logical or sensible) match. This is because of
the loopholes that you  mentioned just below and other inperfections in the
algorithm.

After the parsing is done, I fire up a little tool that lists the unlearned
variations on one side and the original ingredients table on the other. For
every unlearned record, I perform the same algorithm to list all possible
matches this time, highlighting the ingredient matched in step 3 above. I
can then change the variation record to match the correct record from the
main table.

> Trouble arises when the ingrediant is actully a combination of other
ingrediants
> like 'chocolate milk', or 'strawberry yogurt'.  As I said there were a few
loopholes,
> but it might be another way to attack that problem....

Hmmm... and I'm interested in the other way or ways!!

Ivan