|
code
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Lexical analysisThis is a followup to my previous post entitled 'Designing an Expert System'. The purpose is to parse a line of text and determine which is the best match from a list. In this case I need to parse ingredient lines. The routine below performs the following tasks: 1. Get soundex code of the text to parse and clean up 2. Split the text into words 3. Try to match in the tINGS recordset, and if anything is found, give it a high hit of 5 4. If no match found, try to match by soundex codes, and if anything found, give it a hit of 4 5. Loop through all the individual words and: a) Find all matches for the word b) Add it to the matchlist with an initial hit of 1 c) Determine a better hit if the word starts with the same 3 letters 6) The list of possible matches is sort by hits descending. The routine is good but far from perfect, and there's lots of room for improvement. Public Type typHit id As Long name As String hits As Long End Type Private Sub MatchIngredient(byval sIngredientName as String) sOrig = sIngredientName sXOri = SoundEx(sOrig) sName = Replace(Replace(Replace(Replace(sOrig, "-", " "), "/", " "), ";", vbNullString), "*", vbNullString) aName = Split(sName, " ") tINGS.Filter = "name LIKE '%" & Replace(sName, "'", "''") & "%' OR " & _ "plural LIKE '%" & Replace(sName, "'", "''") & "%'" If tINGS.RecordCount > 0 Then Do While Not tINGS.EOF With tHit .id = tINGS!id .name = tINGS!name .hits = 5 End With oCol.Add tHit, CStr(tINGS!id & tINGS!name) tINGS.MoveNext Loop Else If chkSoundex.Value = vbChecked Then tINGS.Filter = "soundex = '" & sXOri & "'" Do While Not tINGS.EOF With tHit .id = tINGS!id .name = tINGS!name .hits = 4 End With oCol.Add tHit, CStr(tINGS!id & tINGS!name) tINGS.MoveNext Loop End If End If For iName = 0 To UBound(aName) sName = LCase$(Trim$(aName(iName))) If (Len(sName) > 2) And (Not IsNumeric(sName)) Then tINGS.Filter = "name LIKE '%" & Replace(sName, "'", "''") & "%' OR " & _ "plural LIKE '%" & Replace(sName, "'", "''") & "%'" Do While Not tINGS.EOF sKey = tINGS!id & tINGS!name With tHit .id = tINGS!id .name = tINGS!name .hits = 1 End With On Error Resume Next oCol.Add tHit, sKey If Err.Number <> 0 Then tHit = oCol(sKey) End If On Error GoTo trap_err If (InStr(tINGS!name, " ") > 0) Then If LCase$(Left$(sName, 3)) = LCase$(Left$(tINGS!name, 3)) Then If LCase$(Left$(sOrig, 3)) = LCase$(Left$(tINGS!name, 3)) Then tHit.hits = tHit.hits + 2 Else tHit.hits = tHit.hits + 1 End If Else If chkSoundex.Value = vbChecked Then sXName = SoundEx(tINGS!name) If sXName <> "00000" And sXName = sXOri Then tHit.hits = tHit.hits + 2 Else tHit.hits = tHit.hits + 0.5 End If Else tHit.hits = tHit.hits + 0.5 End If End If Else If LCase$(Left$(sName, 3)) = LCase$(Left$(tINGS!name, 3)) Then If LCase$(Left$(sOrig, 3)) = LCase$(Left$(tINGS!name, 3)) Then tHit.hits = tHit.hits + 2 Else tHit.hits = tHit.hits + 1.5 End If Else tHit.hits = tHit.hits + 1 End If End If With oCol .Remove sKey .Add tHit, sKey End With tINGS.MoveNext Loop End If Next iName End Sub The following is a list of text it needs to match and the results with their relevant position and hits: EXAMPLE 1: To match: pepper, red, lg Found: 1. bell pepper, roasted - 2 2. chili pepper, crushed - 2 .... 11. red pepper, crushed - 2 Problem: All possibilities ha a hit of 2 but the most logical match is at position 11. This should have a higher hit. EXAMPLE 2: To match: pepperoncinis Found: 1. Pepper Mix - 4 2. Pepperoncini - 4 3. Pepperoni Sausage - 4 Problem. Same as example 1. EXAMPLE 3: roasted onions Found: 1. Roasted Green Chiles -3 2. Roasted Pecans - 3 3. Roasted Root - 3 4. Almond, roasted - 2 .... 38. Onion, roasted - 2 Problem. Logical match is at position 38!!! If somehow can suggest further rules or changes to the matching system... well... I'd really appreciate it!! Thanks, Ivan "Ivan Debono" <ivanm***@hotmail.com> wrote <snipped for brievity> > > This is a followup to my previous post entitled 'Designing an Expert > System'. The purpose is to parse a line of text and determine which is the > best match from a list. In this case I need to parse ingredient lines. > If somehow can suggest further rules or changes to the matching system... I typed in a suggestion days ago but then deleted it because there were> well... I'd really appreciate it!! too many loop holes, but I am going to repeat it today, just in case it inspires others thoughts to help resolve the problem. What I suggested was to keep an indgrediant table with two fields, one for named ingrediants, and another for common variations: Ingrediant: cilantro Variations: |Cilantro -- finely chopped|Leaves of Cilantro -- optional|Chopped cilantro Now when you get your listing, you can ask the DB to check for a match in the variations text and if you get a result back, that probably is your ingrediant. If don't get any hits, then you could go word by word looking for a match in the ingrediant field. Again, if you find a match, that may be your ingrediant. You could then add that new variation to the variation field so if you see it again, there would be a match.... Trouble arises when the ingrediant is actully a combination of other ingrediants like 'chocolate milk', or 'strawberry yogurt'. As I said there were a few loopholes, but it might be another way to attack that problem.... LFS
Show quote
Hide quote
"Larry Serflaten" <serfla***@usinternet.com> schrieb im Newsbeitrag optional|Chopped cilantronews:u6WjUNyzFHA.268@TK2MSFTNGP09.phx.gbl... > > "Ivan Debono" <ivanm***@hotmail.com> wrote > > > > This is a followup to my previous post entitled 'Designing an Expert > > System'. The purpose is to parse a line of text and determine which is the > > best match from a list. In this case I need to parse ingredient lines. > > <snipped for brievity> > > > If somehow can suggest further rules or changes to the matching system... > > well... I'd really appreciate it!! > > I typed in a suggestion days ago but then deleted it because there were > too many loop holes, but I am going to repeat it today, just in case it > inspires others thoughts to help resolve the problem. > > What I suggested was to keep an indgrediant table with two fields, one > for named ingrediants, and another for common variations: > > Ingrediant: cilantro > Variations: |Cilantro -- finely chopped|Leaves of Cilantro -- This is exactly what I have (or had). The variations table included a memo field with a concatenated string just like the one above. I restructered the table to be a 1-to-many relationship with the main ingredients table. It solves lots of problems in other places. > Now when you get your listing, you can ask the DB to check for a match in ingrediant. Ifthe > variations text and if you get a result back, that probably is your > don't get any hits, then you could go word by word looking for a match in ingrediant. You couldthe > ingrediant field. Again, if you find a match, that may be your > then add that new variation to the variation field so if you see it again, When I'm parsing the text I:there would > be a match.... 1. Check the original table. 2. If nothing is found, I check the variations table 3. If nothing is found there, I use the learning routine to determine a possible match with the highest hit and add it to the variations table with an unlearned flag One might wonder why is step 3 necessary. The reason is simple. The match step 3 tried to find (based on the code in the original post), could or could not be the correct (or logical or sensible) match. This is because of the loopholes that you mentioned just below and other inperfections in the algorithm. After the parsing is done, I fire up a little tool that lists the unlearned variations on one side and the original ingredients table on the other. For every unlearned record, I perform the same algorithm to list all possible matches this time, highlighting the ingredient matched in step 3 above. I can then change the variation record to match the correct record from the main table. > Trouble arises when the ingrediant is actully a combination of other Hmmm... and I'm interested in the other way or ways!!ingrediants > like 'chocolate milk', or 'strawberry yogurt'. As I said there were a few loopholes, > but it might be another way to attack that problem.... Ivan
Is there a more elegant way to do this?
Command Button Sequence populating combobox Update Workstations with new executable Compile, Save, Exit, ReStart, big trouble Problem connecting to remote MySQL DB from VB6 How many ORs can you have in ADODB Need to know when a process is exited Best way to have a subroutine modify more than one variable Scheduling Control |
|||||||||||||||||||||||