Home All Groups Group Topic Archive Search About

How to manage a large dictionary of words?

Author
18 Aug 2010 2:33 PM
Larry Serflaten
I am looking for ideas on concise methods to store and lookup
words.  The plan is to include only words of 3 letters up to 8
letters in length.

The obvious move is to break them out by size and build
a big long string for each size, on which Instr() might be used.

Then I am thinking, I only need 6 bits per letter which would
allow me to pack letters together to reduce total bytes used.

The key here is that I can take all day to build the list but its
the lookup, finding if a user's word is present, that I want to
be quick about.

I'd like to include as many english words 3-8 letters in length,
as I can within a reasonable amount of space.

Does anyone have any insight as to what works good in this
situation?

LFS

Author
18 Aug 2010 2:46 PM
Dee Earley
On 18/08/2010 15:33, Larry Serflaten wrote:
Show quoteHide quote
> I am looking for ideas on concise methods to store and lookup
> words.  The plan is to include only words of 3 letters up to 8
> letters in length.
>
> The obvious move is to break them out by size and build
> a big long string for each size, on which Instr() might be used.
>
> Then I am thinking, I only need 6 bits per letter which would
> allow me to pack letters together to reduce total bytes used.
>
> The key here is that I can take all day to build the list but its
> the lookup, finding if a user's word is present, that I want to
> be quick about.
>
> I'd like to include as many english words 3-8 letters in length,
> as I can within a reasonable amount of space.
>
> Does anyone have any insight as to what works good in this
> situation?

Have a look at Binary trees.
They're normally very quick at determining existence of an entry or not.

--
Dee Earley (dee.ear***@icode.co.uk)
i-Catcher Development Team

iCode Systems

(Replies direct to my email address will be ignored.
Please reply to the group.)
Author
18 Aug 2010 2:52 PM
Kevin Provance
Show quote Hide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4gqtm$8jj$1@news.eternal-september.org...
:I am looking for ideas on concise methods to store and lookup
: words.  The plan is to include only words of 3 letters up to 8
: letters in length.
:
: The obvious move is to break them out by size and build
: a big long string for each size, on which Instr() might be used.
:
: Then I am thinking, I only need 6 bits per letter which would
: allow me to pack letters together to reduce total bytes used.
:
: The key here is that I can take all day to build the list but its
: the lookup, finding if a user's word is present, that I want to
: be quick about.
:
: I'd like to include as many english words 3-8 letters in length,
: as I can within a reasonable amount of space.
:
: Does anyone have any insight as to what works good in this
: situation?

For speed, would storing words in a binary file and using Seek for lookup be
quicker than storing everything in memory?  I wonder if XML might be a way
to go here.  In the end it all depends on how many words were talking about.
I'm assuming tens of thousands?
Author
18 Aug 2010 3:10 PM
Horst Heinrich Dittgens
I once read about a method to store items with identical characters at the
beginning, it worked like that:

strings:   autocad,automobile,automotive,autumn...   36 char
stored as: /auto#cad&mo#bile#tive/autumn...          29 char

where
'&' = concatenate parent with next token to a new parent
'#' = concatenate parent with token to a new entry
'/' = define new parent

'auto' is the (first) parent
'cad' gives the first word 'autocad'
'mo' expands the parent to 'automo'
'bile' gives the next word 'automobile'
'tive' gives a second word 'automotive'
'autumn' replaces the current parent 'auto'

The method is useful for words with identical beginning characters,
especially useful for long words. With only short words savings decreases,
but it can be limited to zero so that the resulting partial string will not
be longer than a clear list. Savings increase highly for lists with many
identical prefixes.


Another method might be to build strings of item lists and to zip these long
strings and to unzip as needed, but I didn't test it and can nothing say
about savings and speed.
Author
18 Aug 2010 4:10 PM
Larry Serflaten
"Horst Heinrich Dittgens" <hh***@sofort-mail.de> wrote
> I once read about a method to store items with identical characters at the
> beginning, it worked like that:

Interesting compression idea, but I would have to wonder how easy it would
be to find a word from some large list like that.

Also, it would seem to me that 'au' would be the first parent, such that other
words (auburn, auction, audio, audit, ...) could be broken up in like manner.

Then do you break up words, down to a single letter?

/au#burn#ction&d#io#it&g#er#ust

auburn, auction, audio audit, auger, august, ...

And, how do you fit the plurals and other endings;  eg; audits, audition
without breaking the pattern?

It seems a bit complicated, do you have a link for more info?

Thanks
LFS
Author
18 Aug 2010 4:55 PM
Horst Heinrich Dittgens
> Interesting compression idea, but I would have to wonder how easy it would
> be to find a word from some large list like that.

You would have to build an index of the parents then you can start seeking
in the middle of the string. For example, you can build an index for the
first parent at each character in the alphabet, and then a subindex for
selected parents beginning with theese characters.

> Also, it would seem to me that 'au' would be the first parent, such that
> other
> words (auburn, auction, audio, audit, ...) could be broken up in like
> manner.
>
> Then do you break up words, down to a single letter?
>
> /au#burn#ction&d#io#it&g#er#ust
>
> auburn, auction, audio audit, auger, august, ...

You can/have to define the parent lengths as you would like, depending on
the items you want to list. If you have a couple of words beginning with the
same 4 charcaters you would take a 4 char parent, and it the next words in
your list have only 3 common characters you can use 3 for theese ones.

> And, how do you fit the plurals and other endings;  eg; audits, audition
> without breaking the pattern?

I'm not sure what you mean. The method is just made for having long prefixes
with changing extensions.

For practical use you might extend the '/' separator to be able to switch
back different levels in the parent hirarchy, just like "\..\..\" works in
DOS pathnames.


> It seems a bit complicated, do you have a link for more info?

Sorry, I do not have a link, this was from an essay in the 90s I've read
where the principle was described only.

The real creating process is a little bit sophisticated because of some kind
of iteration (you must first collect a number of words to be able to build
the abbreviated string), but the read out process is easy (due to only few
and simple rules) and very fast if you combine the parents with an indexing
method.

The advantage against an usual binary tree is that you do not have a lot of
pointers which increase the number of bytes to store.
Author
18 Aug 2010 5:11 PM
Jim Mack
Larry Serflaten wrote:
> "Horst Heinrich Dittgens" <hh***@sofort-mail.de> wrote
>> I once read about a method to store items with identical
>> characters at the beginning, it worked like that:
>
> Interesting compression idea, but I would have to wonder how easy
> it would be to find a word from some large list like that.

An old friend (Neil Rubenking) used this technique in a program called
NameGram that did anagramming of arbitrary input based on a stored
dictionary. It was damned fast, probably written in Turbo Pascal.

If you can find a copy, it comes with a de- and re-compressor for the
dictionary so the user can edit the list to suit. Having both the
clear and compressed text for a known sample file would probably
reveal the method pretty quickly.

Neil was an editor at PC Magazine at the time, I think, so it was
probably published through them and might be in their archives.

--
   Jim Mack
   Twisted tees at http://www.cafepress.com/2050inc
   "We sew confusion"
Author
18 Aug 2010 6:50 PM
Larry Serflaten
"Jim Mack" <no-uce-***@mdxi.com> wrote

> If you can find a copy, it comes with a de- and re-compressor for the
> dictionary so the user can edit the list to suit. Having both the
> clear and compressed text for a known sample file would probably
> reveal the method pretty quickly.
>
> Neil was an editor at PC Magazine at the time, I think, so it was
> probably published through them and might be in their archives.

I did find a copy, a DOS executable.  As you say, it may take a
bit of experimentation to see the method used.

FYI:  http://sanfransys.com/main_downloads.htm
(second from bottom...)

Thanks.
LFS
Author
18 Aug 2010 8:28 PM
Jim Mack
Larry Serflaten wrote:
> "Jim Mack" wrote:
>
>> If you can find a copy, it comes with a de- and re-compressor for
>> the dictionary so the user can edit the list to suit. Having both
>> the clear and compressed text for a known sample file would
>> probably reveal the method pretty quickly.
>
> I did find a copy, a DOS executable.  As you say, it may take a
> bit of experimentation to see the method used.
>
> FYI:  http://sanfransys.com/main_downloads.htm
> (second from bottom...)


Ha, sweet. The Turnip King would chuckle at the thought that a 25 year
old shareware program was still generating some interest.

I've had a copy of NameGram somewhere on my system since the days of
DOS. I manage to find a use for it at least once a year.

--
   Jim Mack
   Twisted tees at http://www.cafepress.com/2050inc
   "We sew confusion"
Author
18 Aug 2010 3:36 PM
Phil Hunt
If the words are in a sorted list, searching is pertty fast no matter how.
Of cousre, binary search is still fastest.

Show quoteHide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4gqtm$8jj$1@news.eternal-september.org...
>I am looking for ideas on concise methods to store and lookup
> words.  The plan is to include only words of 3 letters up to 8
> letters in length.
>
> The obvious move is to break them out by size and build
> a big long string for each size, on which Instr() might be used.
>
> Then I am thinking, I only need 6 bits per letter which would
> allow me to pack letters together to reduce total bytes used.
>
> The key here is that I can take all day to build the list but its
> the lookup, finding if a user's word is present, that I want to
> be quick about.
>
> I'd like to include as many english words 3-8 letters in length,
> as I can within a reasonable amount of space.
>
> Does anyone have any insight as to what works good in this
> situation?
>
> LFS
>
>
Author
18 Aug 2010 3:50 PM
Larry Serflaten
"Phil Hunt" <a**@aaa.com> wrote

> If the words are in a sorted list, searching is pertty fast no matter how.
> Of cousre, binary search is still fastest.

Thanks, as Dee also indicated a binary chop would go quickly, but
I was still interested in storage ideas.  If it ends up paging to and
from the disk, I'd expect that to slow things down more than
the search algorithm.

As I said, I got all day to get the list in order, its the lookup
that needs to be addressed.  Part of that involves the hit or
miss of what's in memory, or not....

LFS
Author
18 Aug 2010 4:07 PM
Kevin Provance
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4gve3$k1b$1@news.eternal-september.org...
: Thanks, as Dee also indicated a binary chop would go quickly, but
: I was still interested in storage ideas.  If it ends up paging to and
: from the disk, I'd expect that to slow things down more than
: the search algorithm.
:
: As I said, I got all day to get the list in order, its the lookup
: that needs to be addressed.  Part of that involves the hit or
: miss of what's in memory, or not....

Again, for my customer database, I use XML (formerly I used DAO, but moved
away from it).  We're talking tens of thousands of entries.  There are two
ways to read it tho, DOM and SAX.  SAX is better for larger files as DOM
loads the entire file into memory, whereas SAX reads the XML file line by
line - where one can do filtering and the like.  It's blazingly fast, IMO.
Author
18 Aug 2010 3:55 PM
Stan Weiss
If the binary search is done correctly going from 512k enters to 1024k
enters will only add one more iteration threw the search loop.

Phil Hunt wrote:
Show quoteHide quote
>
> If the words are in a sorted list, searching is pertty fast no matter how.
> Of cousre, binary search is still fastest.
>
> "Larry Serflaten" <serfla***@gmail.com> wrote in message
> news:i4gqtm$8jj$1@news.eternal-september.org...
> >I am looking for ideas on concise methods to store and lookup
> > words.  The plan is to include only words of 3 letters up to 8
> > letters in length.
> >
> > The obvious move is to break them out by size and build
> > a big long string for each size, on which Instr() might be used.
> >
> > Then I am thinking, I only need 6 bits per letter which would
> > allow me to pack letters together to reduce total bytes used.
> >
> > The key here is that I can take all day to build the list but its
> > the lookup, finding if a user's word is present, that I want to
> > be quick about.
> >
> > I'd like to include as many english words 3-8 letters in length,
> > as I can within a reasonable amount of space.
> >
> > Does anyone have any insight as to what works good in this
> > situation?
> >
> > LFS
> >
> >
Author
18 Aug 2010 4:26 PM
Tom Shelton
Stan Weiss presented the following explanation :
> If the binary search is done correctly going from 512k enters to 1024k
> enters will only add one more iteration threw the search loop.
>
True - but, hash table look ups are usually constant time :)  So, if it
went from 512K to 10024K - the look up time would be the same.

--
Tom Shelton
Author
18 Aug 2010 4:02 PM
Tom Shelton
Larry Serflaten formulated the question :
Show quoteHide quote
> I am looking for ideas on concise methods to store and lookup
> words.  The plan is to include only words of 3 letters up to 8
> letters in length.
>
> The obvious move is to break them out by size and build
> a big long string for each size, on which Instr() might be used.
>
> Then I am thinking, I only need 6 bits per letter which would
> allow me to pack letters together to reduce total bytes used.
>
> The key here is that I can take all day to build the list but its
> the lookup, finding if a user's word is present, that I want to
> be quick about.
>
> I'd like to include as many english words 3-8 letters in length,
> as I can within a reasonable amount of space.
>
> Does anyone have any insight as to what works good in this
> situation?
>
> LFS

Just thinking out loud here :)  But, my first inclination, if this list
is to be large would to put it in a database - maybe something like
sqlite...  That's assuming that the list is too large for memory...

Now if you know the list is not so large as to put it in memory, and
that's the exercise, then, creating some sort of hash lookup table
would be probably the most efficient.   In general a hash table, is
going to be slow to insert, but fast to lookup.

--
Tom Shelton
Author
18 Aug 2010 4:31 PM
Larry Serflaten
"Tom Shelton" <tom_shelton@comcast.invalid> wrote

> Now if you know the list is not so large as to put it in memory, and
> that's the exercise, then, creating some sort of hash lookup table
> would be probably the most efficient.   In general a hash table, is
> going to be slow to insert, but fast to lookup.

I was looking to keep it in memory, and I think the hash was
where I was going with packing the letters into 6 bits per letter.
That type of packing would let me store all the 3, 4, and 5 letter
words in an array of Longs which then could be put in order
and fed to a binary search algorithm.  (each element would be
a separate word).

The 6, 7, and 8 letter words could (for ease of use) all use 6
bytes per word and be stored in a byte array, again using a
binary search to find the matching 6 bytes.

I'm not sure what that sort of structure is called, but it would
avoid any collisions a hash function might give.

That is the method I am gravitating toward, until I hear
differently....

Thanks for all the responses!
LFS
Author
18 Aug 2010 4:43 PM
Phil Hunt
Good luck. Let us know when it's done.

Show quoteHide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4h1s9$cvq$1@news.eternal-september.org...
>
> "Tom Shelton" <tom_shelton@comcast.invalid> wrote
>
>> Now if you know the list is not so large as to put it in memory, and
>> that's the exercise, then, creating some sort of hash lookup table
>> would be probably the most efficient.   In general a hash table, is
>> going to be slow to insert, but fast to lookup.
>
> I was looking to keep it in memory, and I think the hash was
> where I was going with packing the letters into 6 bits per letter.
> That type of packing would let me store all the 3, 4, and 5 letter
> words in an array of Longs which then could be put in order
> and fed to a binary search algorithm.  (each element would be
> a separate word).
>
> The 6, 7, and 8 letter words could (for ease of use) all use 6
> bytes per word and be stored in a byte array, again using a
> binary search to find the matching 6 bytes.
>
> I'm not sure what that sort of structure is called, but it would
> avoid any collisions a hash function might give.
>
> That is the method I am gravitating toward, until I hear
> differently....
>
> Thanks for all the responses!
> LFS
>
>
>
>
Author
19 Aug 2010 9:21 AM
Cor
Larry,

I've of course plenty of solutions, but those don't fit you.

Don't think to much currently about room a character takes. It is nothing,
one picture, taken with a modern camera, contains normally more bytes than
the American constitution and all the amendments would contain.

However, I would create a dictionary, and because I don't have that in VB6
use a VB6 collection.

I was not sure if the dictionary exist in VB6 so I checked that, then I
found this, maybe you can use that.

http://www.freevbcode.com/ShowCode.Asp?ID=1582

Cor

Show quoteHide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4gqtm$8jj$1@news.eternal-september.org...
> I am looking for ideas on concise methods to store and lookup
> words.  The plan is to include only words of 3 letters up to 8
> letters in length.
>
> The obvious move is to break them out by size and build
> a big long string for each size, on which Instr() might be used.
>
> Then I am thinking, I only need 6 bits per letter which would
> allow me to pack letters together to reduce total bytes used.
>
> The key here is that I can take all day to build the list but its
> the lookup, finding if a user's word is present, that I want to
> be quick about.
>
> I'd like to include as many english words 3-8 letters in length,
> as I can within a reasonable amount of space.
>
> Does anyone have any insight as to what works good in this
> situation?
>
> LFS
>
>
>
Author
19 Aug 2010 1:24 PM
RDub
Show quote Hide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4gqtm$8jj$1@news.eternal-september.org...
>I am looking for ideas on concise methods to store and lookup
> words.  The plan is to include only words of 3 letters up to 8
> letters in length.
>
> The obvious move is to break them out by size and build
> a big long string for each size, on which Instr() might be used.
>
> Then I am thinking, I only need 6 bits per letter which would
> allow me to pack letters together to reduce total bytes used.
>
> The key here is that I can take all day to build the list but its
> the lookup, finding if a user's word is present, that I want to
> be quick about.
>
> I'd like to include as many english words 3-8 letters in length,
> as I can within a reasonable amount of space.
>
> Does anyone have any insight as to what works good in this
> situation?
>
> LFS
>
Looks like you are gravitating to a custom, one off, LFS system.  I also
often have gotten much joy out of building a custom something that is
exactly the perfect solution for a given task.

You don't mention how many lookups you will be doing when you app is
running, but if it is just a few per second, I'd use a database to store the
words and queries to do the lookups.  I'd be looking at a Jet db and using
ADO or DAO (your choice) to access it.  All of this stuff is available right
off of the shelf, is well proven technology, and already installed on every
Windows OS since Win 2k (maybe even Win98).  Jet is quite fast, and readily
amenable to future schema changes.

Just my $.02

Ron W
Author
19 Aug 2010 10:41 PM
Larry Serflaten
"RDub" <rweinerattrcrentdotcom> wrote

> >I am looking for ideas on concise methods to store and lookup
> > words.  The plan is to include only words of 3 letters up to 8
> > letters in length.
> <...>
> > I'd like to include as many english words 3-8 letters in length,
> > as I can within a reasonable amount of space.

> You don't mention how many lookups you will be doing when you app is
> running, but if it is just a few per second, I'd use a database to store the
> words and queries to do the lookups.


Well, for now I'm looking at a VB solution to build a little game, but it
doesn't hurt to keep an eye on moving it to a smaller form factor, such
as a smart phone or the like.  Not that I've researched that topic to any
degree, I've been thinking I should eventually put my hand in the cookie
jar to see what I can pull out!

<g>
LFS
Author
20 Aug 2010 12:03 AM
Schmidt
"Larry Serflaten" <serfla***@gmail.com> schrieb im Newsbeitrag
news:i4kbts$457$1@news.eternal-september.org...

> Not that I've researched that topic to any degree, I've been
> thinking I should eventually put my hand in the cookie
> jar to see what I can pull out!

This document here describes a nice approach - and it
has nice graphs and simple examples - shouldn't be
that difficult, to implement from the description there.
http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/

Another approach could be (but this depends a bit
on your lookup-frequency), to store compressed
chunks of your (preordered) words - and decompress
on the fly (for the given searchstring, falling into one such
chunk-boundary).
ZLib for example decompresses with a speed of
about 20-30MB/sec - so if a compressed chunk is
20kByte, then you should be able to decompress
such a chunk (always temporarily) into a predefined
area of, let's say 200kByte (to avoid the reallocation
of the destination-buffer) - and such a single
decompression of 20kByte should not take more
than 1 or 2 msec.
The lookup for the concrete string within the
temporarily decompressed chunk should be
doable in less than 1msec.

So, if 500 lookups per second are enough for you
(a Hashtable would provide much higher values),
then this would be relative easy to implement.

The LZ-trie approach in the PDF-link should work
a bit slower, but somewhere in the league of the Hashtable-
approach though (offering also much faster lookups
than only about 500 per second).

Olaf
Author
20 Aug 2010 6:28 AM
Larry Serflaten
"Schmidt" <s**@online.de> wrote

> "Larry Serflaten" <serfla***@gmail.com> schrieb
>
> > Not that I've researched that topic to any degree, I've been
> > thinking I should eventually put my hand in the cookie
> > jar to see what I can pull out!
>
> This document here describes a nice approach - and it
> has nice graphs and simple examples - shouldn't be
> that difficult, to implement from the description there.
> http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/

Thanks!  Just the sort of thing I was looking for;  A way to
achieve fast lookups with significant compression.  It seems to
be a variation on what others already suggested, breaking up the
words into repeating patterns.  I like the linked list structure over
a parsing tokens approach and will be playing with this for days to
come....

LFS
Author
20 Aug 2010 8:38 PM
Jack T.
Here is some information on tries and DAG's if you interested.

http://en.wikipedia.org/wiki/Trie

http://en.wikipedia.org/wiki/Directed_acyclic_word_graph


Show quoteHide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message
news:i4l795$ii3$1@news.eternal-september.org...
>
> "Schmidt" <s**@online.de> wrote
>
>> "Larry Serflaten" <serfla***@gmail.com> schrieb
>>
>> > Not that I've researched that topic to any degree, I've been
>> > thinking I should eventually put my hand in the cookie
>> > jar to see what I can pull out!
>>
>> This document here describes a nice approach - and it
>> has nice graphs and simple examples - shouldn't be
>> that difficult, to implement from the description there.
>> http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/
>
> Thanks!  Just the sort of thing I was looking for;  A way to
> achieve fast lookups with significant compression.  It seems to
> be a variation on what others already suggested, breaking up the
> words into repeating patterns.  I like the linked list structure over
> a parsing tokens approach and will be playing with this for days to
> come....
>
> LFS
>
>