|
code
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
How to manage a large dictionary of words?words. The plan is to include only words of 3 letters up to 8 letters in length. The obvious move is to break them out by size and build a big long string for each size, on which Instr() might be used. Then I am thinking, I only need 6 bits per letter which would allow me to pack letters together to reduce total bytes used. The key here is that I can take all day to build the list but its the lookup, finding if a user's word is present, that I want to be quick about. I'd like to include as many english words 3-8 letters in length, as I can within a reasonable amount of space. Does anyone have any insight as to what works good in this situation? LFS On 18/08/2010 15:33, Larry Serflaten wrote:
Show quoteHide quote > I am looking for ideas on concise methods to store and lookup Have a look at Binary trees.> words. The plan is to include only words of 3 letters up to 8 > letters in length. > > The obvious move is to break them out by size and build > a big long string for each size, on which Instr() might be used. > > Then I am thinking, I only need 6 bits per letter which would > allow me to pack letters together to reduce total bytes used. > > The key here is that I can take all day to build the list but its > the lookup, finding if a user's word is present, that I want to > be quick about. > > I'd like to include as many english words 3-8 letters in length, > as I can within a reasonable amount of space. > > Does anyone have any insight as to what works good in this > situation? They're normally very quick at determining existence of an entry or not. -- Dee Earley (dee.ear***@icode.co.uk) i-Catcher Development Team iCode Systems (Replies direct to my email address will be ignored. Please reply to the group.)
Show quote
Hide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message For speed, would storing words in a binary file and using Seek for lookup be news:i4gqtm$8jj$1@news.eternal-september.org... :I am looking for ideas on concise methods to store and lookup : words. The plan is to include only words of 3 letters up to 8 : letters in length. : : The obvious move is to break them out by size and build : a big long string for each size, on which Instr() might be used. : : Then I am thinking, I only need 6 bits per letter which would : allow me to pack letters together to reduce total bytes used. : : The key here is that I can take all day to build the list but its : the lookup, finding if a user's word is present, that I want to : be quick about. : : I'd like to include as many english words 3-8 letters in length, : as I can within a reasonable amount of space. : : Does anyone have any insight as to what works good in this : situation? quicker than storing everything in memory? I wonder if XML might be a way to go here. In the end it all depends on how many words were talking about. I'm assuming tens of thousands? I once read about a method to store items with identical characters at the
beginning, it worked like that: strings: autocad,automobile,automotive,autumn... 36 char stored as: /auto#cad&mo#bile#tive/autumn... 29 char where '&' = concatenate parent with next token to a new parent '#' = concatenate parent with token to a new entry '/' = define new parent 'auto' is the (first) parent 'cad' gives the first word 'autocad' 'mo' expands the parent to 'automo' 'bile' gives the next word 'automobile' 'tive' gives a second word 'automotive' 'autumn' replaces the current parent 'auto' The method is useful for words with identical beginning characters, especially useful for long words. With only short words savings decreases, but it can be limited to zero so that the resulting partial string will not be longer than a clear list. Savings increase highly for lists with many identical prefixes. Another method might be to build strings of item lists and to zip these long strings and to unzip as needed, but I didn't test it and can nothing say about savings and speed. "Horst Heinrich Dittgens" <hh***@sofort-mail.de> wrote Interesting compression idea, but I would have to wonder how easy it would> I once read about a method to store items with identical characters at the > beginning, it worked like that: be to find a word from some large list like that. Also, it would seem to me that 'au' would be the first parent, such that other words (auburn, auction, audio, audit, ...) could be broken up in like manner. Then do you break up words, down to a single letter? /au#burn#ction&d#io#it&g#er#ust auburn, auction, audio audit, auger, august, ... And, how do you fit the plurals and other endings; eg; audits, audition without breaking the pattern? It seems a bit complicated, do you have a link for more info? Thanks LFS > Interesting compression idea, but I would have to wonder how easy it would You would have to build an index of the parents then you can start seeking > be to find a word from some large list like that. in the middle of the string. For example, you can build an index for the first parent at each character in the alphabet, and then a subindex for selected parents beginning with theese characters. > Also, it would seem to me that 'au' would be the first parent, such that You can/have to define the parent lengths as you would like, depending on > other > words (auburn, auction, audio, audit, ...) could be broken up in like > manner. > > Then do you break up words, down to a single letter? > > /au#burn#ction&d#io#it&g#er#ust > > auburn, auction, audio audit, auger, august, ... the items you want to list. If you have a couple of words beginning with the same 4 charcaters you would take a 4 char parent, and it the next words in your list have only 3 common characters you can use 3 for theese ones. > And, how do you fit the plurals and other endings; eg; audits, audition I'm not sure what you mean. The method is just made for having long prefixes > without breaking the pattern? with changing extensions. For practical use you might extend the '/' separator to be able to switch back different levels in the parent hirarchy, just like "\..\..\" works in DOS pathnames. > It seems a bit complicated, do you have a link for more info? Sorry, I do not have a link, this was from an essay in the 90s I've read where the principle was described only. The real creating process is a little bit sophisticated because of some kind of iteration (you must first collect a number of words to be able to build the abbreviated string), but the read out process is easy (due to only few and simple rules) and very fast if you combine the parents with an indexing method. The advantage against an usual binary tree is that you do not have a lot of pointers which increase the number of bytes to store. Larry Serflaten wrote:
> "Horst Heinrich Dittgens" <hh***@sofort-mail.de> wrote An old friend (Neil Rubenking) used this technique in a program called>> I once read about a method to store items with identical >> characters at the beginning, it worked like that: > > Interesting compression idea, but I would have to wonder how easy > it would be to find a word from some large list like that. NameGram that did anagramming of arbitrary input based on a stored dictionary. It was damned fast, probably written in Turbo Pascal. If you can find a copy, it comes with a de- and re-compressor for the dictionary so the user can edit the list to suit. Having both the clear and compressed text for a known sample file would probably reveal the method pretty quickly. Neil was an editor at PC Magazine at the time, I think, so it was probably published through them and might be in their archives. "Jim Mack" <no-uce-***@mdxi.com> wrote I did find a copy, a DOS executable. As you say, it may take a> If you can find a copy, it comes with a de- and re-compressor for the > dictionary so the user can edit the list to suit. Having both the > clear and compressed text for a known sample file would probably > reveal the method pretty quickly. > > Neil was an editor at PC Magazine at the time, I think, so it was > probably published through them and might be in their archives. bit of experimentation to see the method used. FYI: http://sanfransys.com/main_downloads.htm (second from bottom...) Thanks. LFS Larry Serflaten wrote:
> "Jim Mack" wrote: Ha, sweet. The Turnip King would chuckle at the thought that a 25 year> >> If you can find a copy, it comes with a de- and re-compressor for >> the dictionary so the user can edit the list to suit. Having both >> the clear and compressed text for a known sample file would >> probably reveal the method pretty quickly. > > I did find a copy, a DOS executable. As you say, it may take a > bit of experimentation to see the method used. > > FYI: http://sanfransys.com/main_downloads.htm > (second from bottom...) old shareware program was still generating some interest. I've had a copy of NameGram somewhere on my system since the days of DOS. I manage to find a use for it at least once a year. If the words are in a sorted list, searching is pertty fast no matter how.
Of cousre, binary search is still fastest. Show quoteHide quote "Larry Serflaten" <serfla***@gmail.com> wrote in message news:i4gqtm$8jj$1@news.eternal-september.org... >I am looking for ideas on concise methods to store and lookup > words. The plan is to include only words of 3 letters up to 8 > letters in length. > > The obvious move is to break them out by size and build > a big long string for each size, on which Instr() might be used. > > Then I am thinking, I only need 6 bits per letter which would > allow me to pack letters together to reduce total bytes used. > > The key here is that I can take all day to build the list but its > the lookup, finding if a user's word is present, that I want to > be quick about. > > I'd like to include as many english words 3-8 letters in length, > as I can within a reasonable amount of space. > > Does anyone have any insight as to what works good in this > situation? > > LFS > > "Phil Hunt" <a**@aaa.com> wrote Thanks, as Dee also indicated a binary chop would go quickly, but> If the words are in a sorted list, searching is pertty fast no matter how. > Of cousre, binary search is still fastest. I was still interested in storage ideas. If it ends up paging to and from the disk, I'd expect that to slow things down more than the search algorithm. As I said, I got all day to get the list in order, its the lookup that needs to be addressed. Part of that involves the hit or miss of what's in memory, or not.... LFS "Larry Serflaten" <serfla***@gmail.com> wrote in message Again, for my customer database, I use XML (formerly I used DAO, but moved news:i4gve3$k1b$1@news.eternal-september.org... : Thanks, as Dee also indicated a binary chop would go quickly, but : I was still interested in storage ideas. If it ends up paging to and : from the disk, I'd expect that to slow things down more than : the search algorithm. : : As I said, I got all day to get the list in order, its the lookup : that needs to be addressed. Part of that involves the hit or : miss of what's in memory, or not.... away from it). We're talking tens of thousands of entries. There are two ways to read it tho, DOM and SAX. SAX is better for larger files as DOM loads the entire file into memory, whereas SAX reads the XML file line by line - where one can do filtering and the like. It's blazingly fast, IMO. If the binary search is done correctly going from 512k enters to 1024k
enters will only add one more iteration threw the search loop. Phil Hunt wrote: Show quoteHide quote > > If the words are in a sorted list, searching is pertty fast no matter how. > Of cousre, binary search is still fastest. > > "Larry Serflaten" <serfla***@gmail.com> wrote in message > news:i4gqtm$8jj$1@news.eternal-september.org... > >I am looking for ideas on concise methods to store and lookup > > words. The plan is to include only words of 3 letters up to 8 > > letters in length. > > > > The obvious move is to break them out by size and build > > a big long string for each size, on which Instr() might be used. > > > > Then I am thinking, I only need 6 bits per letter which would > > allow me to pack letters together to reduce total bytes used. > > > > The key here is that I can take all day to build the list but its > > the lookup, finding if a user's word is present, that I want to > > be quick about. > > > > I'd like to include as many english words 3-8 letters in length, > > as I can within a reasonable amount of space. > > > > Does anyone have any insight as to what works good in this > > situation? > > > > LFS > > > > Stan Weiss presented the following explanation :
> If the binary search is done correctly going from 512k enters to 1024k True - but, hash table look ups are usually constant time :) So, if it > enters will only add one more iteration threw the search loop. > went from 512K to 10024K - the look up time would be the same. -- Tom Shelton Larry Serflaten formulated the question :
Show quoteHide quote > I am looking for ideas on concise methods to store and lookup Just thinking out loud here :) But, my first inclination, if this list > words. The plan is to include only words of 3 letters up to 8 > letters in length. > > The obvious move is to break them out by size and build > a big long string for each size, on which Instr() might be used. > > Then I am thinking, I only need 6 bits per letter which would > allow me to pack letters together to reduce total bytes used. > > The key here is that I can take all day to build the list but its > the lookup, finding if a user's word is present, that I want to > be quick about. > > I'd like to include as many english words 3-8 letters in length, > as I can within a reasonable amount of space. > > Does anyone have any insight as to what works good in this > situation? > > LFS is to be large would to put it in a database - maybe something like sqlite... That's assuming that the list is too large for memory... Now if you know the list is not so large as to put it in memory, and that's the exercise, then, creating some sort of hash lookup table would be probably the most efficient. In general a hash table, is going to be slow to insert, but fast to lookup. -- Tom Shelton "Tom Shelton" <tom_shelton@comcast.invalid> wrote I was looking to keep it in memory, and I think the hash was> Now if you know the list is not so large as to put it in memory, and > that's the exercise, then, creating some sort of hash lookup table > would be probably the most efficient. In general a hash table, is > going to be slow to insert, but fast to lookup. where I was going with packing the letters into 6 bits per letter. That type of packing would let me store all the 3, 4, and 5 letter words in an array of Longs which then could be put in order and fed to a binary search algorithm. (each element would be a separate word). The 6, 7, and 8 letter words could (for ease of use) all use 6 bytes per word and be stored in a byte array, again using a binary search to find the matching 6 bytes. I'm not sure what that sort of structure is called, but it would avoid any collisions a hash function might give. That is the method I am gravitating toward, until I hear differently.... Thanks for all the responses! LFS Good luck. Let us know when it's done.
Show quoteHide quote "Larry Serflaten" <serfla***@gmail.com> wrote in message news:i4h1s9$cvq$1@news.eternal-september.org... > > "Tom Shelton" <tom_shelton@comcast.invalid> wrote > >> Now if you know the list is not so large as to put it in memory, and >> that's the exercise, then, creating some sort of hash lookup table >> would be probably the most efficient. In general a hash table, is >> going to be slow to insert, but fast to lookup. > > I was looking to keep it in memory, and I think the hash was > where I was going with packing the letters into 6 bits per letter. > That type of packing would let me store all the 3, 4, and 5 letter > words in an array of Longs which then could be put in order > and fed to a binary search algorithm. (each element would be > a separate word). > > The 6, 7, and 8 letter words could (for ease of use) all use 6 > bytes per word and be stored in a byte array, again using a > binary search to find the matching 6 bytes. > > I'm not sure what that sort of structure is called, but it would > avoid any collisions a hash function might give. > > That is the method I am gravitating toward, until I hear > differently.... > > Thanks for all the responses! > LFS > > > > Larry,
I've of course plenty of solutions, but those don't fit you. Don't think to much currently about room a character takes. It is nothing, one picture, taken with a modern camera, contains normally more bytes than the American constitution and all the amendments would contain. However, I would create a dictionary, and because I don't have that in VB6 use a VB6 collection. I was not sure if the dictionary exist in VB6 so I checked that, then I found this, maybe you can use that. http://www.freevbcode.com/ShowCode.Asp?ID=1582 Cor Show quoteHide quote "Larry Serflaten" <serfla***@gmail.com> wrote in message news:i4gqtm$8jj$1@news.eternal-september.org... > I am looking for ideas on concise methods to store and lookup > words. The plan is to include only words of 3 letters up to 8 > letters in length. > > The obvious move is to break them out by size and build > a big long string for each size, on which Instr() might be used. > > Then I am thinking, I only need 6 bits per letter which would > allow me to pack letters together to reduce total bytes used. > > The key here is that I can take all day to build the list but its > the lookup, finding if a user's word is present, that I want to > be quick about. > > I'd like to include as many english words 3-8 letters in length, > as I can within a reasonable amount of space. > > Does anyone have any insight as to what works good in this > situation? > > LFS > > >
Show quote
Hide quote
"Larry Serflaten" <serfla***@gmail.com> wrote in message Looks like you are gravitating to a custom, one off, LFS system. I also news:i4gqtm$8jj$1@news.eternal-september.org... >I am looking for ideas on concise methods to store and lookup > words. The plan is to include only words of 3 letters up to 8 > letters in length. > > The obvious move is to break them out by size and build > a big long string for each size, on which Instr() might be used. > > Then I am thinking, I only need 6 bits per letter which would > allow me to pack letters together to reduce total bytes used. > > The key here is that I can take all day to build the list but its > the lookup, finding if a user's word is present, that I want to > be quick about. > > I'd like to include as many english words 3-8 letters in length, > as I can within a reasonable amount of space. > > Does anyone have any insight as to what works good in this > situation? > > LFS > often have gotten much joy out of building a custom something that is exactly the perfect solution for a given task. You don't mention how many lookups you will be doing when you app is running, but if it is just a few per second, I'd use a database to store the words and queries to do the lookups. I'd be looking at a Jet db and using ADO or DAO (your choice) to access it. All of this stuff is available right off of the shelf, is well proven technology, and already installed on every Windows OS since Win 2k (maybe even Win98). Jet is quite fast, and readily amenable to future schema changes. Just my $.02 Ron W "RDub" <rweinerattrcrentdotcom> wrote Well, for now I'm looking at a VB solution to build a little game, but it> >I am looking for ideas on concise methods to store and lookup > > words. The plan is to include only words of 3 letters up to 8 > > letters in length. > <...> > > I'd like to include as many english words 3-8 letters in length, > > as I can within a reasonable amount of space. > You don't mention how many lookups you will be doing when you app is > running, but if it is just a few per second, I'd use a database to store the > words and queries to do the lookups. doesn't hurt to keep an eye on moving it to a smaller form factor, such as a smart phone or the like. Not that I've researched that topic to any degree, I've been thinking I should eventually put my hand in the cookie jar to see what I can pull out! <g> LFS "Larry Serflaten" <serfla***@gmail.com> schrieb im Newsbeitrag This document here describes a nice approach - and itnews:i4kbts$457$1@news.eternal-september.org... > Not that I've researched that topic to any degree, I've been > thinking I should eventually put my hand in the cookie > jar to see what I can pull out! has nice graphs and simple examples - shouldn't be that difficult, to implement from the description there. http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/ Another approach could be (but this depends a bit on your lookup-frequency), to store compressed chunks of your (preordered) words - and decompress on the fly (for the given searchstring, falling into one such chunk-boundary). ZLib for example decompresses with a speed of about 20-30MB/sec - so if a compressed chunk is 20kByte, then you should be able to decompress such a chunk (always temporarily) into a predefined area of, let's say 200kByte (to avoid the reallocation of the destination-buffer) - and such a single decompression of 20kByte should not take more than 1 or 2 msec. The lookup for the concrete string within the temporarily decompressed chunk should be doable in less than 1msec. So, if 500 lookups per second are enough for you (a Hashtable would provide much higher values), then this would be relative easy to implement. The LZ-trie approach in the PDF-link should work a bit slower, but somewhere in the league of the Hashtable- approach though (offering also much faster lookups than only about 500 per second). Olaf "Schmidt" <s**@online.de> wrote Thanks! Just the sort of thing I was looking for; A way to> "Larry Serflaten" <serfla***@gmail.com> schrieb > > > Not that I've researched that topic to any degree, I've been > > thinking I should eventually put my hand in the cookie > > jar to see what I can pull out! > > This document here describes a nice approach - and it > has nice graphs and simple examples - shouldn't be > that difficult, to implement from the description there. > http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/ achieve fast lookups with significant compression. It seems to be a variation on what others already suggested, breaking up the words into repeating patterns. I like the linked list structure over a parsing tokens approach and will be playing with this for days to come.... LFS Here is some information on tries and DAG's if you interested.
http://en.wikipedia.org/wiki/Trie http://en.wikipedia.org/wiki/Directed_acyclic_word_graph Show quoteHide quote "Larry Serflaten" <serfla***@gmail.com> wrote in message news:i4l795$ii3$1@news.eternal-september.org... > > "Schmidt" <s**@online.de> wrote > >> "Larry Serflaten" <serfla***@gmail.com> schrieb >> >> > Not that I've researched that topic to any degree, I've been >> > thinking I should eventually put my hand in the cookie >> > jar to see what I can pull out! >> >> This document here describes a nice approach - and it >> has nice graphs and simple examples - shouldn't be >> that difficult, to implement from the description there. >> http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf/ > > Thanks! Just the sort of thing I was looking for; A way to > achieve fast lookups with significant compression. It seems to > be a variation on what others already suggested, breaking up the > words into repeating patterns. I like the linked list structure over > a parsing tokens approach and will be playing with this for days to > come.... > > LFS > >
Re: How to get the File Properties
Latest VB6 Cumulative Update? GDI+ and resizing a PictureBox KBasic .NUT Fail .Net slipped itself in while I wasn't looking! Specify DLL version when using mc.exe, rc.exe and/or link.exe Callback function Best Approach (psuedocode) for summing structure element values Re: Problem with Direct Sound / VB6 |
|||||||||||||||||||||||