Text files

I have a text file that is going to have a couple million lines or so in it. I want to be able to search for a particular number (that number can be in the file more than once), get the 10 lines before and the 10 lines after each match, and write them to a file. I have no problem if I am just searching for the one line and returning that line. Should I read it all into memory and then check it, or open the file and read each line at a time?

If someone can give me a code sample I would appreciate it.

Thanks
Ed

> I have a text file that is going to have a couple million lines
> or so in it. I want to be able to search for a particular number
> (that number can be in the file more than once) and then get
> the 10 lines before and the 10 lines after that and write that
> to a file. I have no problem if I am just searching for the one
> line and returning that line. Should I read it all into memory and
> then check it or open the file and read each line at a time?

It kind of depends. What do you mean by a "line" in your file? Are all your lines single "sentences" terminated by a "newline" character sequence? Or are they multiple-sentence paragraphs, again terminated by a "newline" character sequence? Or are you looking for each sentence even if there is more than one of them in a paragraph? I ask because a Line Input statement (what I'm guessing you are referring to when you say "read each line") uses the "newline" character sequence to decide what a line is.

Also, to decide whether to read the entire file into memory before parsing it, you should tell us how large (in megabytes) your average and largest files are (that is, are your lines short and we are talking about 20 to 50 Megs, or are they large and we are talking about half a Gig, or what)?

Rick

This is basically a log file with errors in it. Each error is on a single line. I am using Input # right now and I can find the specific line, but I am having a problem getting the 10 lines before and the 10 lines after. The file can be about 100 megs or so.

"Ed Wyche" <d***@care.com> wrote
> This is basically a log file with errors in it. Each error is on a single
> line. I am using Input # right now and I can find the specific line but I
> am having a problem getting the 10 lines before and the 10 lines after. The
> file can be about 100 megs or so.

Now, what do you want to do with those 10 lines?

If it were me, I'd set up a (FIFO) queue that holds 10 lines. I'd read each line, check it for inclusion, and place it in the queue. Once the queue is 10 lines long, each addition to the queue also needs to pull one off the queue. If the current line is supposed to be 'kept', I'd start a countdown variable that counts down from 20 and save the next 21 lines as they are removed from the queue.

That would mean the entire file would be read a line at a time and all the lines would pass through the queue. It might be a relatively long process, but perhaps under a few minutes.

The queue could be a Collection, and you would fill it by adding the lines to the collection. If the Count is greater than 10 after you add a new line, then the line stored in the first position has to be removed, etc...

It would be easy to set up, try it and see!

LFS

What I need is to keep the 10 lines before the inclusion, the inclusion itself, and the 10 lines after the inclusion, and save them to a file. There is going to be more than one, so it will have to go all the way through the file.

"Ed Wyche" <d***@care.com> wrote
> What I need is to keep the 10 lines before the inclusion and the 10 lines
> after the inclusion and the inclusion saved to a file. There is going to be
> more than one so it will have to go all the way through the file.

Precisely. When you find a line that needs saving, there will be 10 lines in the queue (unless that particular line is one of the first 10 in the file), so you start a counter down from 21 and save all the lines that come off the queue until the counter is at 0. That would save the 10 lines ahead of the included line, the included line itself, plus the 10 lines following the included line.

In simple terms, you basically cycle all the lines into and out of the queue. When a line needs saving, you set a flag to indicate that the lines coming out of the queue need to be saved. That flag is a counter that determines how many lines to save. Since you have 10 lines in the queue when you find a line, you'll get 10 lines saved ahead of the line you found. After you set the counter to 21, you save every line that comes off the queue until that counter reaches 0. So that's 10 ahead, the line itself, and 10 after: (21) that is the group you're looking for....

LFS

Can you give me an example please?
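
The queue approach described above might be sketched in VB6 roughly as follows. This is only an illustration under stated assumptions: the routine name SaveContext and its parameters (inPath, outPath, target, nContext) are made up for the example, and a line is treated as a match when it contains the search text anywhere (InStr). It also departs from the fixed count of 21 in one respect: the counter is set to the number of lines currently queued, plus the matching line, plus nContext, so that a match within the first few lines of the file does not save too many following lines.

```vb
' Sketch only: SaveContext, its parameters and the InStr match test
' are illustrative assumptions, not code from the thread.
Private Sub SaveContext(ByVal inPath As String, ByVal outPath As String, _
        ByVal target As String, ByVal nContext As Long)
    Dim q As New Collection      ' FIFO queue holding the most recent lines
    Dim fIn As Integer, fOut As Integer
    Dim lineText As String
    Dim toSave As Long           ' lines still to be written as they leave the queue

    fIn = FreeFile
    Open inPath For Input As #fIn
    fOut = FreeFile              ' called after the first Open so the numbers differ
    Open outPath For Output As #fOut

    Do While Not EOF(fIn)
        Line Input #fIn, lineText
        If InStr(lineText, target) > 0 Then
            ' queued lines (at most nContext) + this line + nContext after it
            toSave = q.Count + 1 + nContext
        End If
        q.Add lineText
        If q.Count > nContext Then
            ' the oldest line leaves the queue; write it if it is wanted
            If toSave > 0 Then
                Print #fOut, q(1)
                toSave = toSave - 1
            End If
            q.Remove 1
        End If
    Loop

    ' End of file: write whatever is still queued and still wanted
    Do While q.Count > 0 And toSave > 0
        Print #fOut, q(1)
        toSave = toSave - 1
        q.Remove 1
    Loop

    Close #fOut
    Close #fIn
End Sub
```

A call such as SaveContext "c:\myfile.txt", "c:\found.txt", "12345", 10 (paths and search text are placeholders) would make one pass through the file. Because every line is written only as it leaves the queue, matches that sit close together produce one merged group rather than repeated lines.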
"Ed Wyche" <whocares> wrote
> Can you give me an example please.

It looks like you already have something going with Mike. Perhaps that is a better route....

LFS

"Ed Wyche" <d***@care.com> wrote in message news:etSac%23MKGHA.2712@TK2MSFTNGP10.phx.gbl...
> This is basically a log file with errors in it. Each error is on a single
> line. I am using Input # right now and I can find the specific line but
> I am having a problem getting the 10 lines before . . .

If I were you I would keep a continuous record of the last dozen or so lines read and keep reading new lines until you find the one that interests you. For example, you could have a simple string array (0 to 11) which you could visualise as a 12 hour clock, but numbered from 0 to 11 instead of 1 to 12. Just dump each line as it is read into the appropriate element of the array, each time moving to the next digit around the clock. Keep a running record of the current line number and the "clock" digit. Then when you find a line that interests you it is a simple matter to look back through the array to pick out the ten or so preceding lines and then run a simple loop to pick out the ten or so following lines.

Here's how I would do the first bit (reading lines and shoving them into the "clock" array until I find one that interests me):

    Private Sub Command1_Click()
        Dim store(0 To 11) As String   ' circular "clock" buffer of the last 12 lines
        Dim fn As Long, pointer As Long, linenumber As Long
        fn = FreeFile
        pointer = -1 ' initialise pointer
        Open "c:\myfile.txt" For Input As fn
        Do
            linenumber = linenumber + 1
            pointer = (pointer + 1) Mod 12   ' next slot, wrapping from 11 back to 0
            Line Input #fn, store(pointer)
        Loop Until store(pointer) = "looking for this"
        MsgBox Format(linenumber) & " " & Format(pointer) _
            & " " & Format(store(pointer))
        Close fn
    End Sub

Mike

OK, I can see how that works. Now how do I get the results? If the item I want to find is at pointer 5, that means I need to return 5, 4, 3, 2, 1, 0, 11, 10, 9, 8, 7, 6, in this order. How do I go about that? Also, I am looking at making the number of lines to return a variable.

"Ed Wyche" <whocares> wrote in message news:esXrEvRKGHA.516@TK2MSFTNGP15.phx.gbl...
> Ok I can see how that works, Now how do I get the results is the
> item I want to find is at pointer 5 that means I need to return 5, 4,
> 3, 2, 1, 0, 11, 10, 9, 8, 7, 6 in this order. How do I go about that.

Again you can use the Mod operator or you can simply decrement the value and test for less than zero. The former is neater, but the latter is easier to understand. Try this:

    pointer = 5
    For n = 1 To 12 ' or however many lines you want
        Print store(pointer)
        pointer = pointer - 1
        If pointer < 0 Then pointer = 11
    Next n

Mike

Thank you. I had to do some modifications to get it to work the way I wanted it to, but it works. Thanks for the help. Now another question: how can I make a dynamic array?

"Ed Wyche" <whocares> wrote in message news:O8HRQggKGHA.964@tk2msftngp13.phx.gbl...
> Now another question how can I make a dymanic array?

You create a dynamic array by first declaring it without specifying the number of elements and then later using the ReDim statement to specify how many elements you want (or to later change the number of elements you want). For example:

    Dim store() As String

. . . then at some point in your code:

    ReDim store(0 To 11)

. . . then perhaps later:

    ReDim store(0 To 23)

. . . or if you want to preserve the existing contents:

    ReDim Preserve store(0 To 23)

By the way, you can make the code run about four or five times faster by using a Byte array instead of strings and by getting the file (either in its entirety or in large chunks) into the Byte array and examining it there. This method would of course require a lot more code to check the data, but it's worth considering if speed is your main concern. However, the existing string code might already run fast enough for you, and it is very much easier to deal with.

I've just tried it on a 60 megabyte sample data file containing 1,000,000 strings of variable length, and on my system it loads and checks the individual lines in about two to four seconds, depending on whether you are doing a binary or text compare. If that's fast enough for you then you'd probably be better off sticking with the simplicity of the string method.

Mike
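
Putting the "clock" array and ReDim together, a variable number of lines before and after each match could be handled along these lines. This is a rough sketch rather than the code Ed ended up with: ExtractWithRing and its parameters are hypothetical names, a match is assumed to mean "the line contains the search text", and instead of a separate pointer variable it uses the line number Mod the array size as the slot, which amounts to the same walk around the clock.

```vb
' Sketch only: ExtractWithRing and its parameters are illustrative names.
Private Sub ExtractWithRing(ByVal inPath As String, ByVal outPath As String, _
        ByVal target As String, ByVal nBefore As Long, ByVal nAfter As Long)
    Dim ring() As String         ' dynamic array, sized at run time
    Dim slots As Long
    Dim fIn As Integer, fOut As Integer
    Dim lineText As String
    Dim lineNo As Long           ' number of the line just read (1-based)
    Dim lastOut As Long          ' number of the last line already written
    Dim afterLeft As Long        ' following lines still owed to the output
    Dim j As Long

    slots = nBefore + 1
    ReDim ring(0 To slots - 1)

    fIn = FreeFile
    Open inPath For Input As #fIn
    fOut = FreeFile
    Open outPath For Output As #fOut

    Do While Not EOF(fIn)
        Line Input #fIn, lineText
        lineNo = lineNo + 1
        If InStr(lineText, target) > 0 Then
            ' write the preceding lines that have not been written yet
            j = lineNo - nBefore
            If j < lastOut + 1 Then j = lastOut + 1
            Do While j < lineNo
                Print #fOut, ring(j Mod slots)
                j = j + 1
            Loop
            Print #fOut, lineText
            lastOut = lineNo
            afterLeft = nAfter
        ElseIf afterLeft > 0 Then
            ' still inside the "after" window of an earlier match
            Print #fOut, lineText
            lastOut = lineNo
            afterLeft = afterLeft - 1
        End If
        ring(lineNo Mod slots) = lineText   ' line n lives in slot n Mod slots
    Loop

    Close #fOut
    Close #fIn
End Sub
```

Tracking lastOut is a design choice made for this sketch: it keeps a line from being written twice when two matches are closer together than nBefore + nAfter lines.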
"Ed Wyche" <d***@care.com> wrote in message news:Ozh4liMKGHA.1288@TK2MSFTNGP09.phx.gbl...
> I have a text file that is going to have a couple million lines or so in it.
> I want to be able to search for a particular number (that number can be in
> the file more than once) and then get the 10 lines before and the 10 lines
> after that and write that to a file. I have no problem if I am just search
> for the one line and return that line. Should I read it all into memory and
> then check it or open the file and read each line at a time?
>
> If someone can give me a code sample I would appreciate it?

It would help to know how your 'lines' are delineated (how do you know the size of one line?) and the size of your file...

LFS

"Ed Wyche" <d***@care.com> wrote in message news:Ozh4liMKGHA.1288@TK2MSFTNGP09.phx.gbl...
> I have a text file that is going to have a couple million lines or so in
> it.

Define "lines".

> I want to be able to search for a particular number (that number can be in
> the file more than once) and then get the 10 lines before and the 10 lines
> after that and write that to a file.

Millions of lines is going to be too big to grab into memory all in one go, so your best bet would [probably] be to read the file in "chunks", look for your number in a "chunk", then read the next chunk and look within that. Once you've found an occurrence of your number, you can start working around that point to extract the 10 lines either side. Have a look at Open ... For Binary and Seek (both function and Statement forms).

> I have no problem if I am just search for the one line and return that
> line.

Not until you have to wait for it to finish running ... reading line by line can be remarkably slow ... ;-)

HTH,
Phill W.

On Fri, 3 Feb 2006 08:49:23 -0500, "Ed Wyche" <d***@care.com> wrote:
> I have a text file that is going to have a couple million lines or so in it.
> I want to be able to search for a particular number (that number can be in
> the file more than once) and then get the 10 lines before and the 10 lines
> after that and write that to a file. I have no problem if I am just search
> for the one line and return that line. Should I read it all into memory and
> then check it or open the file and read each line at a time?

Well, I would use eTextMgr from:
http://www.jerryfrench.co.uk/etxtmgr.htm

But I wrote that to do the sort of thing you are after.

Realistically you need to read the file in something like 100k chunks, look for and record the delineators and then look for your data. The actual size of the chunk depends on your disk drive and memory, but at a certain point substituting 1 disk read for two becomes trivial.
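
As a rough illustration of the chunked approach suggested in the last two replies, the sketch below opens the file For Binary, reads it in blocks of about 100k with Get, and reports the byte position of every occurrence of the search text. Everything specific here is an assumption made for the example: the routine name FindOffsets, the 102400-byte chunk size, and printing the positions with Debug.Print. Extracting the surrounding lines from those positions (by looking for the line delimiters on either side) is left out.

```vb
' Sketch only: FindOffsets, CHUNK and the Debug.Print output are
' illustrative assumptions, not code from the thread.
Private Sub FindOffsets(ByVal inPath As String, ByVal target As String)
    Const CHUNK As Long = 102400     ' roughly 100k per read
    Dim f As Integer
    Dim buf As String                ' the bytes just read
    Dim carry As String              ' tail of the previous block
    Dim blockText As String          ' carry + buf, the text actually searched
    Dim totalBytes As Long, pos As Long, n As Long
    Dim blockStart As Long           ' file position of blockText's first character
    Dim hit As Long

    f = FreeFile
    Open inPath For Binary Access Read As #f
    totalBytes = LOF(f)
    pos = 1                          ' Binary file positions are 1-based
    Do While pos <= totalBytes
        n = totalBytes - pos + 1
        If n > CHUNK Then n = CHUNK
        buf = Space$(n)              ' Get fills the string's existing length
        Get #f, pos, buf
        blockText = carry & buf
        blockStart = pos - Len(carry)

        hit = InStr(1, blockText, target, vbBinaryCompare)
        Do While hit > 0
            Debug.Print blockStart + hit - 1    ' 1-based position in the file
            hit = InStr(hit + 1, blockText, target, vbBinaryCompare)
        Loop

        ' Keep the last Len(target) - 1 characters so that a match
        ' straddling two chunks is found on the next pass
        If Len(target) > 1 Then
            carry = Right$(blockText, Len(target) - 1)
        Else
            carry = ""
        End If
        pos = pos + n
    Loop
    Close #f
End Sub
```

The carried-over tail is one character shorter than the search text, so a match cannot lie wholly inside it and no position is reported twice. Whether this is noticeably faster than Line Input for a given file would need to be measured.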