Author
3 Feb 2006 1:49 PM
Ed Wyche
I have a text file that is going to have a couple million lines or so in it.
I want to be able to search for a particular number (that number can be in
the file more than once) and then get the 10 lines before and the 10 lines
after that and write that to a file.  I have no problem if I am just searching
for the one line and returning that line.  Should I read it all into memory and
then check it, or open the file and read one line at a time?

If someone can give me a code sample I would appreciate it.

Thanks
Ed

Author
3 Feb 2006 2:11 PM
Rick Rothstein [MVP - Visual Basic]
> I have a text file that is going to have a couple million lines
> or so in it. I want to be able to search for a particular number
> (that number can be in the file more than once) and then get
> the 10 lines before and the 10 lines after that and write that
> to a file.  I have no problem if I am just search for the one
> line and return that line.  Should I read it all into memory and
> then check it or open the file and read each line at a time?

It kind of depends. What do you mean by a "line" in your file? Are all your
lines single "sentences" terminated by a "newline" character sequence? Or
are they multiple sentence paragraphs, again, terminated by a "newline"
character sequence? Or are you looking for each sentence even if there are
more than one of them in a paragraph? I ask because a Line Input statement
(what I'm guessing you are referring to when you say "read each line") uses
the "newline" character sequence to decide what a line is. Also, to decide
on whether to read the entire file into memory before  parsing it, you
should tell us how large (in megabytes) your average and largest files are
(that is, are your lines short and we are talking about 20 to 50 Megs, or
are they large and we are talking about a half a Gig, or what)?

Rick
Author
3 Feb 2006 2:39 PM
Ed Wyche
This is basically a log file with errors in it.  Each error is on a single
line.  I am using Input # right now and I can find the specific line but I
am having a problem getting the 10 lines before and the 10 lines after.  The
file can be about 100 megs or so.

Author
3 Feb 2006 3:02 PM
Larry Serflaten
"Ed Wyche" <d***@care.com> wrote
> This is basically a log file with errors in it.  Each error is on a single
> line.  I am using Input # right now and I can find the specific line but I
> am having a problem getting the 10 lines before and the 10 lines after.  The
> file can be about 100 megs or so.

Now, what do you want to do with those 10 lines?

If it were me, I'd set up a (FIFO) queue that holds 10 lines.  I'd read
each line and check it for inclusion and place it in the queue. When the
queue gets to be 10 lines long, then each addition to the queue also needs
to pull one off the queue.  If the current line is supposed to be 'kept',
I'd start a countdown variable that would count down from 21, and save the
next 21 lines as they are being removed from the queue.

That would mean the entire file would be read a line at a time and all the
lines would pass through the queue.  It might be a relatively long process,
but perhaps under a few minutes.

The queue could be a Collection and you would fill it by adding the lines
to the collection.  If the Count is greater than 10 after you add a new
line, then the line stored in the first position has to be removed, etc...

It would be easy to set up, try it and see!

LFS
Author
3 Feb 2006 3:13 PM
Ed Wyche
What I need to do is keep the 10 lines before the inclusion and the 10 lines
after the inclusion, and save them along with the inclusion to a file.  There
is going to be more than one so it will have to go all the way through the file.

Author
3 Feb 2006 8:30 PM
Larry Serflaten
"Ed Wyche" <d***@care.com> wrote
> What I need to do is keep the 10 lines before the inclusion and the 10 lines
> after the inclusion, and save them along with the inclusion to a file.  There
> is going to be more than one so it will have to go all the way through the file.

Precisely.  When you find a line that needs saving, there will be 10 lines
in the queue (unless that particular line is one of the first 10 in the
file) so you start a counter down from 21 and save all the lines that come
off the queue until the counter is at 0.  That would save the 10 lines ahead
of the included line, the included line itself, plus 10 lines following the
included line.

In simple terms, you basically cycle all the lines into and out of the
queue.  When a line needs saving you set a flag to indicate the lines coming
out of the queue need to be saved.  That flag is a counter that determines
how many lines to save.  Since you have 10 lines in the queue when you find
a line, you'll get 10 lines saved ahead of the line you found.  After you
set the counter to 21, you save every line that comes off the queue until
that counter reaches 0.  So that's 10 ahead, the line itself, and 10 after
(21 in all); that is the group you're looking for....

LFS
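
A minimal VB6 sketch of the queue-and-countdown idea described above, offered as one possible reading of it and not as Larry's own code. The names ExtractContext, CONTEXT_LINES and toSave are made up for illustration, and InStr is assumed as the test for "this line needs saving". The counter is set to the queue's current Count plus 10 instead of a fixed 21, so a match in the first few lines of the file does not save too much.

```vb
Option Explicit

Private Const CONTEXT_LINES As Long = 10   ' lines kept either side of a match

' Hypothetical helper, not from the thread: copies every line containing
' target, plus CONTEXT_LINES lines either side, from inPath to outPath.
Public Sub ExtractContext(ByVal inPath As String, ByVal outPath As String, _
                          ByVal target As String)
    Dim queue As New Collection   ' FIFO: item 1 is the oldest line
    Dim fIn As Integer, fOut As Integer
    Dim lineText As String
    Dim toSave As Long            ' lines still to save as they leave the queue

    fIn = FreeFile
    Open inPath For Input As #fIn
    fOut = FreeFile
    Open outPath For Output As #fOut

    Do Until EOF(fIn)
        Line Input #fIn, lineText
        queue.Add lineText

        ' A match restarts the countdown: everything now queued (the match
        ' and up to 10 earlier lines) plus the next 10 lines must be saved.
        If InStr(1, lineText, target, vbBinaryCompare) > 0 Then
            toSave = queue.Count + CONTEXT_LINES
        End If

        ' Keep the queue at CONTEXT_LINES entries; save what falls off.
        If queue.Count > CONTEXT_LINES Then
            If toSave > 0 Then
                Print #fOut, queue(1)
                toSave = toSave - 1
            End If
            queue.Remove 1
        End If
    Loop

    ' End of file: write out whatever is still owed from the queue.
    Do While queue.Count > 0 And toSave > 0
        Print #fOut, queue(1)
        toSave = toSave - 1
        queue.Remove 1
    Loop

    Close #fOut
    Close #fIn
End Sub
```

Because every line leaves the queue exactly once, two matches that sit close together produce one merged block in the output instead of duplicated lines.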
Author
3 Feb 2006 9:50 PM
Ed Wyche
Can you give me an example please.

Author
4 Feb 2006 8:29 PM
Larry Serflaten
"Ed Wyche" <whocares> wrote
> Can you give me an example please.

It looks like you already have something going with Mike.
Perhaps that is a better route....

LFS
Author
3 Feb 2006 4:05 PM
Mike Williams
"Ed Wyche" <d***@care.com> wrote in message
news:etSac%23MKGHA.2712@TK2MSFTNGP10.phx.gbl...

> This is basically a log file with errors in it.  Each error is on a single
> line.  I am using Input # right now and I can find the specific line but
> I am having a problem getting the 10 lines before . . .

If I were you I would keep a continuous record of the last dozen or so lines
read and keep reading new lines until you find the one that interests you.
For example, you could have a simple string array (0 to 11) which you could
visualise as a 12 hour clock, but numbered from 0 to 11 instead of 1 to 12.
Just dump each line as it is read into the appropriate element of the array,
each time moving to the next digit around the clock. Keep a running record
of the current line number and the "clock" digit. Then when you find a line
that interests you it is a simple matter to look back through the array to
pick out the ten or so preceding lines and then run a simple loop to pick
out the ten or so following lines. Here's how I would do the first bit
(reading lines and shoving them into the "clock" array until I find one
that interests me):

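Private Sub Command1_Click()
  Dim store(0 To 11) As String ' the 12 slot "clock" of recent lines
  Dim fn As Long, pointer As Long, linenumber As Long
  Dim found As Boolean
  fn = FreeFile
  pointer = -1 ' initialise pointer
  Open "c:\myfile.txt" For Input As fn
  ' stop at end of file so a missing line doesn't raise error 62
  Do Until EOF(fn)
    linenumber = linenumber + 1
    pointer = (pointer + 1) Mod 12 ' move to the next slot, wrapping 11 -> 0
    Line Input #fn, store(pointer)
    If store(pointer) = "looking for this" Then
      found = True
      Exit Do
    End If
  Loop
  Close fn
  If found Then
    MsgBox Format(linenumber) & "  " & Format(pointer) _
      & "  " & Format(store(pointer))
  Else
    MsgBox "Not found"
  End If
End Sub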

Mike
Author
3 Feb 2006 11:44 PM
Ed Wyche
Ok I can see how that works.  Now how do I get the results?  If the item I
want to find is at pointer 5 that means I need to return 5, 4, 3, 2, 1, 0,
11, 10, 9, 8, 7, 6 in this order.  How do I go about that?  Also I am
looking at making the number of lines to return a variable.


Author
4 Feb 2006 9:39 AM
Mike Williams
"Ed Wyche" <whocares> wrote in message
news:esXrEvRKGHA.516@TK2MSFTNGP15.phx.gbl...

> Ok I can see how that works,  Now how do I get the results is the
> item I want to find is at pointer 5 that means I need to return 5, 4,
> 3, 2, 1, 0, 11, 10, 9, 8, 7, 6 in this order.  How do I go about that.

Again you can use the Mod operator or you can simply decrement the value and
test for less than zero. The former is neater, but the latter is easier to
understand. Try this:

pointer = 5
For n = 1 To 12 ' or however many lines you want
  Print store(pointer)
  pointer = pointer - 1
  If pointer < 0 Then pointer = 11
Next n

Mike
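
For comparison, a sketch of the Mod version mentioned above. One thing to watch: in VB, Mod keeps the sign of the left-hand operand, so (0 - 1) Mod 12 gives -1 and not 11. Adding the buffer size before taking the remainder avoids that. PrintNewestFirst and SLOTS are illustrative names, and the array is assumed to run from 0 to 11 as in the earlier code.

```vb
' Hypothetical helper: prints the contents of a 12-slot circular buffer,
' newest line first, starting from the slot that holds the match.
Private Sub PrintNewestFirst(store() As String, ByVal pointer As Long)
    Const SLOTS As Long = 12
    Dim n As Long
    For n = 1 To SLOTS
        Debug.Print store(pointer)
        ' Step back one slot. (pointer - 1) Mod SLOTS would give -1 at
        ' slot 0, so add SLOTS - 1 instead; the result is always 0 to 11.
        pointer = (pointer + SLOTS - 1) Mod SLOTS
    Next n
End Sub
```

Called with pointer = 5 this visits the slots in the order 5, 4, 3, 2, 1, 0, 11, 10, 9, 8, 7, 6.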
Author
5 Feb 2006 3:55 AM
Ed Wyche
Thank you.  I had to do some modifications to get it to work the way I wanted
it to, but it works.  Thanks for the help.

Now another question: how can I make a dynamic array?

Author
5 Feb 2006 7:03 AM
Mike Williams
"Ed Wyche" <whocares> wrote in message
news:O8HRQggKGHA.964@tk2msftngp13.phx.gbl...

> Now another question: how can I make a dynamic array?

You create a dynamic array by first declaring it without specifying the
number of elements and then later using the ReDim statement to specify how
many elements you want (or to later change the number of elements you want).
For example:

Dim store() As String
. . . then at some point in your code:
ReDim store(0 To 11)
. . . then perhaps later:
ReDim store(0 To 23)
. . . or if you want to preserve the existing contents:
ReDim Preserve store(0 To 23)
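
Applied to the "clock" array, a sketch of sizing the buffer from a variable might look like this. InitStore, NextSlot, slots and linesBefore are made-up names, and sizing the buffer as "lines before plus one" is an assumption of this sketch, not something stated in the thread.

```vb
' Hypothetical sketch: a circular buffer whose size is chosen at run time.
Private store() As String
Private slots As Long

Private Sub InitStore(ByVal linesBefore As Long)
    slots = linesBefore + 1         ' room for the match plus the lines before it
    ReDim store(0 To slots - 1)     ' a plain ReDim clears any old contents
End Sub

' Returns the slot that follows pointer, wrapping back to 0 at the end.
Private Function NextSlot(ByVal pointer As Long) As Long
    NextSlot = (pointer + 1) Mod slots
End Function
```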

By the way, you can make the code run about four or five times faster using
a Byte array instead of strings and by getting the file (either in its
entirety or in large chunks) into the Byte array and examining it there.
This method would of course require a lot more code to check the data, but
it's worth considering if speed is your main concern. However, the existing
string code might already run fast enough for you, and it is very much
easier to deal with. I've just tried it on a 60 megabyte sample data file
containing 1,000,000 strings of variable length and on my system it loads and
checks the individual lines in about two to four seconds, depending on
whether you are doing a binary or text compare. If that's fast enough for
you then you'd probably be better off sticking with the simplicity of the
string method.

Mike
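
A rough sketch of the Byte array route, for anyone who wants to try it. It only shows the loading step and a trivial scan; LoadBytes and CountLineFeeds are hypothetical names, and the file is assumed to be single-byte (ANSI) text that fits in memory.

```vb
' Hypothetical helper: loads a whole file into a Byte array in one read.
Private Function LoadBytes(ByVal path As String) As Byte()
    Dim fn As Integer
    Dim buf() As Byte
    fn = FreeFile
    Open path For Binary Access Read As #fn
    If LOF(fn) > 0 Then
        ReDim buf(0 To LOF(fn) - 1)
        Get #fn, , buf            ' fills the array from the start of the file
    End If
    Close #fn
    LoadBytes = buf
End Function

' Counts line feed bytes (value 10). Assumes buf is not empty, because
' LBound and UBound raise an error on an array that was never sized.
Private Function CountLineFeeds(buf() As Byte) As Long
    Dim i As Long
    For i = LBound(buf) To UBound(buf)
        If buf(i) = 10 Then CountLineFeeds = CountLineFeeds + 1
    Next i
End Function
```

Finding the search text and the ten lines either side means walking the bytes and remembering where each line feed was, which is the "lot more code" part.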
Author
3 Feb 2006 2:33 PM
Larry Serflaten
"Ed Wyche" <d***@care.com> wrote in message
news:Ozh4liMKGHA.1288@TK2MSFTNGP09.phx.gbl...
> I have a text file that is going to have a couple million lines or so in it.
> I want to be able to search for a particular number (that number can be in
> the file more than once) and then get the 10 lines before and the 10 lines
> after that and write that to a file.  I have no problem if I am just search
> for the one line and return that line.  Should I read it all into memory and
> then check it or open the file and read each line at a time?
>
> If someone can give me a code sample I would appreciate it?


It would help to know how your 'lines' are delineated, (how do you know
the size of one line?) and the size of your file...

LFS
Author
3 Feb 2006 3:48 PM
Phill W.
"Ed Wyche" <d***@care.com> wrote in message
news:Ozh4liMKGHA.1288@TK2MSFTNGP09.phx.gbl...
>I have a text file that is going to have a couple million lines or so in
>it.


Define "lines".

> I want to be able to search for a particular number (that number can be in
> the file more than once) and then get the 10 lines before and the 10 lines
> after that and write that to a file.

Millions of lines is going to be too big to grab into memory all in one go,
so your best bet would [probably] be to read the file in "chunks", look
for your number in a "chunk", then read the next chunk and look within
that.  Once you've found an occurrence of your number, you can start
working around that point to extract the 10 lines either side.

Have a look at
    Open ... For Binary
and
    Seek (both function and Statement forms).

> I have no problem if I am just search for the one line and return that
> line.

Not until you have to wait for it to finish running ... reading line by line
can be remarkably slow ... ;-)

HTH,
    Phill  W.


Author
3 Feb 2006 4:30 PM
J French
On Fri, 3 Feb 2006 08:49:23 -0500, "Ed Wyche" <d***@care.com> wrote:

>I have a text file that is going to have a couple million lines or so in it.
>I want to be able to search for a particular number (that number can be in
>the file more than once) and then get the 10 lines before and the 10 lines
>after that and write that to a file.  I have no problem if I am just search
>for the one line and return that line.  Should I read it all into memory and
>then check it or open the file and read each line at a time?

Well, I would use eTextMgr from :-

http://www.jerryfrench.co.uk/etxtmgr.htm

But I wrote that to do the sort of thing you are after

Realistically you need to read the file in something like 100k chunks,
look for and record the delineators and then look for your data.

The actual size of the chunk depends on your disk drive and memory,
but at a certain point substituting 1 disk read for two becomes
trivial.
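
A minimal sketch of chunked reading along the lines suggested above, using Open ... For Binary and Get. It is not the eTextMgr code. CountMatches, CHUNK_SIZE and pending are illustrative names, the file is assumed to be single-byte (ANSI) text with each line ending in a line feed, and Split needs VB6.

```vb
Private Const CHUNK_SIZE As Long = 102400   ' about 100 KB per read

' Hypothetical helper: counts the lines containing target, reading the
' file in CHUNK_SIZE pieces rather than one line at a time.
Public Function CountMatches(ByVal path As String, ByVal target As String) As Long
    Dim fn As Integer
    Dim remaining As Long, n As Long, i As Long
    Dim chunk As String, pending As String
    Dim lines() As String

    fn = FreeFile
    Open path For Binary Access Read As #fn
    remaining = LOF(fn)

    Do While remaining > 0
        n = CHUNK_SIZE
        If n > remaining Then n = remaining
        chunk = Space$(n)        ' Get fills a string up to its current length
        Get #fn, , chunk
        remaining = remaining - n

        ' Put the partial line left from the previous chunk back in front.
        lines = Split(pending & chunk, vbLf)
        ' Every element except the last is a complete line.
        For i = 0 To UBound(lines) - 1
            If InStr(1, lines(i), target, vbBinaryCompare) > 0 Then
                CountMatches = CountMatches + 1
            End If
        Next i
        pending = lines(UBound(lines))   ' empty, or the start of the next line
    Loop
    Close #fn

    ' A last line with no trailing line feed is still waiting in pending.
    If InStr(1, pending, target, vbBinaryCompare) > 0 Then
        CountMatches = CountMatches + 1
    End If
End Function
```

The carry-over in pending is what keeps a line that straddles two chunks from being missed or counted twice.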