
Efficient way to parse large files

Author
18 Oct 2005 12:43 PM
vm
I use VB6 to parse large text files. Currently I open locally saved text
files with:
Open FileName For Input As #1

I then read line by line and parse parts of each line:
Line Input #1, temp
' parse temp for what I need

The parsed output is accumulated in a variable until the entire input file has been read:
whole = whole + ParsedText + vbCrLf

Then the output is written to a file:
Open OutFile For Output As #2
Print #2, whole

The problem I am running into is that the input files are getting larger (around 250
MB) and the processing time is growing much faster than the file size.

Is the above way of parsing a file efficient, or is there a more efficient
way to handle these large files? The code is relatively simple and
works well on smaller files, but the large files are killing me. Any tips on
how to optimize it?

Thanks

Author
18 Oct 2005 1:10 PM
Norm Cook
You may gain some speed by reading the entire file into memory:

Public Function ReadFileBinary(ByVal sFilePath As String) As String
Dim hFile As Long
hFile = FreeFile
Open sFilePath For Binary As #hFile
ReadFileBinary = String$(LOF(hFile), Chr$(0))
Get #hFile, , ReadFileBinary
Close #hFile
End Function

Usage:
Dim Txt As String
Txt = ReadFileBinary("c:\somepath\somefile.txt")

Then you can use the Split function to get an array to parse

Dim Arr() As String
Dim i As Long
Arr = Split(Txt, vbNewLine)
For i = 0 to UBound(Arr)
'do your parsing
Next

BTW,
> whole = whole + ParsedText + vbcrlf
should be changed to
whole = whole & ParsedText & vbcrlf

Author
18 Oct 2005 3:57 PM
vm
I have been using VB since version 3 and have always avoided reading files as
binary, not sure why. I tried your suggestion and there is a speed
difference. I did a test on the same file (3 MB, 13,000 lines). My
inefficient way took 3.5 minutes and your suggestion took just over 2
minutes. I am not sure what it will do on the larger files, but it has to be
better than what I am doing now.

I will keep trying to find the best way, but thank you for this. It is a big
step in the right direction.

vm

Author
18 Oct 2005 4:47 PM
Larry Serflaten
"vm" <v*@discussions.microsoft.com> wrote
> I have been using VB since version 3 and have always avoided reading files as
> binary, not sure why. I tried your suggestion and there is a speed
> difference. I did a test on the same file {3 megs & 13000 lines}. My
> inefficient way took 3.5 minutes and your suggestion took just over 2
> minutes. I am not sure what it will do on the larger files, but it has to be
> better than I am doing now.
>
> I will keep trying to find the best way, but thank you for this. It is a big
> step in the right direction.

You might want to share some of that parsing code.  It may be that a
significant increase can be found there as well.  As you know, even a small
savings can add up when multiplied by some large number of iterations.

You may already have a fast algorithm, but it can't hurt to get another set
of eyes on the problem (in this case you get several sets in one go...  ;-)

LFS
Author
18 Oct 2005 5:34 PM
Mike Williams
"vm" <v*@discussions.microsoft.com> wrote in message
news:280C39A1-79F0-45D3-B906-804D0458C4F5@microsoft.com...

> I will keep trying to find the best way, but thank you for this.
> [twice the speed] It is a big step in the right direction.

Have you not tried my suggestion of using the Mid$ statement (not the Mid$
function) on a long string (rather than using repeated string
concatenation)? That should give speed increases orders of magnitude better than
you are getting now! Have you tried it?

Mike
Author
20 Oct 2005 1:10 PM
vm
I am testing the Mid$ suggestion you made. I have not done it that way
before. I will have a separate app doing it that way to benchmark the speed
difference.

Thanks

vm

Author
18 Oct 2005 1:18 PM
Mike Williams
"vm" <v*@discussions.microsoft.com> wrote in message
news:BA78CE8D-66F0-43FB-B208-58C087003F55@microsoft.com...

> The parsed output is stored in a variable until the entire input file is
> read
> whole = whole + ParsedText + vbcrlf

You haven't said what it is you're looking for in the file, so it's hard to
give a definite answer, but it may be faster if you open the file as Binary
and use Get to read and digest large chunks of it as many times as is
required. However, one thing you are currently doing that is *definitely*
slowing your code down is your repeated concatenation of the output string.
That is *very* slow, especially as the string grows in length, because in
order to add each substring to the main string VB has to create another
string entirely and then throw the original away. It would be far quicker to
initially create a very long "output string" and then maintain a simple
"pointer" telling you where to insert the next substring. Something like:

Dim whole As String, ParsedText As String
Dim p1 As Long, p2 As Long

whole = Space$(500000) ' initialise output string
p1 = 1 ' initialise pointer
' then each time you want to add a substring you could do:
p2 = Len(ParsedText)
If (Len(whole) - p1) < (p2 + 2) Then ' not enough room for the text plus vbCrLf
  whole = whole & Space$(500000)
End If
Mid$(whole, p1) = ParsedText ' this is very fast
Mid$(whole, p1 + p2) = vbCrLf ' and so is this
p1 = p1 + p2 + 2

Then when you have finished you trim the output string to its correct size
in accordance with the value held in p1 (for example, whole = Left$(whole, p1 - 1)).
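
The preallocate-and-Mid$ idea described above can be wrapped in a few small routines. This is only a sketch for a standard module: the names StartOutput, AppendLine, FinishOutput, OutBuf, OutPos and CHUNK are hypothetical and do not come from the thread, and the loop (rather than a single If) is an added assumption so that a substring longer than one chunk still fits.

```vb
Option Explicit

' Hypothetical helper names; they are illustrative, not from the thread.
Private Const CHUNK As Long = 500000
Private OutBuf As String   ' preallocated output buffer
Private OutPos As Long     ' 1-based position of the next free character

Public Sub StartOutput()
    OutBuf = Space$(CHUNK)
    OutPos = 1
End Sub

Public Sub AppendLine(ByRef Text As String)
    Dim Needed As Long
    Needed = Len(Text) + 2                 ' text plus vbCrLf
    ' Grow until the free space (Len(OutBuf) - OutPos + 1) can hold it
    Do While Len(OutBuf) - OutPos + 1 < Needed
        OutBuf = OutBuf & Space$(CHUNK)
    Loop
    ' Mid$ statement: overwrite in place instead of building a new string
    If Len(Text) > 0 Then Mid$(OutBuf, OutPos) = Text
    Mid$(OutBuf, OutPos + Len(Text)) = vbCrLf
    OutPos = OutPos + Needed
End Sub

Public Function FinishOutput() As String
    ' Drop the unused padding after the last line written
    FinishOutput = Left$(OutBuf, OutPos - 1)
End Function
```

Call StartOutput once, AppendLine for each parsed line, and write the result of FinishOutput to the output file at the end.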

Mike





Author
18 Oct 2005 1:48 PM
J French
On Tue, 18 Oct 2005 05:43:05 -0700, "vm"
<v*@discussions.microsoft.com> wrote:

>I use VB6 to parse large text files. Currently I open locally saved text
>files with:

Never use #1  -  always use FreeFile

<snip>
>Is the above way for parsing a file efficient or is there a more efficient
>way to handle parsing these large files. The code is relatively simple and
>works well on smaller files, but the large files are killing me. Any tips on
>how to optimize it?

Here is a VB Class that will do what you want
- and rather faster

Disk access is slow.

VERSION 1.0 CLASS
BEGIN
  MultiUse = -1  'True
END
Attribute VB_Name = "cReadFileStream"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = True
Attribute VB_PredeclaredId = False
Attribute VB_Exposed = False
Option Explicit

' 2/8/01 JF
' 3/8/01 JF  -   Block Read Added - watch for Block > File Size
'

Private Type TCMN
    FileName As String
    FileSize As Long
    Delin As String
    Buffer As String
    BufferLen As Long
    BufferPos As Long
    BytesDone As Long
    EofFlag As Boolean
    Channel As Integer
End Type

Private cmn As TCMN

' ---
Private Sub Class_Initialize()
   cmn.Delin = vbCrLf
   cmn.BufferLen = 100000
End Sub

' ---
Public Function Create(FileName$) As Boolean

    cmn.FileName = FileName
    Create = False
    cmn.Buffer = ""
    cmn.Channel = 0
    cmn.EofFlag = False
    cmn.BufferPos = 1
    cmn.BytesDone = 0
    ' ---
    If FileExists(FileName$) = False Then
       MsgBox "cReadFileStream: " + FileName$ _
               + " File not Found"
       Exit Function
    End If
    ' ---
    If FileExists(FileName$) Then
       cmn.FileSize = FileLen(cmn.FileName)
       cmn.Channel = FreeFile
       Open FileName For Binary Access Read As #cmn.Channel
       Create = True
    End If
End Function

' ---
Public Function ReadDelineatedLine() As String
    Dim Q&, L&

    If cmn.Channel = 0 Then
       MsgBox "cReadFileStream - ReadLine - but file not Open"
       cmn.EofFlag = True
       Exit Function
    End If
    ' ---
    If cmn.EofFlag Then
       MsgBox "cReadFileStream - Read Past End of File"
       Exit Function
    End If
    ' ---
    If InStr(cmn.BufferPos, cmn.Buffer, cmn.Delin) = 0 Then
       Call LS_FillBuffer
       ' --- When File completely Read then append Delin if Needed
       If cmn.BytesDone = cmn.FileSize Then
          If Right$(cmn.Buffer, Len(cmn.Delin)) <> cmn.Delin Then
             cmn.Buffer = cmn.Buffer + cmn.Delin
          End If
       End If
    End If

    ' ---
    Q = InStr(cmn.BufferPos, cmn.Buffer, cmn.Delin)
    If Q Then
       L = Q - cmn.BufferPos
       ReadDelineatedLine = Mid$(cmn.Buffer, cmn.BufferPos, L)
       cmn.BufferPos = Q + Len(cmn.Delin)
    End If
    If Q = 0 Then
       MsgBox "cReadFileStream - Read - Unexpected Error" _
               + vbCrLf + "Delineator not Found"
    End If

    ' --- Was this the last Field of the Last Buffer
    If cmn.BytesDone >= cmn.FileSize Then
       If Q >= Len(cmn.Buffer) - Len(cmn.Delin) Then
          cmn.EofFlag = True
       End If
    End If
End Function

' ---
Public Sub ReadBlock(Block$)
    Dim BlockLen&, Q&

    If cmn.Channel = 0 Then
       MsgBox "cReadFileStream - ReadBlock - but file not Open"
       cmn.EofFlag = True
       Exit Sub
    End If
    ' ---
    If cmn.EofFlag Then
       MsgBox "cReadFileStream - Read Past End of File"
       Exit Sub
    End If

    ' ---
    BlockLen& = Len(Block$)

    ' --- Do we need to fill the Buffer
    If (cmn.BufferPos + BlockLen) > Len(cmn.Buffer) Then
       If BlockLen > cmn.BufferLen Then  ' increase buffer size
          cmn.BufferLen = cmn.BufferPos + BlockLen
       End If
       Call LS_FillBuffer
    End If

    ' --- If insufficient Data left
    Q = Len(cmn.Buffer$) - cmn.BufferPos + 1   ' Bytes Left
    If BlockLen > Q Then
       Block$ = Space$(Q)
       BlockLen = Q
    End If

    ' --- Copy the data
    Mid$(Block$, 1, BlockLen) = Mid$(cmn.Buffer$, cmn.BufferPos, BlockLen)
    cmn.BufferPos = cmn.BufferPos + BlockLen

    ' --- Was this the last Field of the Last Buffer
    If cmn.BytesDone >= cmn.FileSize Then
       If cmn.BufferPos > Len(cmn.Buffer$) Then
          cmn.EofFlag = True
       End If
    End If

End Sub


' ---
Public Function EofFlag() As Boolean
    EofFlag = cmn.EofFlag
End Function

' ---
Public Function Size() As Long
    Size = cmn.FileSize
End Function

' ---
Public Sub Free()
    If cmn.Channel <> 0 Then
       Close #cmn.Channel
       cmn.Channel = 0
    End If
End Sub

' ---
Private Sub LS_FillBuffer()
   Dim Hold$, Q&

   ' --- First time in cmn.Buffer = ""
   Hold$ = Mid$(cmn.Buffer, cmn.BufferPos)

   If cmn.BytesDone >= cmn.FileSize Then
      Exit Sub
   End If

   ' ---
   If Len(cmn.Buffer) < cmn.BufferLen Then
      cmn.Buffer = Space$(cmn.BufferLen)
   End If

   ' --- Reduce Buffer Size at End of File
   Q = cmn.FileSize - cmn.BytesDone
   If Q < Len(cmn.Buffer) Then
      cmn.Buffer = Space$(Q)
   End If

   ' --- Read a Chunk
   Get #cmn.Channel, cmn.BytesDone + 1, cmn.Buffer
   cmn.BytesDone = cmn.BytesDone + Len(cmn.Buffer)

   ' --- Add leftover chunk if needed
   If Len(Hold$) Then
      cmn.Buffer = Hold + cmn.Buffer
   End If
   ' ---
   cmn.BufferPos = 1

End Sub

Private Sub Class_Terminate()
   Me.Free
End Sub

'
' Support Routines
'
Function FileExists(Fle$) As Boolean
    Dim Q%
    On Error Resume Next
    Q = GetAttr(Fle$)
    If Err = 0 Then
       If (Q And vbDirectory) = 0 Then
          FileExists = True
       End If
    End If
    Err.Clear
End Function
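
A possible way to drive the class above, using only its public members (Create, EofFlag, ReadDelineatedLine, Free). This is a sketch: the Sub name DemoReadFileStream and the file path are placeholders, and it assumes the class has been saved in the project as cReadFileStream.

```vb
' Usage sketch for the cReadFileStream class above.
' The Sub name and the file path are placeholders, not from the thread.
Public Sub DemoReadFileStream()
    Dim Stream As cReadFileStream
    Dim sLine As String

    Set Stream = New cReadFileStream
    If Stream.Create("c:\somepath\somefile.txt") Then
        Do While Not Stream.EofFlag
            sLine = Stream.ReadDelineatedLine
            ' parse sLine here
        Loop
        Stream.Free
    End If
End Sub
```

Because the class reads the file in 100000-byte chunks and hands back one line at a time, the calling loop looks much like the original Line Input loop.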
Author
18 Oct 2005 2:06 PM
Rick Rothstein [MVP - Visual Basic]
> I use VB6 to parse large text files.

Can you describe your parsing operation to us? Are you simply
replacing text for text or is there more to it than that?

Rick
Author
18 Oct 2005 3:13 PM
vm
I am actually going through large log files and extracting what I need. I can
go from 250 MB and parse it down to about 10 MB of relevant data. The
actual parsing uses InStr, Right$, and Left$ to pull specific bits of info.
Each line that contains a target string is then parsed into specific fields
for that row (record). This project kind of took off from a small "good idea"
and, because of the file size, is in need of optimization.

Author
18 Oct 2005 4:27 PM
Larry Serflaten
"vm" <v*@discussions.microsoft.com> wrote in message news:7165EC72-34AC-4AB3-8A8B-19131181298B@microsoft.com...
> I am actually going through large log files and extracting what I need. I can
> go from 250 megs and parse it down to about 10 megs of relevant data. The
> actual parsing uses instr, right$, & left$ to pull specific bits of info.
> Each line that contains a target string is then parsed into specific fields
> for that row {record}. This project kind of took off from a small "good idea"
> and because of the file size is in need of optimization.

Can I suggest you simply open both files, one for input and one for output,
and proceed as before, only write out the parsed text immediately, instead of
saving it to a string.

(adjusted repost of your initial algorithm: )

Was:

open FileName for input as #1
line input #1, temp
"parse temp for what I need"
whole = whole + ParsedText + vbcrlf

Then the output is written to a file:
open OutFile for output as #2
print #2, whole


Try:

open FileName for input as #1
open OutFile for output as #2
line input #1, temp
"parse temp for what I need"
print #2, ParsedText


It's a minimal change from what you have and should
offer some savings in speed.  Do take note of Rick's post about
using FreeFile.  It actually should look more like:

FileIn = FreeFile
Open InputFile For Input As #FileIn Len = 4096
FileOut = FreeFile
Open OutputFile For Output As #FileOut Len = 16384
Do While Not EOF(FileIn)
  Line Input #FileIn, temp
  Print #FileOut, ParsedText(temp)
Loop
Close

(Where ParsedText is a function that accepts the input lines and returns
the desired output from that text....)
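
As a rough illustration of what such a ParsedText function and its calling loop might look like: the target string "ERROR", the choice to keep everything from the target to the end of the line, and the name FilterLog are all assumptions made for this sketch, since the thread never shows the real log format or parsing rules. Lines without the target are skipped rather than written as blank lines.

```vb
Option Explicit

' Hypothetical parser: the target string and the output layout are
' assumptions; the real log format is not shown in the thread.
Public Function ParsedText(ByRef sLine As String) As String
    Const TARGET As String = "ERROR"
    Dim p As Long
    p = InStr(1, sLine, TARGET, vbBinaryCompare)
    If p > 0 Then
        ' Keep everything from the target string to the end of the line
        ParsedText = Mid$(sLine, p)
    End If
End Function

' Hypothetical driver: reads one line, writes one result, keeps nothing in memory
Public Sub FilterLog(ByRef InputFile As String, ByRef OutputFile As String)
    Dim FileIn As Integer, FileOut As Integer
    Dim temp As String, result As String

    FileIn = FreeFile
    Open InputFile For Input As #FileIn
    FileOut = FreeFile
    Open OutputFile For Output As #FileOut
    Do While Not EOF(FileIn)
        Line Input #FileIn, temp
        result = ParsedText(temp)
        If Len(result) > 0 Then Print #FileOut, result
    Loop
    Close #FileOut
    Close #FileIn
End Sub
```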

HTH
LFS
Author
18 Oct 2005 4:55 PM
Larry Serflaten
"Larry Serflaten" <serfla***@usinternet.com> wrote

> ... Do take note of Rick's post about
> using FreeFile.


My mistake, it was J French's post....

Sorry for the confusion!
LFS
Author
18 Oct 2005 2:37 PM
Larry Rebich
Some years ago I wrote a tip of the month that I call 'Using ADO to Read and
Parse a Text File'. Link to http://www.buygold.net/tips then look for the
April 2002 tip of the month. A sample program is provided. From the tip's
intro:

ADO [Active Data Objects] can be used to read a variety of file formats. I
have used it to read and parse a CSV and fixed width formatted text file.
ADO reads and parses the file into a recordset. When I first needed to parse
a text file I tried various routines - some provided via other users - and
none seemed satisfactory. So I did some research and discovered that ADO
will perform the task. I've revisited the code, wrote a demo program and
made it the April 2002 tip-of-the-month.

--
Cheers,
Larry Rebich

More tips link to:
http://www.buygold.net/tips
Author
18 Oct 2005 4:08 PM
Someone
You have to read a block of text, say 4096 bytes, then divide it into lines.
You have to account for small files and for the last block, because it is usually
smaller than 4096 bytes. You also have to account for partial lines at the end
of a block and join them with the next block.



Author
18 Oct 2005 5:23 PM
Someone
Try this untested routine. Put your parsing code in the ProcessLine function. It
reads 32768 bytes at a time and calls ProcessLine for each line it
encounters. This routine does not work correctly if another process expands
or truncates the file while you are parsing it.

Public Function ProcessFile(ByRef FileName As String) As Long
    Const BLOCK_SIZE As Long = 32768
    Dim f As Integer
    Dim BytesLeft As Long       ' Bytes of the file not yet read
    Dim StartPos As Long        ' Start of the current line within Buffer
    Dim EndPos As Long          ' Position of the next vbCrLf within Buffer
    Dim Block As String         ' Raw data read from the file
    Dim Buffer As String        ' Carried-over partial line plus the new block
    Dim Carry As String         ' Partial line left at the end of a block

    f = FreeFile
    Open FileName For Binary Access Read As #f
    BytesLeft = LOF(f)

    Do While BytesLeft > 0
        ' Size the block; the last (or only) block may be shorter
        If BytesLeft < BLOCK_SIZE Then
            Block = String$(BytesLeft, 0)
        Else
            Block = String$(BLOCK_SIZE, 0)
        End If
        Get #f, , Block
        BytesLeft = BytesLeft - Len(Block)

        ' Prepend the partial line carried over from the previous block
        Buffer = Carry & Block
        StartPos = 1
        EndPos = InStr(StartPos, Buffer, vbCrLf, vbBinaryCompare)
        Do While EndPos <> 0
            ' Process the line
            ProcessLine Mid$(Buffer, StartPos, EndPos - StartPos)
            ProcessFile = ProcessFile + 1   ' Return value: lines processed
            StartPos = EndPos + 2
            EndPos = InStr(StartPos, Buffer, vbCrLf, vbBinaryCompare)
        Loop
        ' Keep whatever follows the last vbCrLf for the next block
        Carry = Mid$(Buffer, StartPos)
    Loop

    If Len(Carry) > 0 Then
        ' Last line in the file did not have CrLf
        ProcessLine Carry
        ProcessFile = ProcessFile + 1
    End If

    Close #f

End Function

Private Function ProcessLine(ByRef sLine As String) As Long
    ' Put your parsing code here
End Function





Author
18 Oct 2005 9:54 PM
Karl E. Peterson
vm wrote:
> The parsed output is stored in a variable until the entire input file
> is read whole = whole + ParsedText + vbcrlf

Mike Williams probably nailed it.  Concatenation is a killer, especially when building up
to a 10 MB file!  (I think I saw you say that, somewhere.)

Anyway, you need to either adopt his Mid$ suggestion, or take it to the next level
and use something like http://vb.mvps.org/samples/StrBldr
--
Working Without a .NET?
http://classicvb.org/petition