
Parsing UNIX text files

Author
21 Oct 2005 7:43 AM
Neale
Hi all,

I have a parsing speed issue that is causing me some problems.  I am parsing
plain text files in the multi-Gigabyte size range, and I am using the
following code;

Line Input #sngSourceFile, mvarTraceLine

Problem is that if the file I am parsing is created in UNIX, the vbCrLf is
not picked up and therefore the line being read is not the line I expect.

Initially to solve this issue I have been reading lines char by char using
the following code;

    Do While True
        strFileChar = Input(1, #sngTraceFileNum)
        If Asc(strFileChar) = 10 Then
            ' A line feed character has been found
            Exit Do
        Else
            ' Build up the line string
            strLineText = strLineText & strFileChar
        End If
    Loop

Of course, this takes about 10,000 years!!!

So my question is - is there any way to change the line delimiter used for
"Line Input" so that it recognises the correct end-of-line?

Thanks for any suggestions.

Author
21 Oct 2005 7:58 AM
Rick Rothstein [MVP - Visual Basic]
> I have a parsing speed issue that is causing me some problems.  I am parsing
> plain text files in the multi-Gigabyte size range, and I am using the
> following code;
>
> Line Input #sngSourceFile, mvarTraceLine
>
> Problem is that if the file I am parsing is created in UNIX, the vbCrLf is
> not picked up and therefore the line being read is not the line I expect.

UNIX uses the Line Feed character by itself as its newline
delimiter (and as a point of information, classic Mac OS uses the
Carriage Return by itself) and, as you already know, VB (actually
Windows) uses the combination of a Carriage Return followed by a
Line Feed.

> So my question - is there any way to change the line delimiter used for
> "Line Input" so that it recognises the correct end-of-line?

Not to my knowledge. Back in the mid-1980s when I worked in UNIX I
seem to remember that our network server software took care of the
translations of the newline character sequences whenever Windows
retrieved a file across the network. I'm not really sure what to
tell you for your problem though. If this were a smaller file, you
could read the entire file in using Binary read mode and Split on
the vbLf character, but that won't work for a multi-Gig file.
Perhaps you could read in the file in smaller, more manageable
"chunks" (say 5 or 10 Megs) using a Binary read mode and Split
these "chunks" on the vbLf character. Of course, you would have to
handle the split lines for those chunks (the majority of them, I
would imagine) that didn't end on a Line Feed boundary.
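
A minimal sketch of the chunked approach described above, assuming a pure LF-delimited, single-byte (ANSI) file under 2 GB, since LOF returns a Long. The names ReadUnixFile, ProcessLine and CHUNK_SIZE are hypothetical, and the 5 MB chunk size is an arbitrary choice, not a tuned value.

```vb
Option Explicit

' Hypothetical chunk size: 5 MB, chosen arbitrarily for illustration
Private Const CHUNK_SIZE As Long = 5242880

' Reads an LF-delimited file in chunks and hands each complete line
' to ProcessLine (a hypothetical placeholder routine).
Public Sub ReadUnixFile(ByVal strPath As String)
    Dim intFile As Integer
    Dim bytChunk() As Byte
    Dim strLines() As String
    Dim strCarry As String      ' partial line left over from the previous chunk
    Dim lngLeft As Long         ' bytes still to be read
    Dim lngThis As Long         ' size of the current chunk
    Dim i As Long

    intFile = FreeFile
    Open strPath For Binary Access Read As #intFile
    lngLeft = LOF(intFile)      ' LOF returns a Long: files must be under 2 GB

    Do While lngLeft > 0
        lngThis = CHUNK_SIZE
        If lngThis > lngLeft Then lngThis = lngLeft
        ReDim bytChunk(0 To lngThis - 1)
        Get #intFile, , bytChunk
        lngLeft = lngLeft - lngThis

        ' Prepend the leftover from the last chunk, then split on LF
        strLines = Split(strCarry & StrConv(bytChunk, vbUnicode), vbLf)

        ' Every element except the last is a complete line
        For i = 0 To UBound(strLines) - 1
            ProcessLine strLines(i)
        Next i

        ' The last element may be cut off mid-line, so carry it forward
        strCarry = strLines(UBound(strLines))
    Loop
    Close #intFile

    ' Anything still carried is a final line with no trailing LF
    If Len(strCarry) > 0 Then ProcessLine strCarry
End Sub

' Placeholder: replace with the real per-line parsing
Private Sub ProcessLine(ByVal strLine As String)
    Debug.Print strLine
End Sub
```

The carry variable is what handles lines that straddle a chunk boundary: the final element of each Split is never processed until the next chunk (or the end of the file) completes it.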

Rick
Author
21 Oct 2005 8:24 AM
J French
On Fri, 21 Oct 2005 03:58:39 -0400, "Rick Rothstein [MVP - Visual
Basic]" <rickNOSPAMnews@NOSPAMcomcast.net> wrote:

>> I have a parsing speed issue that is causing me some problems.  I am parsing
>> plain text files in the multi-Gigabyte size range, and I am using the
>> following code;
>>
>> Line Input #sngSourceFile, mvarTraceLine
>>
>> Problem is that if the file I am parsing is created in UNIX, the vbCrLf is
>> not picked up and therefore the line being read is not the line

Do a Google search for:   "J French" cReadFileStream

I've posted, a few times, a VB class that reads a delimited file.
Just add a property to change the delimiter and you'll have a fast,
generic alternative to Line Input that works on whatever you want.

Also, watch out: if your files go over 2 GB then you are into using
APIs for file access.
Author
21 Oct 2005 8:36 AM
Neale
Thanks for your input Rick,

As a result of your information, I am looking at the possibility of my
customer changing their unix settings to provide the CrLf that I require.  At
the end of the day, they are the ones complaining about speed, so if there is
something they can do it's worth looking into.

Failing that, I suppose I can also read the file in large chunks into memory
and process numerous lines from memory splitting on the Lf character. 
Another method I was considering is using a fixed length string to build up
each line so that VB doesn't have to chuck out the old string each time I
concatenate.
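
A rough sketch of the preallocated-buffer idea mentioned above, using the Mid$ statement to overwrite characters in place rather than concatenating. The names BufReset, BufAppend and BufText and the 1024-character starting size are hypothetical choices for illustration only.

```vb
Option Explicit

Private mstrBuf As String       ' preallocated storage
Private mlngLen As Long         ' number of characters currently in use

' Start a new line without releasing the buffer
Public Sub BufReset()
    If Len(mstrBuf) = 0 Then mstrBuf = Space$(1024)   ' arbitrary start size
    mlngLen = 0
End Sub

' Append text, growing the buffer only when it runs out of room
Public Sub BufAppend(ByVal strText As String)
    Dim lngAdd As Long

    lngAdd = Len(strText)
    If lngAdd = 0 Then Exit Sub
    If Len(mstrBuf) = 0 Then mstrBuf = Space$(1024)

    ' Double the buffer until the new text fits
    Do While mlngLen + lngAdd > Len(mstrBuf)
        mstrBuf = mstrBuf & Space$(Len(mstrBuf))
    Loop

    ' Mid$ used as a statement overwrites characters in place
    Mid$(mstrBuf, mlngLen + 1, lngAdd) = strText
    mlngLen = mlngLen + lngAdd
End Sub

' Return only the part of the buffer that has been filled
Public Function BufText() As String
    BufText = Left$(mstrBuf, mlngLen)
End Function
```

In the character-by-character loop, BufAppend strFileChar would stand in for strLineText = strLineText & strFileChar, with BufReset called at the start of each line. This only removes the cost of reallocating the string; the one-character Input calls remain.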

If anybody has any other suggestions, I would appreciate your input.

Author
21 Oct 2005 10:45 AM
Jim Mack
Neale wrote:
> Thanks for your input Rick,
>
> As a result of your information, I am looking at the possibility of my
> customer changing their unix settings to provide the CrLf that I
> require.  At the end of the day, they are the ones complaining about
> speed, so if there is something they can do it's worth looking into.
>
> Failing that, I suppose I can also read the file in large chunks into
> memory and process numerous lines from memory splitting on the Lf
> character. Another method I was considering is using a fixed length
> string to build up each line so that VB doesn't have to chuck out the
> old string each time I concatenate.
>
> If anybody has any other suggestions, I would appreciate your input.

It can be much easier.

Using Get / Put with a byte array in binary mode, you can just replace all LF characters with CRs. This will not change the size of the file, so it's quite fast. Then VB's Line Input will work as expected -- VB does not require CRLF as its delimiter; it uses CR and ignores any following LF.

Ask if you need to see VB code to do the conversion (maybe 30 lines).
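
One way the conversion described above might look. This is only a sketch, not the poster's own code. It assumes the file is purely LF-delimited (a file that already contains CRLF pairs would end up with CR CR, which reads as extra blank lines) and is under 2 GB, since LOF and file positions are Longs. ConvertLfToCr and the 1 MB chunk size are hypothetical.

```vb
Option Explicit

' Rewrites a file in place, turning every LF (10) into CR (13).
' The file size does not change, so each chunk goes back where it came from.
Public Sub ConvertLfToCr(ByVal strPath As String)
    Const CHUNK_SIZE As Long = 1048576      ' 1 MB, an arbitrary choice
    Dim intFile As Integer
    Dim bytBuf() As Byte
    Dim lngSize As Long
    Dim lngPos As Long          ' 1-based file position of the current chunk
    Dim lngThis As Long         ' size of the current chunk
    Dim i As Long
    Dim blnDirty As Boolean

    intFile = FreeFile
    Open strPath For Binary Access Read Write As #intFile
    lngSize = LOF(intFile)      ' Long: only covers files under 2 GB
    lngPos = 1

    Do While lngPos <= lngSize
        lngThis = lngSize - lngPos + 1
        If lngThis > CHUNK_SIZE Then lngThis = CHUNK_SIZE
        ReDim bytBuf(0 To lngThis - 1)
        Get #intFile, lngPos, bytBuf

        blnDirty = False
        For i = 0 To lngThis - 1
            If bytBuf(i) = 10 Then
                bytBuf(i) = 13
                blnDirty = True
            End If
        Next i

        ' Write the chunk back to the same position only if it changed
        If blnDirty Then Put #intFile, lngPos, bytBuf
        lngPos = lngPos + lngThis
    Loop
    Close #intFile
End Sub
```

Because the routine modifies the source file, it is probably safest to try it on a copy first.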

As an aside, our Stamina library has (among its 450+ functions, written in ASM for VB and VBA) a function to switch the EOL delimiter in any size text file.  In this case the UNIX --> Mac option would do the trick.

--

    Jim Mack
    MicroDexterity Inc
    www.microdexterity.com
Author
21 Oct 2005 10:22 AM
GrandNagel
Neale wrote:
> I have a parsing speed issue that is causing me some problems.  I am parsing
>         strFileChar = Input(1, #sngTraceFileNum)

What method for opening the file?  Binary, or Input, or Random?

Native speed gains can be achieved by opening the file for Binary and 'Get'ting the data from the file into a
pre-allocated buffer. I tend to use a Byte() buffer, but a string can be used as well. Strings are heavy, while Bytes
are handled more like integers, which are almost as quick as using longs.

In my test run of the example below, it took 7 minutes to write a 2 gig file and 3.1 seconds to read it back in
(only the read code is shown). If you must convert the data to strings you'll see a dramatic increase in execution
time, so I'd recommend just working with the byte data directly if you can.

Option Explicit

' Count used in the original test run (described there as a 2.1 GB file)
Const HOW_MANY As Long = 83333333

Private Sub Form_Load()
    Dim bArray() As Byte
    Dim intFile As Integer
    Dim fn As String
    Dim i As Long               ' used by the string example further down

    fn = "c:\res1.gif"

    ' Write timing from the original test:
    '   Start 8341.963   End 8796.486   Difference 454.523
    ' a 2 gig file can be created and read back in in 7 +/- minutes

    ' This reads the file in...
    intFile = FreeFile(0)
    Open fn For Binary As #intFile
    ' Note: ReDim bArray(n) allocates n + 1 bytes (indexes 0 to n), so as
    ' written this Get reads about 83 MB from the start of the file
    ReDim bArray(HOW_MANY)
    Get #intFile, , bArray
    Close #intFile

    ' Read timing from the original test (83333333 length read):
    '   Start 9224.558   End 9227.741   Difference 3.183
    ' using this method I can read the data into a byte array in 3.2 seconds.

End Sub


You can work with the raw byte data, or spend the extra time converting it to a string.

i.e. for strings:
    Dim a$
    a$ = StrConv(bArray, vbUnicode)
    Dim AllLines$()
    AllLines = Split(a$, vbLf)
    For i = 0 To UBound(AllLines)
      Debug.Print AllLines(i)
    Next i

This is slow.  Using the CString class implementations that are floating around might speed things up a lot.

If you stick to the byte array, write a Do loop that looks for byte value 10 (the line feed) and, if you need to, keep
head and foot variables so you know where the line you're working with starts and ends.  It should be much faster, but
remember you're dealing with a single-byte ASCII representation of the characters in the file.  If it's a Unicode file
then some of this may need to change a bit because of the double-byte characters used in Unicode.

This still converts the chars from ASCII values to strings, but it shows how to deal with a byte array a little bit.

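    Dim spot As Long
    spot = 0

    ' Use <= so the last byte of the array is not skipped
    Do While spot <= UBound(bArray)
        If bArray(spot) = 10 Then
            ' Line feed found: end the current output line
            Debug.Print vbCrLf;
        Else
            Debug.Print Chr$(bArray(spot));
        End If

        spot = spot + 1
    Loop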
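
The head and foot idea mentioned above could be sketched roughly as follows, assuming a dimensioned byte array of single-byte text with LF line endings. FindLineEnd and WalkLines are hypothetical names, and the loop only counts lines to show where the per-line work would go.

```vb
Option Explicit

' Returns the index of the next LF (10) at or after lngHead,
' or -1 if there is none.
Public Function FindLineEnd(bytData() As Byte, ByVal lngHead As Long) As Long
    Dim i As Long

    FindLineEnd = -1
    For i = lngHead To UBound(bytData)
        If bytData(i) = 10 Then
            FindLineEnd = i
            Exit Function
        End If
    Next i
End Function

' Walks the array line by line without building any strings
Public Sub WalkLines(bytData() As Byte)
    Dim lngHead As Long         ' index of the first byte of the current line
    Dim lngFoot As Long         ' index of the LF that ends it
    Dim lngCount As Long

    lngHead = LBound(bytData)
    Do While lngHead <= UBound(bytData)
        lngFoot = FindLineEnd(bytData, lngHead)
        ' A last line with no trailing LF ends just past the array
        If lngFoot < 0 Then lngFoot = UBound(bytData) + 1

        ' The line occupies bytData(lngHead) to bytData(lngFoot - 1);
        ' work on those bytes directly here.
        lngCount = lngCount + 1

        lngHead = lngFoot + 1
    Loop
    Debug.Print lngCount & " lines"
End Sub
```

Keeping only the two indexes avoids any string conversion until a line is actually needed as text.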


anyway... you may know some or all of this already...  It's late and my buzz is wearing off...

hth,

D.