Parsing UNIX text files

I have a parsing speed issue that is causing me some problems. I am parsing plain text files in the multi-Gigabyte size range, and I am using the following code;

    Line Input #sngSourceFile, mvarTraceLine

Problem is that if the file I am parsing is created in UNIX, the vbCrLf is not picked up and therefore the line being read is not the line I expect.

Initially to solve this issue I have been reading lines char by char using the following code;

    Do While True
        strFileChar = Input(1, #sngTraceFileNum)
        If Asc(strFileChar) = 10 Then
            ' A line feed character has been found
            Exit Do
        Else
            ' Build up the line string
            strLineText = strLineText & strFileChar
        End If
    Loop

Of course, this takes about 10,000 years!!!

So my question is - is there any way to change the line delimiter used for "Line Input" so that it recognises the correct end-of-line?

Thanks for any suggestions.

> I have a parsing speed issue that is causing me some problems. I am parsing
> plain text files in the multi-Gigabyte size range, and I am using the
> following code;
>
> Line Input #sngSourceFile, mvarTraceLine
>
> Problem is that if the file I am parsing is created in UNIX, the vbCrLf is
> not picked up and therefore the line being read is not the line I expect.

UNIX uses the Line Feed character by itself as its newline delimiter (and as a point of information, Macs use the Carriage Return by itself) and, as you already know, VB (actually Windows) uses the combination of a Carriage Return followed by a Line Feed.

> So my question is - is there any way to change the line delimiter used for
> "Line Input" so that it recognises the correct end-of-line?

Not to my knowledge. Back in the mid-1980s when I worked in UNIX I seem to remember that our network server software took care of the translations of the newline character sequences whenever Windows retrieved a file across the network. I'm not really sure what to tell you for your problem though.
If this were a smaller file, you could read the entire file in using Binary read mode and Split on the vbLf character, but that won't work for a multi-Gig file. Perhaps you could read in the file in smaller, more manageable "chunks" (say 5 or 10 Megs) using a Binary read mode and Split these "chunks" on the vbLf character. Of course, you would have to handle split "fields" for those chunks (the majority of them, I would imagine) that didn't fall on a Line Feed boundary.

Rick

On Fri, 21 Oct 2005 03:58:39 -0400, "Rick Rothstein [MVP - Visual
Basic]" <rickNOSPAMnews@NOSPAMcomcast.net> wrote:

>> I have a parsing speed issue that is causing me some problems.
>> I am parsing plain text files in the multi-Gigabyte size range,
>> and I am using the following code;
>>
>> Line Input #sngSourceFile, mvarTraceLine
>>
>> Problem is that if the file I am parsing is created in UNIX, the
>> vbCrLf is not picked up and therefore the line being read is not
>> the line

Do a Google search for: "J French" cReadFileStream

I've posted, a few times, a VB class that reads a delimited file - just add a property to change the delimiter and you'll have a fast generic alternative to Line Input that works on whatever you want.

Also, watch out: if your files go over 2 GB then you are into using APIs for file access.

Thanks for your input Rick,
As a result of your information, I am looking at the possibility of my customer changing their UNIX settings to provide the CrLf that I require. At the end of the day, they are the ones complaining about speed, so if there is something they can do it's worth looking into.

Failing that, I suppose I can also read the file in large chunks into memory and process numerous lines from memory splitting on the Lf character. Another method I was considering is using a fixed length string to build up each line so that VB doesn't have to chuck out the old string each time I concatenate.

If anybody has any other suggestions, I would appreciate your input.

Neale wrote:
> Thanks for your input Rick,
>
> As a result of your information, I am looking at the possibility of my
> customer changing their unix settings to provide the CrLf that I
> require. At the end of the day, they are the ones complaining about
> speed, so if there is something they can do it's worth looking into.
>
> Failing that, I suppose I can also read the file in large chunks into
> memory and process numerous lines from memory splitting on the Lf
> character. Another method I was considering is using a fixed length
> string to build up each line so that VB doesn't have to chuck out the
> old string each time I concatenate.
>
> If anybody has any other suggestions, I would appreciate your input.

It can be much easier.

Using Get / Put with a byte array in binary mode, you can just replace all LF characters with CRs. This will not change the size of the file, so it's quite fast. Then VB's Line Input will work as expected -- VB does not require CRLF as its delimiter; it uses CR and ignores any following LF.

Ask if you need to see VB code to do the conversion (maybe 30 lines).

As an aside, our Stamina library has (among its 450+ functions, written in ASM for VB and VBA) a function to switch the EOL delimiter in any size text file. In this case the UNIX --> Mac option would do the trick.

Neale wrote:
> I have a parsing speed issue that is causing me some problems. I am parsing

What method for opening the file? Binary, or Input, or Random?

> strFileChar = Input(1, #sngTraceFileNum)

Native speed gains can be achieved by opening the file for Binary and 'Get'ting the data from the file into a pre-allocated buffer; I tend to use a Byte() buffer, but a string can be used as well... Strings are heavy and Bytes are handled more like integers which are almost as quick as using longs.

The example below takes 7 minutes to write a 2 gig file... Then, it takes 3.1 seconds to read it back in, and if you must convert it to strings you'll see a dramatic increase in execution time. I'd recommend just working with the byte data directly if you can.

    Option Explicit
    Const HOW_MANY = 83333333 ' 2.1 GB

    Private Sub Form_Load()
        Dim bArray() As Byte
        Dim t_from_file As String
        Dim intFile As Integer
        Dim fn As String
        Dim i As Long
        fn = "c:\res1.gif"
        'HOW_MANY = 83333333
        'Start 8341.963 End 8796.486 Difference 454.523000000001
        ' a 2 gig file can be created and read back in in 7 +/- minutes

        'This reads the file in...
        intFile = FreeFile(0)
        Open fn For Binary As #intFile
        ' note: ReDim bArray(83333333) allocates 83,333,334 bytes (about 83 MB),
        ' and Get fills exactly that many bytes from the file
        ReDim bArray(83333333)
        Get #intFile, , bArray
        Close #intFile
        ' 83333333 Length read
        ' HOW_MANY = 83333333
        ' Start 9224.558 End 9227.741 Difference 3.18299999999908
        ' using this method I can read the entire file into a byte array in 3.2 seconds.
    End Sub

You can deal with the raw byte data or deal with the waste of time to convert it to a string, i.e. for strings:

    Dim a$
    a$ = StrConv(bArray, vbUnicode)
    Dim AllLines$()
    AllLines = Split(a$, vbLf)
    For i = 0 To UBound(AllLines)
        Debug.Print AllLines(i)
    Next i

This is slow. Using the CString class implementations that are floating around might speed things up a lot.

If you stick to the byte array, make a do loop, look for Chr(10), and if you need to, keep head and foot variables so you know what constitutes a single line that you're working with.
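A minimal sketch of that head/foot loop, assuming the buffer holds single-byte text with LF line endings. BytesToString and ProcessLines are made-up names for illustration, and Debug.Print stands in for whatever the parser really does with each line:

```vb
Option Explicit

' Hypothetical helper: decode bytes first..last of buf as single-byte text.
Private Function BytesToString(buf() As Byte, ByVal first As Long, ByVal last As Long) As String
    Dim tmp() As Byte
    Dim i As Long

    If last < first Then Exit Function      ' empty line: returns ""
    ReDim tmp(0 To last - first)
    For i = first To last
        tmp(i - first) = buf(i)
    Next i
    BytesToString = StrConv(tmp, vbUnicode)
End Function

' Walk the buffer once. head marks the first byte of the current line,
' foot is the byte being examined.
Private Sub ProcessLines(buf() As Byte)
    Dim head As Long
    Dim foot As Long

    head = LBound(buf)
    For foot = LBound(buf) To UBound(buf)
        If buf(foot) = 10 Then              ' 10 = line feed, end of line
            Debug.Print BytesToString(buf, head, foot - 1)
            head = foot + 1
        End If
    Next foot

    ' Bytes after the last line feed form a final, unterminated line.
    If head <= UBound(buf) Then
        Debug.Print BytesToString(buf, head, UBound(buf))
    End If
End Sub
```

Only the bytes of one line are converted to a string at a time, so the cost of StrConv over the whole buffer is avoided when most lines can be skipped or inspected as raw bytes.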
The byte-array loop should be much faster, but remember you're dealing with a single-byte ASCII representation of the characters of the file. If it's a Unicode file then some of this may need to change a bit because of the double-byte characters used in Unicode.

This still converts the chars from ASCII value to a string but it shows how to deal with a byte array a little bit.

    Dim spot As Long
    spot = 0
    ' <= so that the last byte of the array is not skipped
    Do While spot <= UBound(bArray)
        If bArray(spot) = 10 Then
            Debug.Print vbCrLf;
        Else
            Debug.Print Chr$(bArray(spot));
        End If
        spot = spot + 1
    Loop

anyway... you may know some or all of this already... It's late and my buzz is wearing off...

hth,
D.
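For reference, the chunked read that Rick suggested and Neale was considering might look roughly like the sketch below. It is only an illustration under stated assumptions: ParseUnixFile, HandleLine and the 4 MB CHUNK_SIZE are invented for this example, the text is assumed to be single-byte, and because LOF returns a Long the sketch only applies to files under 2 GB (beyond that, as noted earlier in the thread, API file access is needed).

```vb
Option Explicit

Private Const CHUNK_SIZE As Long = 4194304   ' 4 MB per read; an arbitrary choice

' Hypothetical stand-in for the real per-line parsing.
Private Sub HandleLine(ByVal lineText As String)
    Debug.Print lineText
End Sub

Public Sub ParseUnixFile(ByVal path As String)
    Dim fileNum As Integer
    Dim remaining As Long
    Dim thisRead As Long
    Dim buf() As Byte
    Dim carry As String      ' partial line left over from the previous chunk
    Dim lines() As String
    Dim i As Long

    fileNum = FreeFile
    Open path For Binary Access Read As #fileNum
    remaining = LOF(fileNum)                 ' assumption: file is under 2 GB

    Do While remaining > 0
        thisRead = CHUNK_SIZE
        If remaining < thisRead Then thisRead = remaining

        ReDim buf(0 To thisRead - 1)
        Get #fileNum, , buf                  ' reads exactly thisRead bytes
        remaining = remaining - thisRead

        ' Prepend the leftover from the last chunk, then split on LF.
        lines = Split(carry & StrConv(buf, vbUnicode), vbLf)

        ' Every element except the last is a complete line.
        For i = 0 To UBound(lines) - 1
            HandleLine lines(i)
        Next i
        carry = lines(UBound(lines))         ' "" if the chunk ended on an LF
    Loop
    Close #fileNum

    ' A file that does not end with LF leaves one last line in carry.
    If Len(carry) > 0 Then HandleLine carry
End Sub
```

The carry string handles the lines Rick mentioned that straddle a chunk boundary. One concatenation per chunk is cheap compared with one per character.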