Efficient way to parse large files

I use VB6 to parse large text files. Currently I open locally saved text files with:

    open FileName for input as #1

I then read line by line and parse parts of the line:

    line input #1, temp
    "parse temp for what I need"

The parsed output is stored in a variable until the entire input file is read:

    whole = whole + ParsedText + vbcrlf

Then the output is written to a file:

    open OutFile for output as #2
    print #2, whole

The problem I am running into is the input files are getting larger {250 megs} and the processing time is getting exponentially larger.

Is the above way for parsing a file efficient or is there a more efficient way to handle parsing these large files? The code is relatively simple and works well on smaller files, but the large files are killing me. Any tips on how to optimize it?

Thanks

You may gain some speed by reading the entire file into memory

    Public Function ReadFileBinary(ByVal sFilePath As String) As String
        Dim hFile As Long
        hFile = FreeFile
        Open sFilePath For Binary As #hFile
        ReadFileBinary = String$(LOF(hFile), Chr$(0))
        Get #hFile, , ReadFileBinary
        Close #hFile
    End Function

Usage:

    Dim Txt As String
    Txt = ReadFileBinary("c:\somepath\somefile.txt")

Then you can use the Split function to get an array to parse

    Dim Arr() As String
    Dim i As Long
    Arr = Split(Txt, vbNewLine)
    For i = 0 to UBound(Arr)
        'do your parsing
    Next

BTW,
> whole = whole + ParsedText + vbcrlf
should be changed to
whole = whole & ParsedText & vbcrlf

I have been using VB since version 3 and have always avoided reading files as binary, not sure why. I tried your suggestion and there is a speed difference. I did a test on the same file {3 megs & 13000 lines}. My inefficient way took 3.5 minutes and your suggestion took just over 2 minutes. I am not sure what it will do on the larger files, but it has to be better than I am doing now.

I will keep trying to find the best way, but thank you for this. It is a big step in the right direction.

vm

"Norm Cook" wrote:
> You may gain some speed by reading the entire file into memory
<snip>

"vm" <v*@discussions.microsoft.com> wrote
> I have been using VB since version 3 and have always avoided reading files as
> binary, not sure why. I tried your suggestion and there is a speed
> difference. I did a test on the same file {3 megs & 13000 lines}. My
> inefficient way took 3.5 minutes and your suggestion took just over 2
> minutes. I am not sure what it will do on the larger files, but it has to be
> better than I am doing now.
>
> I will keep trying to find the best way, but thank you for this. It is a big
> step in the right direction.

You might want to share some of that parsing code. It may be that a significant increase can be found there as well. As you know, even a small savings can add up when multiplied by some large number of iterations. You may already have a fast algorithm, but it can't hurt to get another set of eyes on the problem (in this case you get several sets in one go... ;-)

LFS

"vm" <v*@discussions.microsoft.com> wrote in message
news:280C39A1-79F0-45D3-B906-804D0458C4F5@microsoft.com...
> I will keep trying to find the best way, but thank you for this.
> [twice the speed] It is a big step in the right direction.

Have you not tried my suggestion of using the Mid$ statement (not the Mid$ function) on a long string (rather than using repeated string concatenation)? That should give speed increases orders of magnitude better than you are getting now! Have you tried it?

Mike

I am testing the Mid$ suggestion you made. I have not done it that way before. I will have a separate app doing it that way to benchmark the speed difference.

Thanks
vm

"Mike Williams" wrote:
> Have you not tried my suggestion of using the Mid$ statement (not the Mid$
> function) on a long string (rather than using repeated string
> concatenation)?
<snip>

"vm" <v*@discussions.microsoft.com> wrote in message
news:BA78CE8D-66F0-43FB-B208-58C087003F55@microsoft.com...
> The parsed output is stored in a variable until the entire input file is
> read
> whole = whole + ParsedText + vbcrlf

You haven't said what it is you're looking for in the file so it's hard to give a definite answer, but it may be faster if you open the file as Binary and use Get to read and digest large chunks of it as many times as is required.

However, one thing you are currently doing that is *definitely* slowing your code down is your repeated concatenation of the output string. That is *very* slow, especially as the string grows in length, because in order to add each substring to the main string VB has to create another string entirely and then throw the original away. It would be far quicker to initially create a very long "output string" and then maintain a simple "pointer" telling you where to insert the next substring.
Something like:

    Dim whole As String, ParsedText As String
    Dim p1 As Long, p2 As Long
    whole = Space$(500000)   ' initialise output string
    p1 = 1                   ' initialise pointer

    ' then each time you want to add substring you could do:
    p2 = Len(ParsedText)
    ' grow the buffer when the substring plus vbCrLf no longer fits
    If (Len(whole) - p1 + 1) < (p2 + 2) Then
        whole = whole & Space$(500000)
    End If
    Mid$(whole, p1) = ParsedText     ' this is very fast
    Mid$(whole, p1 + p2) = vbCrLf    ' and so is this
    p1 = p1 + p2 + 2

Then when you have finished you trim the output string to its correct size in accordance with the value held in p1 (for example, whole = Left$(whole, p1 - 1)).

Mike

On Tue, 18 Oct 2005 05:43:05 -0700, "vm"
<v*@discussions.microsoft.com> wrote:

>I use VB6 to parse large text files. Currently I open locally saved text
>files with:

Never use #1 - always use FreeFile

<snip>

>Is the above way for parsing a file efficient or is there a more efficient
>way to handle parsing these large files. The code is relativily simple and
>works well on smaller files, but the large files are killing me. Any tips on
>how to optimize it?

Here is a VB Class that will do what you want - and rather faster

Disk access is slow.

    VERSION 1.0 CLASS
    BEGIN
      MultiUse = -1  'True
    END
    Attribute VB_Name = "cReadFileStream"
    Attribute VB_GlobalNameSpace = False
    Attribute VB_Creatable = True
    Attribute VB_PredeclaredId = False
    Attribute VB_Exposed = False
    Option Explicit

    ' 2/8/01 JF
    ' 3/8/01 JF - Block Read Added - watch for Block > File Size
    '
    Private Type TCMN
        FileName As String
        FileSize As Long
        Delin As String
        Buffer As String
        BufferLen As Long
        BufferPos As Long
        BytesDone As Long
        EofFlag As Boolean
        Channel As Integer
    End Type

    Private cmn As TCMN

    ' ---
    Private Sub Class_Initialize()
        cmn.Delin = vbCrLf
        cmn.BufferLen = 100000
    End Sub

    ' ---
    Public Function Create(FileName$) As Boolean
        cmn.FileName = FileName
        Create = False
        cmn.Buffer = ""
        cmn.Channel = 0
        cmn.EofFlag = False
        cmn.BufferPos = 1
        cmn.BytesDone = 0
        ' ---
        If FileExists(FileName$) = False Then
            MsgBox "cReadFileStream: " + FileName$ _
                   + " File not Found"
            Exit Function
        End If
        ' ---
        If FileExists(FileName$) Then
            cmn.FileSize = FileLen(cmn.FileName)
            cmn.Channel = FreeFile
            Open FileName For Binary Access Read As #cmn.Channel
            Create = True
        End If
    End Function

    ' ---
    Public Function ReadDelineatedLine() As String
        Dim Q&, L&
        If cmn.Channel = 0 Then
            MsgBox "cReadFileStream - ReadLine - but file not Open"
            cmn.EofFlag = True
            Exit Function
        End If
        ' ---
        If cmn.EofFlag Then
            MsgBox "cReadFileStream - Read Past End of File"
            Exit Function
        End If
        ' ---
        If InStr(cmn.BufferPos, cmn.Buffer, cmn.Delin) = 0 Then
            Call LS_FillBuffer
            ' --- When File completely Read then append Delin if Needed
            If cmn.BytesDone = cmn.FileSize Then
                If Right$(cmn.Buffer, Len(cmn.Delin)) <> cmn.Delin Then
                    cmn.Buffer = cmn.Buffer + cmn.Delin
                End If
            End If
        End If
        ' ---
        Q = InStr(cmn.BufferPos, cmn.Buffer, cmn.Delin)
        If Q Then
            L = Q - cmn.BufferPos
            ReadDelineatedLine = Mid$(cmn.Buffer, cmn.BufferPos, L)
            cmn.BufferPos = Q + Len(cmn.Delin)
        End If
        If Q = 0 Then
            MsgBox "cReadFileStream - Read - Unexpected Error" _
                   + vbCrLf + "Delineator not Found"
        End If
        ' --- Was this the last Field of the Last Buffer
        If cmn.BytesDone >= cmn.FileSize Then
            If Q >= Len(cmn.Buffer) - Len(cmn.Delin) Then
                cmn.EofFlag = True
            End If
        End If
    End Function

    ' ---
    Public Sub ReadBlock(Block$)
        Dim BlockLen&, Q&
        If cmn.Channel = 0 Then
            MsgBox "cReadFileStream - ReadBlock - but file not Open"
            cmn.EofFlag = True
            Exit Sub
        End If
        ' ---
        If cmn.EofFlag Then
            MsgBox "cReadFileStream - Read Past End of File"
            Exit Sub
        End If
        ' ---
        BlockLen& = Len(Block$)
        ' --- Do we need to fill the Buffer
        If (cmn.BufferPos + BlockLen) > Len(cmn.Buffer) Then
            If BlockLen > cmn.BufferLen Then ' increase buffer size
                cmn.BufferLen = cmn.BufferPos + BlockLen
            End If
            Call LS_FillBuffer
        End If
        ' --- If insufficient Data left
        Q = Len(cmn.Buffer$) - cmn.BufferPos + 1 ' Bytes Left
        If BlockLen > Q Then
            Block$ = Space$(Q)
            BlockLen = Q
        End If
        ' --- Copy the data
        Mid$(Block$, 1, BlockLen) = Mid$(cmn.Buffer$, cmn.BufferPos, BlockLen)
        cmn.BufferPos = cmn.BufferPos + BlockLen
        ' --- Was this the last Field of the Last Buffer
        If cmn.BytesDone >= cmn.FileSize Then
            If cmn.BufferPos > Len(cmn.Buffer$) Then
                cmn.EofFlag = True
            End If
        End If
    End Sub

    ' ---
    Public Function EofFlag() As Boolean
        EofFlag = cmn.EofFlag
    End Function

    ' ---
    Public Function Size() As Long
        Size = cmn.FileSize
    End Function

    ' ---
    Public Sub Free()
        If cmn.Channel <> 0 Then
            Close #cmn.Channel
            cmn.Channel = 0
        End If
    End Sub

    ' ---
    Private Sub LS_FillBuffer()
        Dim Hold$, Q&
        ' --- First time in cmn.Buffer = ""
        Hold$ = Mid$(cmn.Buffer, cmn.BufferPos)
        If cmn.BytesDone >= cmn.FileSize Then
            Exit Sub
        End If
        ' ---
        If Len(cmn.Buffer) < cmn.BufferLen Then
            cmn.Buffer = Space$(cmn.BufferLen)
        End If
        ' --- Reduce Buffer Size at End of File
        Q = cmn.FileSize - cmn.BytesDone
        If Q < Len(cmn.Buffer) Then
            cmn.Buffer = Space$(Q)
        End If
        ' --- Read a Chunk
        Get #cmn.Channel, cmn.BytesDone + 1, cmn.Buffer
        cmn.BytesDone = cmn.BytesDone + Len(cmn.Buffer)
        ' --- Add leftover chunk if needed
        If Len(Hold$) Then
            cmn.Buffer = Hold + cmn.Buffer
        End If
        ' ---
        cmn.BufferPos = 1
    End Sub

    Private Sub Class_Terminate()
        Me.Free
    End Sub

    '
    ' Support Routines
    '
    Function FileExists(Fle$) As Boolean
        Dim Q%
        On Error Resume Next
        Q = GetAttr(Fle$)
        If Err = 0 Then
            If (Q And vbDirectory) = 0 Then
                FileExists = True
            End If
        End If
        Err.Clear
    End Function

> I use VB6 to parse large text files.

Can you describe your parsing operation to us? Are you simply replacing text for text or is there more to it than that?

Rick

I am actually going through large log files and extracting what I need. I can
go from 250 megs and parse it down to about 10 megs of relevant data. The actual parsing uses instr, right$, & left$ to pull specific bits of info. Each line that contains a target string is then parsed into specific fields for that row {record}. This project kind of took off from a small "good idea" and because of the file size is in need of optimization.

"Rick Rothstein [MVP - Visual Basic]" wrote:
> Can you describe your parsing operation to us? Are you simply
> replacing text for text or is there more to it than that?

"vm" <v*@discussions.microsoft.com> wrote in message
news:7165EC72-34AC-4AB3-8A8B-19131181298B@microsoft.com...
> I am actually going through large log files and extracting what I need. I can
> go from 250 megs and parse it down to about 10 megs of relevant data. The
> actual parsing uses instr, right$, & left$ to pull specific bits of info.
> Each line that contains a target string is then parsed into specific fields
> for that row {record}. This project kind of took off from a small "good idea"
> and because of the file size is in need of optimization.

Can I suggest you simply open both files, one for input and one for output, and proceed as before, only write out the parsed text immediately, instead of saving it to a string.

(adjusted repost of your initial algorithm: )

Was:

    open FileName for input as #1
    line input #1, temp
    "parse temp for what I need"
    whole = whole + ParsedText + vbcrlf
    Then the output is written to a file:
    open OutFile for output as #2
    print #2, whole

Try:

    open FileName for input as #1
    open OutFile for output as #2
    line input #1, temp
    "parse temp for what I need"
    print #2, ParsedText

It's a minimum amount of change from what you have and should offer some savings in speed. Do take note of Rick's post about using FreeFile.
It actually should look more like:

    FileIn = FreeFile
    Open InputFile For Input As FileIn Len = 4096
    FileOut = FreeFile
    Open OutputFile For Output As FileOut Len = 16384
    Do While Not EOF(FileIn)
        Line Input #FileIn, temp
        Print #FileOut, ParsedText(temp)
    Loop
    Close

(Where ParsedText is a function that accepts the input lines and returns the desired output from that text....)

HTH
LFS

"Larry Serflaten" <serfla***@usinternet.com> wrote
> ... Do take note of Rick's post about
> using FreeFile.

My mistake, it was J French's post.... Sorry for the confusion!

LFS

Some years ago I wrote a tip of the month that I call 'Using ADO to Read and
Parse a Text File'. Link to http://www.buygold.net/tips then look for the April 2002 tip of the month. A sample program is provided.

From the tip's intro:

ADO [ActiveX Data Objects] can be used to read a variety of file formats. I have used it to read and parse a CSV and fixed width formatted text file. ADO reads and parses the file into a recordset.

When I first needed to parse a text file I tried various routines - some provided via other users - and none seemed satisfactory. So I did some research and discovered that ADO will perform the task. I've revisited the code, wrote a demo program and made it the April 2002 tip-of-the-month.

You have to read a block of text, say 4096 bytes, then divide it into lines.
You have to account for small files and the last block, because it is usually less than 4096 bytes. You also have to account for partial lines at the end of a block and join them with the next block.

Try this untested routine. Put your parsing code in the ProcessLine function. It
reads 32768 bytes at a time and calls ProcessLine for each line it encounters. This routine does not work correctly if another process expands or truncates the file while you are parsing it.

    Public Function ProcessFile(ByRef FileName As String) As Long
        Const BLOCK_SIZE As Long = 32768
        Dim f As Integer
        Dim BytesLeft As Long   ' Bytes of the file not yet read
        Dim LineStart As Long   ' Start of the current line in Work
        Dim EolPos As Long      ' Position of the next CrLf in Work
        Dim Buffer As String    ' Raw block read from disk
        Dim Work As String      ' Leftover partial line plus the new block
        Dim Pending As String   ' Partial line carried over to the next block

        f = FreeFile
        Open FileName For Binary Access Read As #f
        BytesLeft = LOF(f)

        Do While BytesLeft > 0
            ' Size the buffer; the last or only block may be shorter
            If BytesLeft < BLOCK_SIZE Then
                Buffer = String$(BytesLeft, 0)
            ElseIf Len(Buffer) <> BLOCK_SIZE Then
                Buffer = String$(BLOCK_SIZE, 0)
            End If
            Get #f, , Buffer
            BytesLeft = BytesLeft - Len(Buffer)

            ' Rejoin a line (or a CrLf) that was split across two blocks
            Work = Pending & Buffer
            LineStart = 1
            EolPos = InStr(LineStart, Work, vbCrLf, vbBinaryCompare)
            Do While EolPos <> 0
                ' Process the line
                ProcessLine Mid$(Work, LineStart, EolPos - LineStart)
                LineStart = EolPos + 2
                EolPos = InStr(LineStart, Work, vbCrLf, vbBinaryCompare)
            Loop
            ' Partial line found at the end of the block
            Pending = Mid$(Work, LineStart)
        Loop

        If Pending <> "" Then
            ' Last line in the file did not have CrLf
            ProcessLine Pending
        End If
        Close #f
    End Function

    Private Function ProcessLine(ByRef sLine As String) As Long
    End Function

vm wrote:
> The parsed output is stored in a variable until the entire input file
> is read whole = whole + ParsedText + vbcrlf

Mike Williams probably nailed it. Concatenation is a killer. Especially building up to a 10Mb file! (I think I saw you say that, somewhere.) Anyway, you need to either adopt his Mid$ suggestion, or take it to the next level and use something like http://vb.mvps.org/samples/StrBldr
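
For readers who want to see what the "next level" of the Mid$ suggestion might look like, here is a minimal sketch of a growable output buffer as a VB6 standard module. It is not the code at the link above and has not been benchmarked against it. The names BufferReset, BufferAppend and BufferText are hypothetical, and the doubling growth policy is an assumption chosen so that the buffer is reallocated only a handful of times.

```vb
Option Explicit

' Hypothetical helper module (not from the thread or the linked sample):
' a preallocated string that is filled in place with the Mid$ statement.
Private mBuf As String   ' preallocated storage
Private mLen As Long     ' number of characters actually used

' Start a new, empty buffer with the given capacity (assumed default).
Public Sub BufferReset(Optional ByVal InitialSize As Long = 65536)
    If InitialSize < 1 Then InitialSize = 1
    mBuf = Space$(InitialSize)
    mLen = 0
End Sub

' Copy Text to the end of the used part, growing the storage if needed.
Public Sub BufferAppend(ByRef Text As String)
    Dim Needed As Long
    Dim NewSize As Long

    If Len(Text) = 0 Then Exit Sub
    Needed = mLen + Len(Text)

    If Needed > Len(mBuf) Then
        ' Double the capacity until the new text fits
        NewSize = Len(mBuf)
        If NewSize < 1 Then NewSize = 1
        Do While NewSize < Needed
            NewSize = NewSize * 2
        Loop
        mBuf = mBuf & Space$(NewSize - Len(mBuf))
    End If

    ' The Mid$ statement overwrites characters inside the existing string
    Mid$(mBuf, mLen + 1, Len(Text)) = Text
    mLen = Needed
End Sub

' Return only the used part of the buffer.
Public Function BufferText() As String
    BufferText = Left$(mBuf, mLen)
End Function
```

In the original loop this would replace whole = whole + ParsedText + vbcrlf with BufferAppend ParsedText & vbCrLf, followed by a single Print #FileOut, BufferText; once the input has been read. For a 250 meg input it may still be simpler to write each parsed line straight to the output file, as suggested earlier in the thread.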