Convert HTML to text

"Tomasz Klim" <tklim***@komputronik.pl> wrote:

Hi,

I'm looking for a converter from HTML to text. Not a simple one, and not just a regular expression, but one that can also convert complicated pages with embedded styles, dynamic content, etc.

Does anybody know of such a tool?

I know that Internet Explorer can load a web page and dump it in both HTML and text format, but I need something faster that doesn't need to download the page, just convert it. Or maybe you know how to use IE for this?

Thanks in advance!

"Veign" <NOSPAMinveign@veign.com> wrote:

HTML is text. You rule out regular expressions, but in your case they may be the best, fastest, and easiest way to strip all the HTML tags out of a web page.

Also, you say "without downloading a page". That's how the web works: you are not viewing the contents of the page on the server but a local copy that has been downloaded to your system from the server. There is no way around it.

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--

"Tomasz Klim" <tklim***@komputronik.pl> wrote:

"Veign" <NOSPAMinveign@veign.com> wrote in message news:elYTz1hTGHA.2276@tk2msftngp13.phx.gbl...
> HTML is Text. You rule out regular expressions but in your case that may
> be the best, fastest, and easiest way to extract out all HTML tags from a
> webpage.

Yes, I know, but when the page contains "advanced" content like embedded CSS or even JS, regular expressions won't remove it. Furthermore, I want to preserve all links from the original HTML code.

> Also, you say without downloading a page. That's how the web works. You
> are not viewing the contents of the page on the server but a local copy
> that has been downloaded to your system from the server - no way around
> it.

You misunderstood me. I was trying to say that I already have the page on my local disk.

So?

Tomasz
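For a page that is already on disk, one option that goes beyond tag-stripping regular expressions while still keeping the links is a real HTML parser. The following is only a minimal sketch in Python (not VB, purely for illustration) using the standard `html.parser` module. The names `TextExtractor` and `html_to_text` are made up for this example, and the sketch does not execute JavaScript, so it does not cover the "dynamic content" part of the request.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skips <script>/<style>, and keeps link targets."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # pieces of output text, in document order
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None     # target of the <a> element currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <STYLE> arrives as "style".
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "a" and self.href:
            # Emit the link target right after the link text.
            self.parts.append(f"<{self.href}>")
            self.href = None

    def handle_data(self, data):
        # Ignore anything inside script/style and whitespace-only runs.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))


def html_to_text(html):
    """Returns the visible text of an HTML string with link URLs in <...>."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

To convert a saved page, read the file into a string first (for example with `open(path, encoding="utf-8").read()`, where the path and encoding depend on the file) and pass it to `html_to_text`.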
Go to Google and search the newsgroups with this:

html, text, larry serflaten

George

"Jim Carlock" <anonymous@localhost> wrote:

(1) Web browsers work by sending GET commands to a web server.
(2) You can use a Winsock control to do the same thing.
(3) You only want the initial HTML document and not the underlying pictures, so you then parse the downloaded HTML and possibly grab the appropriate CSS by issuing a GET for those as well. You skip sending out GET commands for all the unneeded things.

http://vbip.com/winsock/winsock_http_01.asp
http://vbip.com/winsock/winsock_http_08_01.asp

There's more information at http://vbip.com/winsock/; scroll down to "WWW and HyperText Transfer Protocol" when you get there.

It's not all that hard to create your own simple web browser that grabs text and perhaps CSS documents for layout.

Hope this helps.

Jim Carlock
Post replies to the group.

"Veign" <NOSPAMinveign@veign.com> wrote:

Not sure he wants to create a web browser. It seems more like an application to extract data only from a page.

"Jim Carlock" <anonymous@localhost> wrote:

> Not sure he wants to create a web-browser. Seems more like
> an application to extract data only from a page.

Did I suggest creating a web browser? VB can do this without the use of the HTML controls. It would take a little more effort to get JavaScript going and other such things, but does he care about the JavaScript? Or does he just want the HTML and the <a> tags inside the HTML?

Parsing through the CSS code could get quite involved as far as fonts and such go, and that might make things more complicated. A rich-textbox control could help to get better rendering in that case, if rendering is what is wanted. <shrug>

It sounds like he wants to extract links and other specific items from the text. Perhaps he's connecting to websites to extract email addresses and spam us all? Uh-oh. How'd that pop into my head?

Hmmm, maybe we're providing too much help here. <g>

Perhaps he just wants some kind of tool rather than coding it himself?

Hope this helps.
Jim Carlock
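The three steps Jim Carlock lists (send a GET, read back the HTML, skip the pictures) can be sketched roughly as follows. This is an illustration only: Python's `socket` module stands in for the VB Winsock control, the function names are invented for the example, and it assumes plain HTTP/1.0 on port 80 with no redirects, TLS, or chunked encoding.

```python
import socket


def build_get_request(host, path="/"):
    """Builds the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Splits a raw HTTP response into (status_line, header_dict, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return header_lines[0], headers, body


def fetch(host, path="/", port=80, timeout=10.0):
    """Sends one GET and returns the parsed response; images are never requested."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: response complete
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Only the one document is requested; fetching a linked stylesheet would be a second call to `fetch` with that stylesheet's path, and nothing is requested for images unless the caller chooses to.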
"Veign" <NOSPAMinveign@veign.com> wrote:

I took this statement:
"It's not all that hard to create your own simple web-browser, which grabs text and perhaps css documents for layout."

All good. Sometimes it's hard to answer questions when you get only the question, out of the context of its use...

"Jim Carlock" <anonymous@localhost> wrote:

> I took this statement:
> "It's not all that hard to create your own simple web-browser,
> which grabs text and perhaps css documents for layout."

<g> Yeah, I was throwing it out as a suggestion and going to wait and see what kind of reply came back. Kind of like baiting a fish. We probably won't hear from him until after 5pm EST.

I tried the GetStuff item in the link you provided.

http://www.veign.com/vrc_codeview.asp?type=app&id=130

I discovered some very interesting things with it, in regard to firewalls.

(1) Microsoft's firewall with XP SP 2 allows it to connect.
(2) An early (free) version of ZoneAlarm (version 2.6.88) allows it to connect and only presents a question when I go to download the image files; it connects and bypasses the firewall 100% in the initial HTML reading.
(3) Kerio version 2.15 behaves in the same manner as the ZoneAlarm product, allowing the initial connection without questioning it, seeing the connection as being made via Internet Explorer rather than your application.

Looks like it could help lock down firewall software. That's a neat little app there. Thanks, Chris.

You mind if I pass the link along to some other folks to check out some other firewalls with it?

Jim Carlock
Post replies to the group.

"Veign" <NOSPAMinveign@veign.com> wrote:

> You mind if I pass the link along to some other folks to check
> out some other firewalls with it?

Go for it. Feel free to compile it and distribute at will....

"Tomasz Klim" <tklim***@komputronik.pl> wrote:

"Jim Carlock" <anonymous@localhost> wrote in message news:uAT0UBqTGHA.5900@tk2msftngp13.phx.gbl...
> Did I suggest creating a web-browser ? VB can do this without the
> use of the HTML controls. It would take alittle more effort to get
> javascript going and other such things, but does he care about the
> javascript? Or does he just want the HTML and the <a> tags inside
> the HTML.

Yes, I care about JavaScript and other techniques. So regular expressions aren't good for me.

> A rich-textbox control could help to get better rendering in that case,
> if rendering was what was wanted. <shrug>

Yes, I want to render the page, evaluate all JavaScript, etc., and then convert it to text.

> It sounds like he wants to extract links and other specific items
> from the text. Perhaps he's connecting to websites to extract email
> addresses and spam us all? Ugh oh. How'd that pop into my head?

If I only wanted to extract email addresses, I would strip the \r \n \t " ' characters out of the HTML page, replace the < > [ ] { } ; characters with a single space, split everything by space, and check whether each element contains http:// https:// ftp:// mailto: ^www.(.*) ^ftp.(.*) ^href=(.*) or is a valid email address. That's a very simple example... So, as you see, I don't need such a tool for this.
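The token-scanning approach Tomasz describes can be written down in a few lines. The sketch below follows his steps literally; the function name `find_candidates` is invented for the example, the dots in his `^www.` and `^ftp.` patterns are escaped here, and the email pattern is a deliberately loose assumption, since the post does not say what counts as a "valid email address".

```python
import re

# Characters removed outright, and characters replaced by a single space.
STRIP_CHARS = str.maketrans("", "", "\r\n\t\"'")
SPACE_CHARS = str.maketrans({c: " " for c in "<>[]{};"})

URL_MARKERS = ("http://", "https://", "ftp://", "mailto:")
PREFIX_PATTERN = re.compile(r"^(www\.|ftp\.|href=)(.*)")
# Assumption: a loose address check, not a full RFC-compliant validator.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def find_candidates(html):
    """Returns the tokens that look like links or email addresses."""
    cleaned = html.translate(STRIP_CHARS).translate(SPACE_CHARS)
    hits = []
    for token in cleaned.split():
        if (any(marker in token for marker in URL_MARKERS)
                or PREFIX_PATTERN.match(token)
                or EMAIL_PATTERN.match(token)):
            hits.append(token)
    return hits
```

Because line breaks are removed rather than replaced by spaces, text on either side of a line break can run together into one token, which is a consequence of the steps as described rather than of this sketch.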
"Veign" <NOSPAMinveign@veign.com> wrote:

> Yes, I know, but when the page contains "advanced" content, like embedded
> CSS, or even JS, regular expressions won't remove it. Furthermore, I want
> to preserve all links from the original html code.

Here's an option:
http://www.veign.com/vrc_codeview.asp?type=app&id=130

I have used the above in conjunction with regular expressions to parse out all the data from a web page.

As far as getting around embedded CSS goes, that's pretty easy:
1) It will either be contained within a <STYLE> tag, so you can throw out the tag and everything inside it, or
2) It will be contained within a Style attribute of a tag, so when you parse out the tag you can throw out this attribute, or all attributes, and keep only the data between tags.
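The two CSS cases Veign describes can be illustrated with regular expressions along these lines. This is a rough sketch in Python rather than VB, with invented names (`strip_css`, `strip_tags`); it is not the code behind the linked GetStuff application, and patterns like these are fragile on malformed markup or on attribute values that contain a `>` character.

```python
import re

# Case 1: a <style> element and everything between its tags.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
# Case 2: a style="..." attribute (double-quoted, single-quoted, or bare).
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)
# Any tag at all, used to drop markup and keep only the data between tags.
TAG = re.compile(r"<[^>]+>")


def strip_css(html):
    """Removes <style> blocks and inline style attributes, keeping other markup."""
    without_blocks = STYLE_BLOCK.sub("", html)
    # Apply the attribute pattern only inside tags, not to body text.
    return TAG.sub(lambda m: STYLE_ATTR.sub("", m.group(0)), without_blocks)


def strip_tags(html):
    """Removes CSS first, then every remaining tag, and tidies whitespace."""
    return " ".join(TAG.sub(" ", strip_css(html)).split())
```

`strip_css` corresponds to throwing out only the style information, while `strip_tags` corresponds to throwing out all attributes along with the tags. Neither touches `<script>` content or evaluates JavaScript, so they do not address the rendering requirement raised earlier in the thread.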