Author
22 Mar 2006 10:56 PM
Tomasz Klim
Hi

I'm looking for an HTML-to-text converter. Not a simple one, and not just a
regular expression, but one that can also convert complicated pages with
embedded styles, dynamic content, etc.

Does anybody know of such a tool?

I know that Internet Explorer can load a web page and save it in either
HTML or text format, but I need something faster, and without the need to
download the page - just converting... Or maybe you know how to use IE
for this?

Thanks in advance!

Author
23 Mar 2006 1:55 AM
Veign
HTML is text.  You rule out regular expressions, but in your case that may
be the best, fastest, and easiest way to strip all HTML tags from a
webpage.
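
A minimal sketch of that regular-expression approach, written in Python
rather than VB purely for illustration (the name strip_tags is made up
for this example). It removes anything that looks like a tag and decodes
entities; note that it leaves the contents of <script> and <style>
blocks in the output, which is the limitation discussed later in the
thread.

```python
import html
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Replace tags with spaces, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", markup)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(strip_tags("<p>Hello, <b>world</b> &amp; friends</p>"))
```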

Also, you say without downloading a page.  That's how the web works.  You
are not viewing the contents of the page on the server but a local copy that
has been downloaded to your system from the server - no way around it.

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--


Author
23 Mar 2006 8:32 AM
Tomasz Klim
> HTML is Text.   You rule out regular expressions but in your case that may
> be the best, fastest, and easiest way to extract out all HTML tags from a
> webpage.

Yes, I know, but when the page contains "advanced" content, like embedded
CSS, or even JS, regular expressions won't remove it. Furthermore, I want to
preserve all links from the original html code.
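
One way to get past the tag-stripping limitation is to use a real HTML
parser instead of a regular expression. The sketch below uses Python's
standard html.parser module (an assumption; nobody in the thread uses
Python) to skip <script> and <style> bodies and to print each link
target in brackets after its text. The class and function names are
invented for this example, and it does not execute any JavaScript.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect visible text, skip script/style bodies, keep link targets."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of output text, in document order
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.href_stack = []   # href of each currently open <a>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f"[{href}]")

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup: str) -> str:
    parser = TextWithLinks()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = ('<html><head><style>p {color: red}</style></head><body>'
            '<p>See <a href="http://example.com/">the site</a> now.</p>'
            '</body></html>')
    print(html_to_text(page))
```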

> Also, you say without downloading a page.  That's how the web works.  You
> are not viewing the contents of the page on the server but a local copy
> that has been downloaded to your system from the server - no way around
> it.

You misunderstood me. What I meant was that I already have the page on my
local disk.

So?



Author
23 Mar 2006 12:10 PM
George Bashore
Tomasz,
Go to Google and search the newsgroups for this:
html, text, larry serflaten
George


Author
23 Mar 2006 4:10 PM
Jim Carlock
"Tomasz Klim" <tklim***@komputronik.pl> wrote:
> I'm looking for a converter from HTML to text. But not a simple one, or
> even a regular expression, but one that can convert also complicated
> pages, with embedded styles, dynamic content etc.
>
> Does anybody know such tool?
>
> I know, that Internet Explorer can load a web page, and dump it in both
> html/text format - but I need something faster, and without need to
> downloading the page - just converting... Maybe do you know, how to use
> IE for it?

(1) Web-browsers work by sending GET commands to a webserver.
(2) You can use a winsock control to do the same thing.
(3) You only want the initial html document and not the underlying
pictures, so you then parse the downloaded html and possibly grab
the appropriate CSS by issuing a GET for those as well. You skip
sending out GET commands for all the un-needed things.
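
The steps above can be sketched roughly as follows, using a plain
socket in Python in place of the VB winsock control (that substitution
and all function names are assumptions made for this example). It sends
an HTTP/1.0 GET so the server should not reply with chunked encoding,
reads until the connection closes, and splits the status line from the
body. Redirects, HTTPS, and character sets are deliberately left out.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build the raw bytes of a minimal GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",   # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw: bytes):
    """Return (status_line, body) from a raw HTTP response."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line, body


def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0):
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling fetch("example.com") would return the status line and the HTML
body only; images and other linked resources are never requested unless
you issue further GETs for them yourself.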

http://vbip.com/winsock/winsock_http_01.asp
http://vbip.com/winsock/winsock_http_08_01.asp

There's more information at http://vbip.com/winsock/; scroll down to
"WWW and HyperText Transfer Protocol" when you get there.

It's not all that hard to create your own simple web-browser, which
grabs text and perhaps css documents for layout.

Hope this helps.

Jim Carlock
Post replies to the group.
Author
23 Mar 2006 4:23 PM
Veign
Not sure he wants to create a web-browser.  Seems more like an application
to extract data only from a page.

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--


Author
23 Mar 2006 5:34 PM
Jim Carlock
"Veign" <NOSPAMinveign@veign.com> wrote:
> Not sure he wants to create a web-browser. Seems more like
> an application to extract data only from a page.

Did I suggest creating a web browser? VB can do this without the
use of the HTML controls. It would take a little more effort to get
JavaScript going and other such things, but does he care about the
JavaScript? Or does he just want the HTML and the <a> tags inside
the HTML?

Parsing the CSS could also get quite involved as far as fonts and
such go, and that might make things more complicated. A rich-textbox
control could help to get better rendering in that case, if rendering
was what was wanted. <shrug>

It sounds like he wants to extract links and other specific items
from the text. Perhaps he's connecting to websites to extract email
addresses and spam us all? Uh oh. How'd that pop into my head?

Hmmm, maybe we're providing too much help here. <g>

Perhaps he just wants some kind of tool rather than coding it
himself?

Hope this helps.
Jim Carlock
Author
23 Mar 2006 6:09 PM
Veign
I was going by this statement:
"It's not all that hard to create your own simple web-browser, which
grabs text and perhaps css documents for layout."

All good.  Sometimes it's hard to answer a question when you get only the
question, out of the context of its use...

--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--


Author
23 Mar 2006 7:13 PM
Jim Carlock
"Veign" <NOSPAMinveign@veign.com> wrote:
> I took this statement:
> "It's not all that hard to create your own simple web-browser,
> which grabs text and perhaps css documents for layout."

<g> Yeah, I was throwing it out as a suggestion and was going to
wait and see what kind of reply came back. Kind of like baiting a
fish. We probably won't hear from him until after 5pm EST.

I tried the GetStuff item in the link you provided.

http://www.veign.com/vrc_codeview.asp?type=app&id=130

I discovered some very interesting things with it with regard to
firewalls.

(1) Microsoft's firewall with XP SP 2 allows it to connect.
(2) An early (free) version of ZoneAlarm (version 2.6.88) allows it
to connect and only prompts when I go to download the image files;
the initial HTML read connects and bypasses the firewall 100%.
(3) Kerio version 2.15 behaves in the same manner as the
ZoneAlarm product: it allows the initial connection without
questioning it, seeing the connection as being made via Internet
Explorer rather than your application.

Looks like it could help lock down firewall software.

That's a neat little app there. Thanks, Chris.

You mind if I pass the link along to some other folks to check
out some other firewalls with it?

Jim Carlock
Post replies to the group.


Author
23 Mar 2006 9:01 PM
Veign
"Jim Carlock" <anonymous@localhost> wrote in message
news:eBdfz4qTGHA.3976@TK2MSFTNGP10.phx.gbl...
> "Veign" <NOSPAMinveign@veign.com> wrote:
>
> I tried the GetStuff item in the link you provided.
>
> http://www.veign.com/vrc_codeview.asp?type=app&id=130
>
>
> You mind if I pass the link along to some other folks to check
> out some other firewalls with it?

Go for it.  Feel free to compile it and distribute at will....


--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--
Author
24 Mar 2006 11:39 PM
Tomasz Klim
> Did I suggest creating a web-browser ? VB can do this without the
> use of the HTML controls. It would take alittle more effort to get
> javascript going and other such things, but does he care about the
> javascript? Or does he just want the HTML and the <a> tags inside
> the HTML.

Yes, I care about JavaScript and other techniques. So regular expressions
aren't good enough for me.

> A rich-textbox control could help to get better rendering in that case,
> if rendering was what was wanted. <shrug>

Yes, I want to render the page, evaluate all javascript etc., and then
convert it to text.

> It sounds like he wants to extract links and other specific items
> from the text. Perhaps he's connecting to websites to extract email
> addresses and spam us all? Ugh oh. How'd that pop into my head?

If I only wanted to extract email addresses, I would strip the \r \n \t " '
characters out of the HTML page, replace the < > [ ] { } ; characters with a
single space, split everything on spaces, and check whether each element
contains http:// https:// ftp:// mailto: ^www.(.*) ^ftp.(.*) ^href=(.*) or
is a valid email address. That's a very simple example... So, as you see, I
don't need such a tool for this.
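
For reference, the simple approach described above could look roughly
like this in Python (the language, the names, and the deliberately loose
email pattern are all assumptions made for this sketch; the steps follow
the description, not any real tool).

```python
import re

STRIP_CHARS = "\r\n\t\"'"      # characters removed outright
SPACE_CHARS = "<>[]{};"        # characters turned into a single space
SCHEMES = ("http://", "https://", "ftp://", "mailto:")
STARTS_WITH = re.compile(r"^(www\.|ftp\.|href=)")
# Loose check only; real email address validation is far more involved.
EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def find_candidates(page: str) -> list:
    """Return tokens that look like links or email addresses."""
    cleaned = page.translate({ord(c): None for c in STRIP_CHARS})
    cleaned = cleaned.translate({ord(c): " " for c in SPACE_CHARS})
    hits = []
    for token in cleaned.split():
        if (any(s in token for s in SCHEMES)
                or STARTS_WITH.match(token)
                or EMAIL.match(token)):
            hits.append(token)
    return hits


if __name__ == "__main__":
    sample = ('<p>Write to bob@example.com or visit '
              '<a href="http://example.com/x">here</a></p>')
    print(find_candidates(sample))
```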


Author
23 Mar 2006 4:29 PM
Veign
Here's an option:
http://www.veign.com/vrc_codeview.asp?type=app&id=130

I have used the above in conjunction with regular expressions to parse out
all data from a webpage.  As far as getting around embedded CSS, that's
pretty easy: 1) it will either be contained within a <STYLE> tag, so you
could throw out the tag and everything inside it, or 2) it will be contained
within a style attribute of a tag, so when you parse out the tag you could
throw out this attribute, or all attributes, and keep only the data between
tags.
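
The two cases above can be sketched with a pair of regular expressions.
This is written in Python for illustration (not the VB code behind the
link), the names are invented, and the attribute pattern is naive: it
would also match the text " style=..." if that appeared outside a tag.

```python
import re

# Case 1: a <style> element together with everything inside it.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>",
                         re.IGNORECASE | re.DOTALL)

# Case 2: a style="..." attribute (double-quoted, single-quoted, or bare).
STYLE_ATTR = re.compile(r"""\sstyle\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""",
                        re.IGNORECASE)


def drop_css(markup: str) -> str:
    """Remove <style> blocks first, then inline style attributes."""
    without_blocks = STYLE_BLOCK.sub("", markup)
    return STYLE_ATTR.sub("", without_blocks)


if __name__ == "__main__":
    sample = ('<STYLE type="text/css">p { color: red }</STYLE>'
              '<p style="font-weight: bold" class="x">Hi</p>')
    print(drop_css(sample))
```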


--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--

