-
February 16th, 2003, 06:24 PM
#1
Html 2 txt
I was wondering if there are any algorithms for converting html to text. I've found some utilities, but those aren't what I need.
I'm developing an application, and one of the specs is HTML-to-text conversion.
Your help would be greatly appreciated.
-
February 17th, 2003, 09:24 AM
#2
In the MSHTML API library you can open an HTML file and request just the text from it. This is better than trying to write your own. You will find it under references as "Microsoft HTML Object Library".
-
February 17th, 2003, 11:39 AM
#3
Thank you,
But I'm not using Windows, and I need to write my own.
-
February 18th, 2003, 05:45 AM
#4
Ok, what language and OS are you doing this under?
As for parsing the data, you may have to walk the character data. My first thought is that any time you see >hhhhh<, the text is the data between > and <. But you might run into a < or > that is part of the text, so you have to validate the tags.
However, without knowing how you are looking at the data, you may need to know all the element names to identify tags properly. For that, check out Quadzilla, as he has a nice listing of all the keywords valid for HTML.
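A minimal Python sketch of the "walk the character data" idea above. The language is an assumption, since the original poster has not said what they are using, and strip_tags is a hypothetical name. It treats < as the start of a tag only when the next character could begin one, which is a rough stand-in for validating against a full list of element names.

```python
def strip_tags(html: str) -> str:
    """Return the character data between tags, walking one character at a time."""
    out = []
    i = 0
    n = len(html)
    while i < n:
        ch = html[i]
        # Treat '<' as a tag opener only when it is followed by something
        # that can start a tag; otherwise keep it as literal text.
        if ch == "<" and i + 1 < n and (html[i + 1].isalpha() or html[i + 1] in "/!?"):
            end = html.find(">", i + 1)
            if end == -1:
                # Unterminated tag: keep the remainder as text and stop.
                out.append(html[i:])
                break
            i = end + 1  # skip the whole tag
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

This is deliberately crude: a > inside a quoted attribute value ends the tag too early, and text such as "a<b" is mistaken for a tag. Checking the tag name against a list of valid elements would tighten it up.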
-
February 28th, 2003, 05:31 AM
#5
Found this today, did not even realize it was there.
WDG Validator
It is an HTML source validator, and they make the source code available under a GNU license. It will help speed development along if you cannot find anything else and need to write your own, provided you know Perl or can convert it to the language you need.
-
March 1st, 2003, 01:53 AM
#6
Still not sure what you're trying to accomplish. Do you know and can you explain it? Sounds like you want the output to be plain (unformatted) text, right?
You've said you have to write this yourself. But do you also have to work out the logic behind it? If the answer is "yes," do NOT read below here!
.
.
.
.
.
I think it can be broken down into the following tasks, in roughly this order:
o Delete everything before the <body> tag
o Delete all line breaks (and I do not mean <br>s) that are *within* HTML blocks (this may be the hardest part)
o Convert paragraph breaks into two line breaks (probably, depends on what you want)
o Perhaps convert <br>s to a line break
o Perhaps convert headings <hX> to line breaks (before and/or after)
o Convert <li>...</li>s and probably some other tags to a line break
o Somehow deal with tables (very difficult!)
o Delete all HTML tags (!! right?)
o Convert HTML entities ( & and many more)
That's all I can think of, but I may have missed something.
If my assumption is correct and the output is just plain text, the only formatting the output will have is line breaks, so line breaks will be your biggest challenge.
Larry
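The task list above can be sketched roughly as follows. This is only an illustration in Python (an assumption, as the thread never names a language), html_to_text is a hypothetical name, and the regular expressions are a shortcut rather than a real parser. Tables are not handled beyond having their tags removed, and script or style content inside the body is left in place.

```python
import html
import re


def html_to_text(source: str) -> str:
    # 1. Drop everything up to and including the <body> tag, if present.
    match = re.search(r"<body\b[^>]*>", source, flags=re.IGNORECASE)
    if match:
        source = source[match.end():]
    # 2. Collapse source line breaks and whitespace runs into one space.
    text = re.sub(r"\s+", " ", source)
    # 3. Paragraph tags become two line breaks.
    text = re.sub(r"</?p\b[^>]*>", "\n\n", text, flags=re.IGNORECASE)
    # 4. <br> becomes one line break.
    text = re.sub(r"<br\b[^>]*>", "\n", text, flags=re.IGNORECASE)
    # 5. Headings get line breaks before and after.
    text = re.sub(r"</?h[1-6]\b[^>]*>", "\n\n", text, flags=re.IGNORECASE)
    # 6. Each list item starts on its own line.
    text = re.sub(r"<li\b[^>]*>", "\n", text, flags=re.IGNORECASE)
    # 7. Tables: not handled here; their tags are simply removed below.
    # 8. Delete every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # 9. Convert entities last, so a decoded "&lt;" is not mistaken for a tag.
    text = html.unescape(text)
    # Tidy up: trim spaces around line breaks, allow at most one blank line.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Usage, under the same assumptions:

```python
page = "<html><head><title>T</title></head><body><h1>Title</h1><p>One<br>two</p></body></html>"
print(html_to_text(page))
```

The order matters in two places: source whitespace is collapsed before any tag is turned into a line break, and entities are decoded only after the tags are gone.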
-
March 1st, 2003, 02:00 AM
#7
Everything came through in my last post, except the bit about HTML entities. I meant (and it said but was stripped) things like
& nbsp ;
& amp ;
& #anumber;
and many more
(but without the spaces shown here)
Larry
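For the entity step, here is a small hand-rolled decoder in Python, offered as a sketch only. The language, the name decode_entities, and the tiny NAMED table are all assumptions made for illustration; the real list of named entities is much longer, and Python's own html.unescape already covers it if writing your own is not a requirement.

```python
import re

# A deliberately tiny table; the full HTML entity list is far larger.
NAMED = {"nbsp": "\u00a0", "amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Matches decimal (&#65;), hexadecimal (&#x41;) and named (&amp;) references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
            try:
                return chr(code)
            except (ValueError, OverflowError):
                return match.group(0)  # not a valid code point: leave as-is
        # Unknown names are left untouched rather than guessed at.
        return NAMED.get(body, match.group(0))

    # A single pass, so "&amp;lt;" decodes to "&lt;" and not to "<".
    return ENTITY.sub(replace, text)
```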