
Thread: Html 2 txt

  1. #1
    Join Date
    Jan 2003
    Posts
    2

    Html 2 txt

    I was wondering if there are any algorithms for converting html to text. I've found some utilities, but those aren't what I need.

    I'm developing an application, and one of the specs is HTML-to-text conversion.

    Your help would be greatly appreciated.

  2. #2
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    309
    In the MSHTML API library you can open an HTML file and request just the text from it. This is better than trying to write your own. You will find it under References as "Microsoft HTML Object Library".

  3. #3
    Join Date
    Jan 2003
    Posts
    2
    Thank you,

    But I'm not using Windows, and I need to write my own.

  4. #4
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    309
    Ok, what language and OS are you doing this under?

    As for parsing the data, you may have to walk the character data. My first thought is that any time you see >hhhhh<, the text is the data between > and <, but you might run into a < or > that is part of the text, so you have to validate the tags.
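A minimal sketch of that character walk, assuming Python (the thread never settles on a language). The function name `strip_tags` is hypothetical, and the "validation" here is only a rough heuristic: a `<` counts as the start of a tag only when a letter, `/`, `!`, or `?` follows it, and a `>` inside a quoted attribute value does not end the tag. Comments and script blocks are not treated specially.

```python
def strip_tags(html: str) -> str:
    """Keep only the characters that sit outside <...> tags."""
    out = []
    in_tag = False
    quote = None  # quote character currently open inside a tag, if any
    for i, ch in enumerate(html):
        if in_tag:
            if quote:
                # Inside an attribute value: only the matching quote ends it.
                if ch == quote:
                    quote = None
            elif ch in ('"', "'"):
                quote = ch
            elif ch == ">":
                in_tag = False
        elif (ch == "<" and i + 1 < len(html)
              and (html[i + 1].isalpha() or html[i + 1] in "/!?")):
            # Looks like a real tag opener rather than a stray "<" in the text.
            in_tag = True
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(strip_tags('<p>a < b</p>'))
    print(strip_tags('<a href="x>y">link</a>'))
```

Checking the character after `<` is the cheapest form of tag validation; checking the name against a list of known elements, as suggested in this post, would be stricter.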

    However, not knowing how you are looking at the data, you may need to know all the element names to identify the tags properly. For that, check out Quadzilla, as he has a nice listing of all the keywords valid for HTML.

  5. #5
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    309
    Found this today, did not even realize it was there.

    WDG Validator

    It is an HTML source validator, and they have GNU-licensed source code available. It will help speed development along if you cannot find anything else and need to write your own, provided you know Perl or can convert it to the language you need.

  6. #6
    Join Date
    Feb 2003
    Posts
    5
    Still not sure what you're trying to accomplish. Do you know and can you explain it? Sounds like you want the output to be plain (unformatted) text, right?

    You've said you have to write this yourself. But do you also have to determine the logic behind it? If the answer is "yes," do NOT read below here!
    .
    .
    .
    .
    .

    I think it can be broken down into the following tasks, in roughly this order:
    o Delete everything before the <body> tag
    o Delete all line breaks (and I do not mean <br>s) that are *within* HTML blocks (this may be the hardest part)
    o Convert paragraph breaks into two line breaks (probably, depends on what you want)
    o Perhaps convert <br>s to a line break
    o Perhaps convert headings <hX> to line breaks (before and/or after)
    o Convert <li>...</li>s and probably some other tags to a line break
    o Somehow deal with tables (very difficult!)
    o Delete all HTML tags (!! right?)
    o Convert HTML entities (&nbsp; &amp; and many more)
    That's all I can think of, but I may have missed something.
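The steps above can be sketched roughly as follows, assuming Python and regular expressions rather than a real parser. The function name `html_to_text` is hypothetical, and the table handling (a tab after each cell, a line break after each row) is only one crude assumption about what "deal with tables" could mean. Entities are decoded last so that a decoded `&lt;` cannot be mistaken for a tag.

```python
import html
import re


def html_to_text(source: str) -> str:
    """Rough HTML-to-plain-text pass following the task list above."""
    # Delete everything before (and including) the <body> tag, if there is one.
    m = re.search(r"<body\b[^>]*>", source, flags=re.I)
    text = source[m.end():] if m else source
    # Line breaks in the source carry no meaning: collapse all whitespace.
    text = re.sub(r"\s+", " ", text)
    # Paragraph breaks become two line breaks.
    text = re.sub(r"</?p\b[^>]*>", "\n\n", text, flags=re.I)
    # <br> becomes one line break.
    text = re.sub(r"<br\b[^>]*>", "\n", text, flags=re.I)
    # Headings get a line break before and after.
    text = re.sub(r"</?h[1-6]\b[^>]*>", "\n", text, flags=re.I)
    # End of a list item becomes a line break.
    text = re.sub(r"</li\s*>", "\n", text, flags=re.I)
    # Tables (assumption): tab after each cell, line break after each row.
    text = re.sub(r"</t[dh]\s*>", "\t", text, flags=re.I)
    text = re.sub(r"</tr\s*>", "\n", text, flags=re.I)
    # Delete all remaining tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Convert HTML entities (standard-library helper).
    text = html.unescape(text)
    # Tidy: trim blanks around line breaks, allow at most one empty line.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    page = ("<html><head><title>T</title></head><body><h1>Title</h1>\n"
            "<p>One &amp;\ntwo</p><ul><li>a</li><li>b</li></ul></body></html>")
    print(html_to_text(page))
```

A regex pass like this is fragile on malformed markup; it is meant only to show the order of the steps, not to replace a proper parser.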

    If my assumption is correct and the output is just plain text, the only formatting the output will have is line breaks, so line breaks will be your biggest challenge.

    Larry

  7. #7
    Join Date
    Feb 2003
    Posts
    5
    Everything came through in my last post, except the bit about HTML entities. I meant (and it said but was stripped) things like
    & nbsp ;
    & amp ;
    & #anumber;
    and many more
    (but without the spaces shown here)
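For anyone who has to hand-write the entity step, here is a minimal sketch, assuming Python. The name `decode_entities` and the tiny `NAMED` table are hypothetical; a real converter would need the full list of named entities. Anything the table does not recognize is left untouched.

```python
import re

# Small illustrative table only; HTML defines many more named entities.
NAMED = {"nbsp": "\u00a0", "amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Matches &name; as well as decimal (&#65;) and hex (&#x41;) references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    """Replace known named and numeric entities; leave unknown ones as-is."""
    def repl(m):
        body = m.group(1)
        if body[0] == "#":
            # Numeric reference: hex if it starts with #x, otherwise decimal.
            code = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
            try:
                return chr(code)
            except (ValueError, OverflowError):
                return m.group(0)  # not a valid code point
        return NAMED.get(body, m.group(0))
    return _ENTITY.sub(repl, text)


if __name__ == "__main__":
    print(decode_entities("Fish &amp; chips &#65;&#x42; &bogus;"))
```

Because the substitution is a single pass, `&amp;lt;` decodes to the literal text `&lt;` and is not decoded a second time.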

    Larry
