html parsing

**daeya2010** · August 17th, 2009, 08:56 AM

Hey guys!
I'm kind of new to the field. I need to get the text from inside an HTML tag (more specifically, <div class="only this div">). I need to exclude all other divs. I found this nice example http://www.developer.com/net/csharp/...0918_2230091_1 but I can only get the name and value of the tag. Is it possible to also get the entire text?
Thanks in advance

**monalin** · August 17th, 2009, 09:53 AM

Welcome to the forum,

If you can give me a sample of what your input looks like and what you want for the output then I can probably help you. Please try to be as specific as possible.

**daeya2010** · August 17th, 2009, 10:06 AM

example:
<div id="farright">
<div id="ads-right">
<div id="ads-right-twotop">
content from ads right twotop
</div> <!-- end div ads-right-twotop -->
content ads right
</div> <!-- end div ads-right -->
</div> <!-- end div farright -->

I would like to get only the content of the ads-right div. The HTML page is obviously much larger, but I think it's a good example.

**monalin** · August 17th, 2009, 10:35 AM

So what you're saying is, if you had:

Code:

<div id="ads-right">
    <div>Blah blah blah</div>
</div>

You want

Code:

<div> Blah blah blah</div>

or

Code:

div id="ads-right"

**daeya2010** · August 17th, 2009, 10:46 AM

I need <div> Blah blah blah</div>

**AshBrennan** · August 17th, 2009, 11:51 AM

Code:

const string htmlTag = @"<div>(.*?)</div>";

Make sure to add a using directive for the System.Text.RegularExpressions namespace.
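As written, that pattern only matches a bare <div> with no attributes and stops at the first </div>, so on its own it would not return the contents of <div id="ads-right"> when that div contains other divs. A minimal sketch of one way around this, assuming reasonably well-formed markup: locate the opening tag by its id, then count opening and closing div tags until the depth returns to zero. The class and method names (DivExtractor, GetInnerHtmlById) are hypothetical, and this is not a substitute for a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivExtractor
{
    // Matches any opening or closing div tag; group 1 is "/" for a closing tag.
    private static readonly Regex DivTag =
        new Regex(@"<(/?)div\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the inner HTML of the first div whose id attribute equals `id`,
    // or null if no such div is found or it is never closed.
    // Assumption: div tags are balanced and do not appear inside comments or scripts.
    public static string GetInnerHtmlById(string html, string id)
    {
        var openingTag = new Regex(
            @"<div\b[^>]*\bid\s*=\s*[""']" + Regex.Escape(id) + @"[""'][^>]*>",
            RegexOptions.IgnoreCase);

        Match start = openingTag.Match(html);
        if (!start.Success)
            return null;

        int contentStart = start.Index + start.Length;
        int depth = 1; // we are inside the target div

        // Walk the remaining div tags, tracking the nesting depth.
        for (Match m = DivTag.Match(html, contentStart); m.Success; m = m.NextMatch())
        {
            depth += m.Groups[1].Value == "/" ? -1 : 1;
            if (depth == 0)
                return html.Substring(contentStart, m.Index - contentStart);
        }
        return null; // the target div was never closed
    }
}
```

Calling DivExtractor.GetInnerHtmlById(html, "ads-right") on the sample from earlier in the thread should return everything between the ads-right opening tag and its matching closing tag, nested divs included.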

**monalin** · August 17th, 2009, 12:00 PM

Well... that sucks, hah. You can't do that with the parser you linked in your earlier post. There's no simple way to parse HTML without writing your own parser. There is a .NET class called WebBrowser that works really well for this type of thing, because it lets you walk through the HTML easily. Ironically, though, if you're using this class in a website project it's more difficult to get working, because it must run on an STA thread and it has a couple of events that must be handled.

All very possible, but it takes more than two lines of code and I don't have the time right now to write the complete, functioning version. You may be able to find some examples on Google of how to use the WebBrowser class. If you have any specific questions on how to get it to work, I'll do my best to help you out.

It's very likely that an HTML parser already exists that will work for you, but I don't know of one. One of the other posters here may have an idea.
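For reference, the STA-thread setup described above can be sketched roughly as follows. This is an untested outline under stated assumptions, not complete production code: it assumes a reference to System.Windows.Forms, the names BrowserParser and GetInnerHtmlById are hypothetical, and it handles only the DocumentCompleted event, with no timeout or error handling.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class BrowserParser
{
    // Loads `html` into a WebBrowser control on a dedicated STA thread and
    // returns the InnerHtml of the element with the given id (null if absent).
    public static string GetInnerHtmlById(string html, string id)
    {
        string result = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;

                // Fires once the document has been parsed.
                browser.DocumentCompleted += (sender, e) =>
                {
                    HtmlElement element = browser.Document.GetElementById(id);
                    if (element != null)
                        result = element.InnerHtml;

                    // Stop the message loop so the thread can finish.
                    Application.ExitThread();
                };

                browser.DocumentText = html;

                // The control needs a message loop to raise its events.
                Application.Run();
            }
        });

        // WebBrowser wraps an ActiveX control, so the thread must be STA.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        return result;
    }
}
```

Note that the control normalizes markup, so the returned string may differ from the source in tag case or attribute quoting.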

**code?** · August 17th, 2009, 04:55 PM

You'll need to use mshtml.dll and then use Microsoft.

You'll open up new classes that are real *****es to use, but once you know how to work them everything slides right in.

Just use a WebClient to download the HTML string, then load it into an mshtml HTMLDocument (through the IHTMLDocument2/3/4/5 interfaces) and manipulate it from there.

That's my solution. It does not involve any use of WebBrowser.
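A rough sketch of that approach, assuming a project reference to the Microsoft.mshtml interop assembly. The class and method names (MshtmlParser, Download, GetInnerHtmlById) are hypothetical, and the code assumes simple static markup; pages that rely on scripts or load content asynchronously may need extra handling.

```csharp
using System.Net;
using mshtml;

public static class MshtmlParser
{
    // Downloads the page source as a string.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Parses `html` with mshtml and returns the innerHTML of the element
    // with the given id, or null if there is no such element.
    public static string GetInnerHtmlById(string html, string id)
    {
        // HTMLDocument is the COM document object; IHTMLDocument2 exposes write/close.
        var document = (IHTMLDocument2)new HTMLDocument();
        document.write(new object[] { html });
        document.close();

        // getElementById is defined on IHTMLDocument3.
        IHTMLElement element = ((IHTMLDocument3)document).getElementById(id);
        return element == null ? null : element.innerHTML;
    }
}
```

For the sample in this thread, MshtmlParser.GetInnerHtmlById(html, "ads-right") would be the call to try.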

**monalin** · August 17th, 2009, 04:59 PM

Originally Posted by code?

You'll need to use mshtml.dll and then use Microsoft.

You'll open up new classes that are real *****es to use, but once you know how to work them everything slides right in.

Just use a WebClient to download the html string, or manipulate whatever. Insert it into HtmlDocument3/4/5/6 class htmlContent or whatever.

That's my solution. It does not involve any use of WebBrowser.

Yes, both of them use mshtml.dll and are a pain to use. I've used both before, but like you said, once you get it to work it's very useful. I had to do it once to create a screenshot of any website... that was a fun little project, hah.
