How to obtain HtmlDocument from a string of a webpage?
RickyWh
December 25th, 2008, 04:38 PM
Hi,
In my Windows Forms application I visit a webpage with the WebBrowser control and obtain an HtmlDocument from its Document property, which works fine.
Now I want to write a console application that uses the WebClient or WebRequest class (because they seem faster at fetching pages than WebBrowser), and I still want to obtain an HtmlDocument. There seem to be lots of ways to get a string representation of a webpage with these classes.
Is there a way to obtain an HtmlDocument object by parsing that string? This would be very handy in my situation, because I could then use the DOM instead of building complex regular expressions to match patterns in the HTML.
Any ideas, much appreciated
please and thanks,
marceln
December 25th, 2008, 04:44 PM
You can use the HtmlDocument.DomDocument unmanaged pointer to write the HTML.
Use the IHTMLDocument2::writeln or write method (both unmanaged, remember).
http://msdn.microsoft.com/en-us/library/aa752640(VS.85).aspx
RickyWh
December 25th, 2008, 04:50 PM
Thanks so much, reading on MSDN about it now
RickyWh
December 25th, 2008, 05:25 PM
OK, I have no idea how writeln or write would let me generate an HtmlDocument from a string that holds all the HTML of the page.
I'm using:
WebClient myClient = new WebClient();
string webPageString = myClient.DownloadString(url);
Now I want to take webPageString and somehow end up with an HtmlDocument so that I can use the DOM. Perhaps I'm doing something wrong; maybe there is an easier way to use the DOM on a webpage without using WebBrowser?
marceln
December 25th, 2008, 05:33 PM
It is possible but there's an easier solution :).
Assuming you already have the HTML string, you can do (C# code):
HtmlDocument doc = webBrowser1.Document.OpenNew(true);
doc.Write(webPageString);
...as seen here: http://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument.write.aspx
RickyWh
December 25th, 2008, 05:40 PM
marceln, well, I'm not actually using WebBrowser, I'm using WebClient so your solution won't work, but thanks for responding.
I've located something called the Html Agility Pack that does what I want, but I'm hoping the .NET Framework has a solution that I'm overlooking, as I'd like to be able to say that no third-party classes are needed.
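For reference, a minimal sketch of the Html Agility Pack route mentioned above. It assumes the project references the HtmlAgilityPack assembly; the class name, the placeholder URL, and the choice to print the text of every <p> element are illustrative assumptions only, not something from this thread.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // third-party library, must be referenced separately

class AgilityPackSketch
{
    static void Main(string[] args)
    {
        // Placeholder URL (assumption); pass a real one on the command line.
        string url = args.Length > 0 ? args[0] : "http://example.com/";

        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // Fully qualified to avoid confusion with System.Windows.Forms.HtmlDocument.
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                Console.WriteLine(p.InnerText);
            }
        }
    }
}
```

This parses the string entirely in managed code, so no WebBrowser control or message loop is involved, at the cost of the extra dependency.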
Still looking for a good solution,
marceln
December 25th, 2008, 05:44 PM
You can't have an HtmlDocument without a WebBrowser. Maybe if you explain WHAT you are trying to achieve, we can find an alternative solution.
marceln
December 25th, 2008, 06:03 PM
To clarify, I thought you wanted to download an HTML string from somewhere and then load it into a WebBrowser control, without using its Url property.
That is how I read this part of your first post:
"Is there a way to obtain an HtmlDocument object by parsing that string? This would be very handy in my situation, because I could then use the DOM instead of building complex regular expressions to match patterns in the HTML."
RickyWh
December 25th, 2008, 06:14 PM
OK, this is what I want to do.
I will be visiting 1,000 well-formed HTML pages (proper markup, no errors at all). I want to grab the data inside all of the <p> (paragraph) tags on each page and store it (the storing part I can do, no problem).
So I want to visit each page and use the DOM to get the inner text of the paragraphs. However, WebBrowser is too slow, and I want to do it with either WebClient or WebRequest instead.
marceln
December 25th, 2008, 06:46 PM
OK, I see now.
You can either use a custom HTML parser, or use an MSHTML.HTMLDocument and load it as in this example: http://www.codeguru.com/vb/vb_internet/html/article.php/c4815
Then you can iterate through it and get the data you're interested in. Note that if you do it this way you'll no longer need the WebRequest and WebClient functionality.
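As a rough illustration of the MSHTML route, here is a sketch of a variant that keeps WebClient for the download and writes the resulting string into an MSHTML document, then reads the <p> elements. This is not the code from the linked article, which loads the page by URL. It assumes a COM reference to "Microsoft HTML Object Library" has been added to the project; the class name and the placeholder URL are hypothetical. Note that MSHTML may still run scripts contained in the page when the markup is written.

```csharp
using System;
using System.Net;

class MshtmlParagraphSketch
{
    [STAThread] // MSHTML is an apartment-threaded COM component
    static void Main(string[] args)
    {
        // Placeholder URL (assumption); pass a real one on the command line.
        string url = args.Length > 0 ? args[0] : "http://example.com/";

        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // Create an empty MSHTML document and write the downloaded markup into it.
        mshtml.HTMLDocument doc = new mshtml.HTMLDocument();
        mshtml.IHTMLDocument2 doc2 = (mshtml.IHTMLDocument2)doc;
        doc2.write(new object[] { html });
        doc2.close();

        // getElementsByTagName is exposed through IHTMLDocument3.
        mshtml.IHTMLDocument3 doc3 = (mshtml.IHTMLDocument3)doc;
        foreach (mshtml.IHTMLElement p in doc3.getElementsByTagName("p"))
        {
            Console.WriteLine(p.innerText);
        }
    }
}
```

Because the markup is supplied as a string, this variant does not depend on MSHTML finishing its own download before the elements are read.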
tom_codon
December 28th, 2008, 05:26 PM
This code is very nice, but I'm a noob and I'm using C#, not VB6.
Can you convert this code to C#?
Thanks so much
marceln
December 28th, 2008, 06:00 PM
It is straightforward. The needed steps are:
- Open your C# project;
- Right-click the project and select "Add Reference";
- In the "Add Reference" dialog select the COM tab;
- Select "Microsoft HTML Object Library";
- Click "OK";
- Now translate the 8-10 lines of code from VB to C#. The types are exactly the same.
Good Luck!
tom_codon
December 28th, 2008, 06:08 PM
I did that, but it doesn't work.
Here is my C# code:
mshtml.HTMLLinkElement objLink = null;
mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
mshtml.HTMLDocument objDocument = null;
objDocument = objMSHTML.createDocumentFromUrl("http://google.com");
and the error:
Error 1 No overload for method 'createDocumentFromUrl' takes '1'
Any ideas?
marceln
December 28th, 2008, 06:21 PM
Try with:
mshtml.HTMLLinkElement objLink = null;
mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
mshtml.IHTMLDocument2 objDocument = null;
// The interop wrapper returns the new document, so pass only the URL and the options string.
objDocument = objMSHTML.createDocumentFromUrl("http://google.com", null);
see: http://msdn.microsoft.com/en-us/library/aa752523(VS.85).aspx
tom_codon
December 28th, 2008, 06:38 PM
It doesn't work yet :(
Thanks for the answer!
Any other ideas?
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.