-
December 25th, 2008, 05:38 PM
#1
How to obtain HtmlDocument from a string of a webpage?
Hi,
In my Windows Forms application I used to visit a webpage using the WebBrowser control and obtain an HtmlDocument from its Document property, which works fine in a WinForms application.
Now I want to use a Console Application with the WebClient or WebRequest class (because they seem faster at visiting pages than WebBrowser) and still obtain an HtmlDocument. It seems there are lots of ways to obtain a string representation of a webpage using these classes.
Is there a way to obtain an HtmlDocument object by somehow parsing the string to rebuild a possible HtmlDocument? This would be very handy in my situation, because I'd then be able to use the DOM instead of trying to build complex regular expressions to match patterns in the HTML.
Any ideas, much appreciated
please and thanks,
Ricky,
-
December 25th, 2008, 05:44 PM
#2
Re: How to obtain HtmlDocument from a string of a webpage?
You can use the HtmlDocument.DomDocument unmanaged pointer to write the HTML.
Use the IHTMLDocument2::writeln or write method (both unmanaged, remember).
http://msdn.microsoft.com/en-us/libr...40(VS.85).aspx
-
December 25th, 2008, 05:50 PM
#3
Re: How to obtain HtmlDocument from a string of a webpage?
Thanks so much, reading on MSDN about it now
Ricky,
-
December 25th, 2008, 06:25 PM
#4
Re: How to obtain HtmlDocument from a string of a webpage?
Ok, I have no idea how writeln or write would allow me to generate an HtmlDocument from a string that holds all the html from the page.
I'm using
Code:
WebClient myClient = new WebClient();
string webPageString = myClient.DownloadString(url);
Now I want to take webPageString and somehow end up with an HtmlDocument so that I can use the DOM. Perhaps I'm doing something wrong; maybe there is an easier way to use the DOM on a webpage without using WebBrowser?
Ricky,
-
December 25th, 2008, 06:33 PM
#5
Re: How to obtain HtmlDocument from a string of a webpage?
It is possible, but there's an easier solution.
Assuming you already have the HTML string, you can do (C# code):
Code:
HtmlDocument doc = webBrowser1.Document.OpenNew(true);
doc.Write(webPageString);
...as seen here: http://msdn.microsoft.com/en-us/libr...ent.write.aspx
-
December 25th, 2008, 06:40 PM
#6
Re: How to obtain HtmlDocument from a string of a webpage?
marceln, well, I'm not actually using WebBrowser, I'm using WebClient so your solution won't work, but thanks for responding.
I've located something called the Html Agility Pack that allows what I want to do, but I'm hoping the .NET Framework has a solution that I'm overlooking, as I'd like to be able to say no third-party classes are needed.
Still looking for a good solution,
Ricky,
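For comparison, the Html Agility Pack route mentioned above might look roughly like the following. This is only a minimal sketch: it assumes the third-party HtmlAgilityPack assembly is referenced, the URL and the class name ParagraphDump are placeholders that do not come from the thread, and error handling is left out.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

class ParagraphDump
{
    static void Main()
    {
        // Placeholder URL (assumption, not from the thread).
        string url = "http://example.com/";

        string webPageString;
        using (WebClient myClient = new WebClient())
        {
            webPageString = myClient.DownloadString(url);
        }

        // HtmlAgilityPack.HtmlDocument is unrelated to
        // System.Windows.Forms.HtmlDocument; it needs no WebBrowser.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(webPageString);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            Console.WriteLine("No <p> elements found.");
            return;
        }

        foreach (HtmlNode p in paragraphs)
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```

Note that this gives a DOM-like tree of its own, not a System.Windows.Forms.HtmlDocument.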
-
December 25th, 2008, 06:44 PM
#7
Re: How to obtain HtmlDocument from a string of a webpage?
You can't have an HtmlDocument without a WebBrowser. Maybe if you explain WHAT you are trying to achieve, we can find an alternate solution.
-
December 25th, 2008, 07:03 PM
#8
Re: How to obtain HtmlDocument from a string of a webpage?
To clarify, I thought you wanted to download an HTML string from somewhere and then load it into a WebBrowser control, without using its Url property.
This is what I understood from your first post:
Is there a way to obtain an HtmlDocument object by somehow parsing the string to rebuild a possible HtmlDocument? This would be very handy in my situation, because I'd then be able to use the DOM instead of trying to build complex regular expressions to match patterns in the HTML.
-
December 25th, 2008, 07:14 PM
#9
Re: How to obtain HtmlDocument from a string of a webpage?
Ok, this is what I want to do.
I will be visiting 1,000 correctly formed html webpages. (proper markup, no errors at all). I will want to grab the data inside all of the <p> (paragraph) tags from each page and store it (I can do this no problem).
So I want to visit the webpage and use the DOM to get the paragraph inner text. However WebBrowser is too slow and I want to somehow be able to do it with either WebClient or WebRequest instead.
Ricky,
-
December 25th, 2008, 07:46 PM
#10
Re: How to obtain HtmlDocument from a string of a webpage?
Ok, I see now.
You can either use a custom HTML parser, or use an MSHTML.HTMLDocument and load it like in this example: http://www.codeguru.com/vb/vb_intern...icle.php/c4815
Then, you can iterate through it and get the data you're interested in. Note that doing things this way you'll no longer need the WebRequest and WebClient functionality.
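As a variation that keeps the already downloaded string, the write approach suggested earlier in the thread can be combined with mshtml.HTMLDocument. The following is an untested sketch under several assumptions: the project has a COM reference to "Microsoft HTML Object Library", the code runs on Windows in a single-threaded apartment, and the URL and the class name MshtmlFromString are placeholders that do not come from the thread.

```csharp
using System;
using System.Net;
using mshtml; // COM reference: "Microsoft HTML Object Library"

class MshtmlFromString
{
    [STAThread] // MSHTML is an apartment-threaded COM component
    static void Main()
    {
        // Placeholder URL (assumption, not from the thread).
        string url = "http://example.com/";

        string webPageString;
        using (WebClient myClient = new WebClient())
        {
            webPageString = myClient.DownloadString(url);
        }

        // Create an empty MSHTML document and feed it the markup.
        IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
        doc.write(new object[] { webPageString });
        doc.close();

        // getElementsByTagName lives on IHTMLDocument3.
        IHTMLDocument3 doc3 = (IHTMLDocument3)doc;
        foreach (IHTMLElement p in doc3.getElementsByTagName("p"))
        {
            Console.WriteLine(p.innerText);
        }
    }
}
```

The result is an MSHTML document object, not a System.Windows.Forms.HtmlDocument, but it exposes the same underlying DOM.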
-
December 28th, 2008, 06:26 PM
#11
Re: How to obtain HtmlDocument from a string of a webpage?
 Originally Posted by marceln
Ok, I see now.
You can either use a custom HTML parser, or use an MSHTML.HTMLDocument and load it like in this example: http://www.codeguru.com/vb/vb_intern...icle.php/c4815
Then, you can iterate through it and get the data you're interested in. Note that doing things this way you'll no longer need the WebRequest and WebClient functionality.
This code is very nice, but I'm a noob and I'm using C#, not VB6.
Can you convert this code to C#?
Thanks so much
-
December 28th, 2008, 07:00 PM
#12
Re: How to obtain HtmlDocument from a string of a webpage?
It is straightforward. The needed steps are:
- Open your C# project;
- Right click the project and select "Add reference";
- In the "Add reference" dialog select the COM tab;
- Select "Microsoft HTML Object Library";
- Click "Ok";
- Now translate the 8-10 lines of code from VB to C#. The types are exactly the same.
Good Luck!
-
December 28th, 2008, 07:08 PM
#13
Re: How to obtain HtmlDocument from a string of a webpage?
 Originally Posted by marceln
It is straightforward. The needed steps are:
- Open your C# project;
- Right click the project and select "Add reference";
- In the "Add reference" dialog select the COM tab;
- Select "Microsoft HTML Object Library";
- Click "Ok";
- Now translate the 8-10 lines of code from VB to C#. The types are exactly the same.
Good Luck!
I did that, but it doesn't work.
Here is my C# code:
mshtml.HTMLLinkElement objLink = null;
mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
mshtml.HTMLDocument objDocument = null;
objDocument = objMSHTML.createDocumentFromUrl("http://google.com");
And the error:
Error 1 No overload for method 'createDocumentFromUrl' takes '1' arguments
Any ideas?
-
December 28th, 2008, 07:21 PM
#14
Re: How to obtain HtmlDocument from a string of a webpage?
try with:
Code:
mshtml.HTMLLinkElement objLink = null;
mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
mshtml.HTMLDocument objDocument = null;
mshtml.IHTMLDocument2 document = null;
objDocument = objMSHTML.createDocumentFromUrl("http://google.com", null, document);
see: http://msdn.microsoft.com/en-us/libr...23(VS.85).aspx
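For what it's worth, the native method has an out parameter for the new document, and the generated interop wrapper normally turns that into the return value, so the C# call would take two arguments (URL and options) and not three. The sketch below rests on that assumption and is untested. It also assumes a reference to System.Windows.Forms for message pumping, and the class name MshtmlFromUrl and the 30 second timeout are made up for illustration. The load is asynchronous, and on a freshly created document the call may fail or never complete.

```csharp
using System;
using System.Windows.Forms; // only for Application.DoEvents
using mshtml;               // COM reference: "Microsoft HTML Object Library"

class MshtmlFromUrl
{
    [STAThread] // MSHTML is an apartment-threaded COM component
    static void Main()
    {
        HTMLDocument objMSHTML = new HTMLDocument();

        // Assumed interop signature on IHTMLDocument4:
        //   IHTMLDocument2 createDocumentFromUrl(string bstrUrl, string bstrOptions)
        IHTMLDocument2 objDocument =
            ((IHTMLDocument4)objMSHTML).createDocumentFromUrl("http://google.com", null);

        // Loading is asynchronous: pump messages until the document is ready,
        // giving up after an arbitrary 30 seconds (hypothetical limit).
        DateTime deadline = DateTime.Now.AddSeconds(30);
        while (objDocument.readyState != "complete" && DateTime.Now < deadline)
        {
            Application.DoEvents();
        }

        if (objDocument.readyState == "complete")
        {
            Console.WriteLine(objDocument.title);
        }
        else
        {
            Console.WriteLine("Document did not finish loading.");
        }
    }
}
```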
-
December 28th, 2008, 07:38 PM
#15
Re: How to obtain HtmlDocument from a string of a webpage?
 Originally Posted by marceln
try with:
Code:
mshtml.HTMLLinkElement objLink = null;
mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
mshtml.HTMLDocument objDocument = null;
mshtml.IHTMLDocument2 document = null;
objDocument = objMSHTML.createDocumentFromUrl("http://google.com", null, document);
see: http://msdn.microsoft.com/en-us/libr...23(VS.85).aspx
It still doesn't work.
Thanks for the answer!
Any other ideas?