  1. #1
    Join Date
    Dec 2008
    Location
    Canada, Saskatchewan province
    Posts
    29

    How to obtain HtmlDocument from a string of a webpage?

    Hi,

    In my Windows Forms application I used to visit a webpage with the WebBrowser control and obtain an HtmlDocument from its Document property, which works fine in a WinForms application.

    I now want to use a console application with the WebClient or WebRequest class (because they seem faster at fetching pages than WebBrowser) and still obtain an HtmlDocument. There seem to be lots of ways to obtain a string representation of a webpage using these classes.

    Is there a way to obtain an HtmlDocument object by somehow parsing the string to rebuild a possible HtmlDocument? This would be very handy in my situation, as I'd then be able to use the DOM instead of trying to build complex regular expressions to match patterns in the HTML.

    Any ideas, much appreciated

    please and thanks,
    Ricky,

  2. #2
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    You can use the HtmlDocument.DomDocument unmanaged pointer to write the HTML.
    Use the IHTMLDocument2::writeln or write method (both unmanaged, remember).
    http://msdn.microsoft.com/en-us/libr...40(VS.85).aspx

  3. #3
    Join Date
    Dec 2008
    Location
    Canada, Saskatchewan province
    Posts
    29

    Re: How to obtain HtmlDocument from a string of a webpage?

    Thanks so much, reading on MSDN about it now
    Ricky,

  4. #4
    Join Date
    Dec 2008
    Location
    Canada, Saskatchewan province
    Posts
    29

    Re: How to obtain HtmlDocument from a string of a webpage?

    Ok, I have no idea how writeln or write would allow me to generate an HtmlDocument from a string that holds all the html from the page.

    I'm using

    Code:
    WebClient myClient = new WebClient();
    string webPageString = myClient.DownloadString(url);

    Now I want to take webPageString and somehow end up with an HtmlDocument so that I can use the DOM. Perhaps I'm doing something wrong; maybe there is an easier way to end up using the DOM on a webpage without using WebBrowser?
    Ricky,

  5. #5
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    It is possible, but there's an easier solution.
    Assuming you already have the HTML string, you can do (C# code):
    Code:
            HtmlDocument doc = webBrowser1.Document.OpenNew(true);
            doc.Write(webPageString);
    ...as seen here: http://msdn.microsoft.com/en-us/libr...ent.write.aspx

  6. #6
    Join Date
    Dec 2008
    Location
    Canada, Saskatchewan province
    Posts
    29

    Re: How to obtain HtmlDocument from a string of a webpage?

    marceln, well, I'm not actually using WebBrowser, I'm using WebClient so your solution won't work, but thanks for responding.

    I've located something called the Html Agility Pack that allows what I want to do, but I'm hoping the .NET Framework has a solution that I'm overlooking, as I'd like to be able to say no third-party classes are needed.

    Still looking for a good solution,
    Ricky,

  7. #7
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    You can't have an HtmlDocument without a WebBrowser. Maybe if you explain WHAT you are trying to achieve, we can find an alternative solution.

  8. #8
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    To clarify, I thought you wanted to download an HTML string from somewhere and then load it in a WebBrowser control, without using its Url property.
    This is what I understood from your first post:
    Is there a way to obtain an HtmlDocument object by somehow parsing the string to rebuild a possible HtmlDocument? This would be very handy in my situation, as I'd then be able to use the DOM instead of trying to build complex regular expressions to match patterns in the HTML.

  9. #9
    Join Date
    Dec 2008
    Location
    Canada, Saskatchewan province
    Posts
    29

    Re: How to obtain HtmlDocument from a string of a webpage?

    Ok, this is what I want to do.

    I will be visiting 1,000 well-formed HTML pages (proper markup, no errors at all). I want to grab the text inside all of the <p> (paragraph) tags on each page and store it (the storing part I can do, no problem).

    So I want to visit the webpage and use the DOM to get the paragraph inner text. However WebBrowser is too slow and I want to somehow be able to do it with either WebClient or WebRequest instead.
    Ricky,
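
    A framework-only sketch of the paragraph extraction described above, under a strong assumption: the pages are well-formed enough to parse as XML (XHTML). It uses LINQ to XML rather than HtmlDocument, so it is a swapped-in technique and not the WebBrowser DOM. The names ParagraphExtractor and GetParagraphTexts are made up for illustration, and the sample page stands in for the string returned by WebClient.DownloadString. HTML-only entities such as &nbsp; or unclosed tags would make XDocument.Parse throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: not part of the .NET Framework.
public static class ParagraphExtractor
{
    // Returns the text of every <p> element, in document order.
    // Assumes the input parses as XML; throws XmlException otherwise.
    public static List<string> GetParagraphTexts(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // Compare on LocalName so the optional XHTML namespace is ignored.
        return doc.Descendants()
                  .Where(e => string.Equals(e.Name.LocalName, "p",
                                            StringComparison.OrdinalIgnoreCase))
                  .Select(e => e.Value.Trim()) // Value concatenates nested text
                  .ToList();
    }

    public static void Main()
    {
        // Stand-in for: string page = myClient.DownloadString(url);
        string page =
            "<html><body><p>First</p><div><p>Second <b>bold</b></p></div></body></html>";

        foreach (string text in GetParagraphTexts(page))
        {
            Console.WriteLine(text); // prints "First", then "Second bold"
        }
    }
}
```

    If real pages turn out not to parse as XML, this approach does not apply and a tolerant HTML parser is needed instead.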

  10. #10
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    Ok, I see now.
    You can either use a custom HTML parser, or use an MSHTML.HTMLDocument and load it like in this example: http://www.codeguru.com/vb/vb_intern...icle.php/c4815

    Then, you can iterate through it and get the data you're interested in. Note that doing things this way you'll no longer need the WebRequest and WebClient functionality.

  11. #11
    Join Date
    Nov 2008
    Posts
    19

    Re: How to obtain HtmlDocument from a string of a webpage?

    Quote Originally Posted by marceln View Post
    Ok, I see now.
    You can either use a custom HTML parser, or use an MSHTML.HTMLDocument and load it like in this example: http://www.codeguru.com/vb/vb_intern...icle.php/c4815

    Then, you can iterate through it and get the data you're interested in. Note that doing things this way you'll no longer need the WebRequest and WebClient functionality.
    This code is very nice, but I'm a noob and I'm using C#, not VB6.
    Can you convert this code to C#, bro?

    Thanks so much

  12. #12
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    It is straightforward. The needed steps are:
    - Open your C# project;
    - Right-click the project and select "Add Reference";
    - In the "Add Reference" dialog select the COM tab;
    - Select "Microsoft HTML Object Library";
    - Click "OK";
    - Now translate the 8-10 lines of code from VB to C#. The types are exactly the same.

    Good Luck!
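
    With the reference from the steps above in place, one way to get a DOM from a string that is already downloaded (rather than from a URL, as in the linked VB article) is to write the string into an mshtml document. This is only a sketch under assumptions: it needs Windows, the "Microsoft HTML Object Library" COM reference, and an STA thread. MshtmlParagraphs and GetParagraphTexts are illustrative names, not part of any library, and the sample HTML stands in for the downloaded page.

```csharp
using System;
using System.Collections.Generic;
using mshtml; // COM reference: Microsoft HTML Object Library (assumed added)

// Hypothetical helper: not part of mshtml or the .NET Framework.
public static class MshtmlParagraphs
{
    // Parses the HTML string with MSHTML and returns each <p>'s inner text.
    public static List<string> GetParagraphTexts(string html)
    {
        HTMLDocument doc = new HTMLDocument();

        // write/close live on IHTMLDocument2; write takes an object array.
        IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
        doc2.write(new object[] { html });
        doc2.close();

        // getElementsByTagName lives on IHTMLDocument3.
        IHTMLDocument3 doc3 = (IHTMLDocument3)doc;
        List<string> result = new List<string>();
        foreach (IHTMLElement p in doc3.getElementsByTagName("p"))
        {
            result.Add(p.innerText);
        }
        return result;
    }

    [STAThread] // MSHTML is an apartment-threaded COM component
    public static void Main()
    {
        // Stand-in for the string returned by WebClient.DownloadString(url).
        string webPageString = "<html><body><p>Hello</p><p>World</p></body></html>";

        foreach (string text in GetParagraphTexts(webPageString))
        {
            Console.WriteLine(text);
        }
    }
}
```

    Unlike the approach in the linked article, this keeps WebClient for the download and uses MSHTML only for parsing.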

  13. #13
    Join Date
    Nov 2008
    Posts
    19

    Re: How to obtain HtmlDocument from a string of a webpage?

    Quote Originally Posted by marceln View Post
    It is straightforward. The needed steps are:
    - Open your C# project;
    - Right-click the project and select "Add Reference";
    - In the "Add Reference" dialog select the COM tab;
    - Select "Microsoft HTML Object Library";
    - Click "OK";
    - Now translate the 8-10 lines of code from VB to C#. The types are exactly the same.

    Good Luck!
    I did that, but it doesn't work.

    Here is my C# code:

    Code:
    mshtml.HTMLLinkElement objLink = null;
    mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
    mshtml.HTMLDocument objDocument = null;

    objDocument = objMSHTML.createDocumentFromUrl("http://google.com");

    and the error:

    Error 1 No overload for method 'createDocumentFromUrl' takes '1' arguments

    Any ideas?

  14. #14
    Join Date
    Feb 2007
    Location
    Craiova, Romania
    Posts
    326

    Re: How to obtain HtmlDocument from a string of a webpage?

    Try this:
    Code:
    mshtml.HTMLLinkElement objLink = null;
    mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
    mshtml.HTMLDocument objDocument = null;
    mshtml.IHTMLDocument2 document = null;
    objDocument = objMSHTML.createDocumentFromUrl("http://google.com", null, document);
    see: http://msdn.microsoft.com/en-us/libr...23(VS.85).aspx

  15. #15
    Join Date
    Nov 2008
    Posts
    19

    Re: How to obtain HtmlDocument from a string of a webpage?

    Quote Originally Posted by marceln View Post
    Try this:
    Code:
    mshtml.HTMLLinkElement objLink = null;
    mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
    mshtml.HTMLDocument objDocument = null;
    mshtml.IHTMLDocument2 document = null;
    objDocument = objMSHTML.createDocumentFromUrl("http://google.com", null, document);
    see: http://msdn.microsoft.com/en-us/libr...23(VS.85).aspx
    It doesn't work yet.
    Thanks for the answer!
    Any other ideas, bro?
