  1. #1
    Join Date
    May 2011
    Posts
    3

    Libcurl question

    Hi, I'm just getting started with libcurl, using it to retrieve web pages.

    My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.

    This is the code I'm using:
    Code:
    int retrievePage(string url, string filePath){
    	//libcurl code to retrieve the source code of the url and store it in the file path
    	CURL *curl;
    	FILE *fp;
    	CURLcode res;
    	curl = curl_easy_init();
    	if(curl){
    		fp = fopen(filePath.c_str(), "wb");
    		curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    		curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
    		curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
    		curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L); // long options must be passed as long
    		res = curl_easy_perform(curl);
    		curl_easy_cleanup(curl);
    		fclose(fp);
    	}
    	return 0;
    }//end retrievePage
    This is a sample url on the web site I'm having trouble with:
    http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx

    The previous code works fine on other similar search.aspx web pages on different sites.

    Any hints would be greatly appreciated.

    Thanks,

    Pya

  2. #2
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Libcurl question

    Quote: Originally posted by Pyarone
    My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.

    This is the code I'm using:
    Code:
    int retrievePage(string url, string filePath){
    	//libcurl code to retrieve the source code of the url and store it in the file path
    	CURL *curl;
    	FILE *fp;
    	CURLcode res;
    	curl = curl_easy_init();
    	if(curl){
    		fp = fopen(filePath.c_str(), "wb");
    I have never used libcurl, but the first thing any programmer would do is hard-code the file names, to rule out a simple mistake in the parameters you're passing.

    Secondly, you didn't check if the file was opened successfully. If fp happens to be NULL, what will the rest of the code do?
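
    A minimal sketch of that check, assuming plain C stdio as in the original post. openOutput is a hypothetical helper name, and write_data here is only a guess at the callback, which the post does not show:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not from the original post): open the output file
// and report why it failed instead of silently returning NULL.
FILE *openOutput(const std::string &filePath) {
    FILE *fp = std::fopen(filePath.c_str(), "wb");
    if (fp == NULL) {
        std::fprintf(stderr, "Cannot open %s: %s\n",
                     filePath.c_str(), std::strerror(errno));
    }
    return fp;
}

// Assumed shape of the write_data callback. libcurl calls it with each chunk
// of the response body; returning a count different from size * nmemb tells
// libcurl to abort the transfer.
size_t write_data(char *ptr, size_t size, size_t nmemb, void *userdata) {
    FILE *fp = static_cast<FILE *>(userdata);
    if (fp == NULL) {
        return 0; // no file to write to: signal an error
    }
    return std::fwrite(ptr, 1, size * nmemb, fp);
}
```

    In retrievePage, the idea would be to return early when openOutput gives back NULL, so that curl_easy_perform never runs with a null file pointer.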

    Regards,

    Paul McKenzie

  3. #3
    Join Date
    May 2011
    Posts
    3

    Re: Libcurl question

    Thanks for the tips. I'll have to do some polishing at some point, but I'm under time constraints and need to get it pulling from the problem website ASAP. The file writing has not been a problem so far: even on the problem pages it does generate the file, but the output file contains some sort of error page from the website instead of the source code my web browser shows for the same URL.

  4. #4
    Join Date
    Jan 2009
    Posts
    1,689

    Re: Libcurl question

    I see why. Servers can change what they return based on your User-Agent, accepted character sets, and similar request headers. I don't see a User-Agent header being sent with your curl request. I'll bet if you just did

    Code:
    curl http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
    or
    Code:
    wget http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
    you would get what libcurl retrieves.

    Adding this code will make the server think you are Firefox:
    Code:
    	   struct curl_slist *slist=NULL; // Linked list of request headers
    	   // HTTP header names use hyphens, not underscores
    	   slist = curl_slist_append(slist, "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5");
    	   slist = curl_slist_append(slist, "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    	   slist = curl_slist_append(slist, "Accept-Language: en-gb,en;q=0.5");
    	   slist = curl_slist_append(slist, "Connection: keep-alive");
    	   slist = curl_slist_append(slist, "Keep-Alive: 300");
    	   curl_easy_setopt(curl, CURLOPT_HTTPHEADER, slist); // Set these headers
    	   // Let libcurl send Accept-Encoding itself so that it also decompresses the response
    	   // (needs a libcurl built with zlib; the option is named CURLOPT_ENCODING before 7.21.6)
    	   curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip,deflate");
    	   const char *useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16";
    	   curl_easy_setopt(curl, CURLOPT_USERAGENT, useragent); // Useragent is the last thing to fake
    	   // After curl_easy_perform() has returned, free the list: curl_slist_free_all(slist);

  5. #5
    Join Date
    May 2011
    Posts
    3

    Re: Libcurl question

    Thank you so much for your help! I'll test it as soon as I get to work in the morning, but I'm sure that will work great. You have no idea how much I'll learn from analyzing this stuff, I'm really excited.
    Last edited by Pyarone; June 6th, 2011 at 01:07 AM.

  6. #6
    Join Date
    Jan 2009
    Posts
    1,689

    Re: Libcurl question

    If you are on Windows, download a tool called Fiddler. It will help you understand how web requests work, and debug the ones you make, without having to go through tons of esoteric logs.
