-
May 31st, 2011, 03:51 PM
#1
Libcurl question
Hi, I'm just getting started with libcurl, using it to retrieve web pages.
My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.
This is the code I'm using:
Code:
int retrievePage(string url, string filePath){
    //libcurl code to retrieve the source code of the url and store it in the file path
    CURL *curl;
    FILE *fp;
    CURLcode res;
    curl = curl_easy_init();
    if(curl){
        fp = fopen(filePath.c_str(), "wb");
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        fclose(fp);
    }
    return 0;
}//end retrievePage
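(The write_data callback isn't shown above; a typical implementation matching the CURLOPT_WRITEFUNCTION signature, which just forwards the received bytes to the FILE * set via CURLOPT_WRITEDATA, would look something like this:)
Code:

```c
#include <stdio.h>

// Callback matching the CURLOPT_WRITEFUNCTION signature. libcurl calls it
// once per chunk of received data; returning the number of bytes handled
// (size * nmemb on success) tells libcurl to keep going.
static size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}
```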
This is a sample url on the web site I'm having trouble with:
http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
The previous code works fine on other similar search.aspx web pages on different sites.
Any hints would be greatly appreciated.
Thanks,
Pya
-
May 31st, 2011, 05:03 PM
#2
Re: Libcurl question
Originally Posted by Pyarone
My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.
This is the code I'm using:
Code:
int retrievePage(string url, string filePath){
    //libcurl code to retrieve the source code of the url and store it in the file path
    CURL *curl;
    FILE *fp;
    CURLcode res;
    curl = curl_easy_init();
    if(curl){
        fp = fopen(filePath.c_str(), "wb");
I have never used libcurl, but the first thing any programmer would do is hard-code the names of the files, to make sure there isn't a simple mistake being made with the parameters you're passing.
Secondly, you didn't check whether the file was opened successfully. If fp happens to be NULL, what will the rest of the code do?
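For instance, a minimal guard along those lines might look like this (the helper name and hard-coded path are just illustrations, not part of the original code):
Code:

```c
#include <stdio.h>

// Open the output file and fail loudly instead of handing a NULL FILE*
// on to libcurl. A hard-coded path like "output.html" is handy while
// debugging, to rule out a mistake in the parameters being passed.
FILE *open_output(const char *path)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        perror("fopen"); // report why the open failed
    return fp;
}
```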
Regards,
Paul McKenzie
-
May 31st, 2011, 06:12 PM
#3
Re: Libcurl question
Thanks for the tips. I'll have to do some polishing at some point, but I'm under time constraints and need to get it pulling from the problem website ASAP. The file writing hasn't been the problem so far: even on the problem pages it generates the output file, but that file contains some sort of error page from the website instead of the source code my web browser shows for the same URL.
-
June 1st, 2011, 08:00 PM
#4
Re: Libcurl question
I see why. Servers will change what they return based on your User-Agent, accepted character sets, and such. I don't see a User-Agent header being sent with your curl request. I'll bet that if you just did
Code:
curl http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
or
Code:
wget http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
you would get exactly what libcurl retrieves.
Adding this code will make the server think you are Firefox:
Code:
struct curl_slist *slist=NULL; // Linked list of request headers
slist = curl_slist_append(slist, "ACCEPT: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5");
slist = curl_slist_append(slist, "ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.7");
slist = curl_slist_append(slist, "ACCEPT_ENCODING: gzip,deflate");
slist = curl_slist_append(slist, "ACCEPT_LANGUAGE: en-gb,en;q=0.5");
slist = curl_slist_append(slist, "CONNECTION: keep-alive");
slist = curl_slist_append(slist, "KEEP_ALIVE: 300");
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, slist); // Set these headers
const char *useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16";
curl_easy_setopt(curl, CURLOPT_USERAGENT, useragent); // Useragent is the last thing to fake
-
June 6th, 2011, 12:53 AM
#5
Re: Libcurl question
Thank you so much for your help! I'll test it as soon as I get to work in the morning, but I'm sure that will work great. You have no idea how much I'll learn from analyzing this stuff, I'm really excited.
Last edited by Pyarone; June 6th, 2011 at 01:07 AM.
-
June 6th, 2011, 02:08 PM
#6
Re: Libcurl question
If you are on Windows, download a tool called Fiddler, it'll help you understand how web requests work, as well as debug the ones you make without having to go through tons of esoteric logs.