-
May 31st, 2011, 03:51 PM
#1
Libcurl question
Hi, I'm just getting started with libcurl, using it to retrieve web pages.
My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.
This is the code I'm using:
Code:
int retrievePage(string url, string filePath){
    //libcurl code to retrieve the source code of the url and store it in the file path
    CURL *curl;
    FILE *fp;
    CURLcode res;
    curl = curl_easy_init();
    if(curl){
        fp = fopen(filePath.c_str(), "wb");
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        fclose(fp);
    }
    return 0;
}//end retrievePage
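(The write_data callback isn't shown above; a typical implementation matching the CURLOPT_WRITEFUNCTION signature, which just forwards the received bytes to the FILE * set via CURLOPT_WRITEDATA, would look something like this:)
Code:

```c
#include <stdio.h>

// Callback matching the CURLOPT_WRITEFUNCTION signature. libcurl calls it
// once per chunk of received data; returning the number of bytes handled
// (size * nmemb on success) tells libcurl to keep going.
static size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}
```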
This is a sample url on the web site I'm having trouble with:
http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
The previous code works fine on other similar search.aspx web pages on different sites.
Any hints would be greatly appreciated.
Thanks,
Pya
-
May 31st, 2011, 05:03 PM
#2
Re: Libcurl question
Originally Posted by Pyarone
My problem is that when I use the same code on any page of a particular website, the text file that libcurl is generating does not match the source code of the web page.
This is the code I'm using:
Code:
int retrievePage(string url, string filePath){
    //libcurl code to retrieve the source code of the url and store it in the file path
    CURL *curl;
    FILE *fp;
    CURLcode res;
    curl = curl_easy_init();
    if(curl){
        fp = fopen(filePath.c_str(), "wb");
I have never used libcurl, but the first thing any programmer would do is hard-code the names of the files, to make sure there isn't a simple mistake being made with the parameters you're passing.
Secondly, you didn't check whether the file was opened successfully. If fp happens to be NULL, what will the rest of the code do?
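For instance, a minimal guard along those lines might look like this (the helper name and hard-coded path are just illustrations, not part of the original code):
Code:

```c
#include <stdio.h>

// Open the output file and fail loudly instead of handing a NULL FILE*
// on to libcurl. A hard-coded path like "output.html" is handy while
// debugging, to rule out a mistake in the parameters being passed.
FILE *open_output(const char *path)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        perror("fopen"); // report why the open failed
    return fp;
}
```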
Regards,
Paul McKenzie
-
May 31st, 2011, 06:12 PM
#3
Re: Libcurl question
Thanks for the tips. I'll have to do some polishing at some point, but I'm under time constraints and need to get it pulling from the problem website ASAP. The file writing hasn't been the problem so far: even on the problem pages it generates the output file, but that file contains some sort of error page from the website instead of the source code my web browser shows for the same URL.
-
June 1st, 2011, 08:00 PM
#4
Re: Libcurl question
I see why. Servers will change what they return based on your User-Agent, accepted character sets, and such. I don't see a User-Agent header being sent with your curl request. I'll bet that if you just did
Code:
curl http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
or
Code:
wget http://www.pprbd.org/PublicAccess/Pe...essSearch.aspx
you would get exactly what libcurl retrieves.
Adding this code will make the server think you are Firefox:
Code:
struct curl_slist *slist=NULL; // Linked list of request headers
slist = curl_slist_append(slist, "ACCEPT: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5");
slist = curl_slist_append(slist, "ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.7");
slist = curl_slist_append(slist, "ACCEPT_ENCODING: gzip,deflate");
slist = curl_slist_append(slist, "ACCEPT_LANGUAGE: en-gb,en;q=0.5");
slist = curl_slist_append(slist, "CONNECTION: keep-alive");
slist = curl_slist_append(slist, "KEEP_ALIVE: 300");
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, slist); // Set these headers
const char *useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16";
curl_easy_setopt(curl, CURLOPT_USERAGENT, useragent); // Useragent is the last thing to fake
-
June 6th, 2011, 12:53 AM
#5
Re: Libcurl question
Thank you so much for your help! I'll test it as soon as I get to work in the morning, but I'm sure that will work great. You have no idea how much I'll learn from analyzing this stuff, I'm really excited.
Last edited by Pyarone; June 6th, 2011 at 01:07 AM.
-
June 6th, 2011, 02:08 PM
#6
Re: Libcurl question
If you are on Windows, download a tool called Fiddler, it'll help you understand how web requests work, as well as debug the ones you make without having to go through tons of esoteric logs.