-
September 17th, 2010, 04:19 AM
#1
Downloading web page source code
I use .NET 3.5 on Windows Vista. I've been trying to download a certain web page in order to extract some data. I opened the page in my browser (Firefox), selected "View Page Source" and saved it, which gave me the file somename.aspx. Then I wrote code that recognised certain strings in that file, like "rgb (247, 10, 15)", and extracted the data that followed them. Next I wrote some code to download the web page from C#, but the problem is that what I get is unformatted text that contains characters like \n or \r, and those rgb values are converted to "color:#d42d24" or something like that.
Code:
// Requires: using System; using System.IO; using System.Net; using System.Text;
string file;
Console.WriteLine("Getting data from web page");
Uri webFile = new Uri("http://www.druglist.gr/drugs.aspx?title=A");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(webFile);
request.Method = "GET";
// The using blocks dispose the response and the reader even if an exception is thrown
using (WebResponse response = request.GetResponse())
using (StreamReader stream = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
{
    file = stream.ReadToEnd();
}
So the problem is that when I use Firefox to get the web page, the first lines read:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
but when I use C# and store the page in the string "file" I get this:
Code:
\r\n\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n
If I use: Console.WriteLine(file) then I get the "correct" format that I want.
But how can I convert the web page from one format to the other?
-
September 19th, 2010, 03:08 AM
#2
Re: Downloading web page source code
 Originally Posted by dourak
But how can I convert the web page from one format to the other?
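You will have to write your own converter; there is nothing out of the box on the net for this that
I know of. IMO, web browsers should show the actual code of the page instead of their own
interpretation, to avoid confusion.
The code your program sees is the actual source code of the web page you are downloading.
Firefox is converting it into something else, so when you write your code, base it on the
source code your program receives.
The \r\n and \" sequences are most likely just how the debugger displays line breaks and quotes
inside a string. The string itself holds the real characters, which is why Console.WriteLine(file)
prints the format you expect.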
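A quick way to check what the downloaded string really contains is to test it for literal backslash sequences versus real line-break characters. This is only a sketch: the class and method names are made up for illustration, and the sample string stands in for the "file" variable from the question.

```csharp
using System;

public static class EscapeCheck
{
    // True if the text contains a backslash followed by 'r' or 'n' as two literal characters.
    public static bool HasLiteralEscapes(string text)
    {
        return text.Contains("\\r") || text.Contains("\\n");
    }

    // True if the text contains real carriage-return or line-feed characters.
    public static bool HasRealLineBreaks(string text)
    {
        return text.IndexOf('\r') >= 0 || text.IndexOf('\n') >= 0;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the downloaded page; real code would pass the "file" string.
        string file = "\r\n\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\">\r\n";
        Console.WriteLine("Literal backslash escapes: " + HasLiteralEscapes(file));
        Console.WriteLine("Real line breaks: " + HasRealLineBreaks(file));
    }
}
```

If the first check reports False and the second True, the string needs no conversion, and searches such as file.IndexOf("<html") work on it directly.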
To convert the color in the source your program receives, you could try something like this:
Code:
// Requires: using System; using System.Globalization;
string stuff = "d42d24"; // hex digits taken from "color:#d42d24"
// Parse each pair of hex digits as one color channel
int r = int.Parse(stuff.Substring(0, 2), NumberStyles.HexNumber);
int g = int.Parse(stuff.Substring(2, 2), NumberStyles.HexNumber);
int b = int.Parse(stuff.Substring(4, 2), NumberStyles.HexNumber);
string new_stuff = "\"rgb (" + r + "," + g + "," + b + ")\"";
Console.WriteLine(new_stuff); // prints "rgb (212,45,36)", including the quotes
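The same idea can be applied to the whole downloaded page in one pass. The sketch below uses a regular expression to find every six-digit hex color and rewrite it in the "rgb (r, g, b)" form mentioned in the question. The class name, method name and sample markup are hypothetical, and the exact spacing the existing parser expects is an assumption. The pattern will also match things that merely look like colors, such as a link to an anchor named "#abcdef", so narrow it if that matters for the page.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ColorRewriter
{
    // "#" followed by exactly six hex digits, not followed by a seventh hex digit.
    private static readonly Regex HexColor =
        new Regex("#([0-9a-fA-F]{6})(?![0-9a-fA-F])");

    // Returns a copy of html with every #rrggbb replaced by "rgb (r, g, b)".
    public static string HexToRgb(string html)
    {
        return HexColor.Replace(html, m =>
        {
            string hex = m.Groups[1].Value;
            int r = int.Parse(hex.Substring(0, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int g = int.Parse(hex.Substring(2, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int b = int.Parse(hex.Substring(4, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            return string.Format("rgb ({0}, {1}, {2})", r, g, b);
        });
    }

    public static void Main()
    {
        // Hypothetical fragment of the downloaded page.
        string sample = "<td style=\"color:#d42d24\">text</td>";
        Console.WriteLine(HexToRgb(sample));
    }
}
```

Three-digit shorthand colors such as #fff are left untouched by this pattern. It is usually simpler, though, to change the parser to look for the "color:#" form the server actually sends.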