how to detect URLs in a string?


pred
May 8th, 1999, 04:01 AM
I'm having trouble detecting URLs in a given string; I need to know each URL's location in the string. I wrote a small program, but it doesn't do what I need.

For example, given a string like "example: www.codeguru.com and www.microsoft.com/msn, two URLs",

the program should be able to detect "www.codeguru.com" and "www.microsoft.com/msn". Can anyone give me a hint? Thanks a lot.

Jason Brooks
May 8th, 1999, 08:30 PM
Try using a known delimiter in your string, such as putting quotes or commas around each URL.

Jason
http://www.netcomuk.co.uk/~jbrooks

pred
May 8th, 1999, 08:48 PM
Actually, these strings come from other data sources; I don't compose them myself.
There may be no delimiters in them, and the program should still be able to detect the URLs.

Todd Jeffreys
May 8th, 1999, 11:41 PM
What is this, for school? Just search for "www" and, if you're really brave, check whether it ends in ".com", ".edu", etc.
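
A minimal sketch of the "search for www" idea, reporting each hit as an offset and length so the location in the string is known. The function name findWwwTokens is hypothetical, and trimming trailing punctuation is an assumption about what the caller wants; the suffix check (".com", ".edu", ...) is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): returns (offset, length) for each
// whitespace-delimited word that starts with "www.".
std::vector<std::pair<std::size_t, std::size_t>> findWwwTokens(const std::string& text) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    std::size_t pos = 0;
    while ((pos = text.find("www.", pos)) != std::string::npos) {
        // Skip matches that sit in the middle of a word.
        if (pos > 0 && !std::isspace(static_cast<unsigned char>(text[pos - 1]))) {
            pos += 4;
            continue;
        }
        // The token runs up to the next whitespace character.
        std::size_t end = pos;
        while (end < text.size() && !std::isspace(static_cast<unsigned char>(text[end]))) {
            ++end;
        }
        // Assumption: trailing punctuation belongs to the sentence, not the URL.
        while (end > pos + 4 && std::ispunct(static_cast<unsigned char>(text[end - 1]))) {
            --end;
        }
        // Ignore a bare "www." with nothing after it.
        if (end > pos + 4) {
            hits.emplace_back(pos, end - pos);
        }
        pos = end;
    }
    return hits;
}
```

Calling text.substr(hit.first, hit.second) on each result gives back the matched token. This only finds hosts that happen to start with "www."; anything else is missed.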

pred
May 9th, 1999, 02:36 AM
My god, there are thousands of combinations. If the string looks like "given URL is:enjoy.za.net, mail to me:pred@126.com",
it's hard to tell that it contains "http://enjoy.za.net" and "mailto:pred@126.com".
But that is exactly what I want.
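
Once a bare token such as enjoy.za.net or pred@126.com has been isolated, turning it into the standard form is the easy part. A rough sketch, assuming a hypothetical helper addScheme and the simple rule "contains @ means mail, otherwise web":

```cpp
#include <string>

// Hypothetical helper (not from the thread): guesses a scheme for a token
// that was written without one. Tokens that already carry a scheme are
// returned unchanged.
std::string addScheme(const std::string& token) {
    const bool hasScheme = token.find("://") != std::string::npos ||
                           token.compare(0, 7, "mailto:") == 0;
    if (hasScheme) {
        return token;
    }
    // Assumption: anything containing '@' is an email address.
    if (token.find('@') != std::string::npos) {
        return "mailto:" + token;
    }
    // Assumption: everything else is a web address.
    return "http://" + token;
}
```

The hard part, deciding where the token starts and ends inside the surrounding text, is not addressed here.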

Stefan Tchekanov
May 9th, 1999, 04:30 AM
You are making this harder than it needs to be.
www.codeguru.com is not a URL, and neither is www.microsoft.com/msn.

The URLs are:
http://www.codeguru.com
and
http://www.microsoft.com/msn

There is an RFC that specifies what URLs look like:
http://sunsite.auc.dk/RFC/rfc/rfc1738.html

Here are some sites where you could search for other RFCs.
http://www.ietf.org/1id-abstracts.html
http://www.globecom.net/(nocl,sv)/ietf/index.shtml
http://sunsite.auc.dk/RFC/

I hope this helps.

pred
May 9th, 1999, 04:59 AM
Yes, I know what you mean, and I do know the definition and the standard form of URLs.

Right now I'm writing a telnet client. People sometimes post articles that include URLs, but they often leave off the "http://", "mailto:", or other prefix.

I want my program to recognise all of these URLs even without the standard form, so that clicking on one launches the appropriate program to handle it.

That's why I asked the question.

Jason Brooks
May 9th, 1999, 06:12 AM
Then I'm afraid you're going to have to resort to some clever programming on your part. If you're not pulling in text in standard notation (and I suspect it's coming from something like bulk-mailer-type programs), then you're going to have to "go to it"!

Good luck

Matt Cawley
May 9th, 1999, 02:56 PM
I don't know if it'll be any use, but you could have a look at the MFC helper function AfxParseURL, and also
the SDK functions InternetCanonicalizeUrl and InternetCrackUrl.

Matt Cawley

Colin Davies
May 9th, 1999, 07:40 PM
I agree 100% with your reply, Todd.
The only thing is you have to remember the international "suffixes" as well,
e.g. .ad .af .ag .ai .am .an .ao .aq .ar .as .at .au .aw and .az, and that's just the a's :-)

At the Mount

sally
May 10th, 1999, 07:56 AM
1) URL: Any 'word' that has a character followed by a dot followed by another character is part of a URL, so www.codeguru.com satisfies my condition, but www. codeguru. com does not, and the dot at the end of this sentence is followed by a space before the next sentence, so that doesn't make a URL. Easy?

2) Email: Any 'word' that has a character followed by an at-symbol followed by another character is part of an email address, so sally@theworld.com is an email address, but 12 pants @ $40 each does not fit the criteria.

These two 'algorithms' should work.

Sally
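
The two rules above can be sketched roughly as follows, applied to one whitespace-delimited word at a time. The names Kind, hasInfix and classify are hypothetical, and requiring the neighbouring characters to be letters or digits is an added assumption (the post just says "character").

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class Kind { Plain, Url, Email };

// True if `sep` appears in `word` with a letter or digit directly on both sides.
bool hasInfix(const std::string& word, char sep) {
    for (std::size_t i = 1; i + 1 < word.size(); ++i) {
        if (word[i] == sep &&
            std::isalnum(static_cast<unsigned char>(word[i - 1])) &&
            std::isalnum(static_cast<unsigned char>(word[i + 1]))) {
            return true;
        }
    }
    return false;
}

// Rule 2 is checked first, because an email address also contains a dot.
Kind classify(const std::string& word) {
    if (hasInfix(word, '@')) {
        return Kind::Email;
    }
    if (hasInfix(word, '.')) {
        return Kind::Url;
    }
    return Kind::Plain;
}
```

A word like "sentence." or a lone "@" comes back as Plain, as described. The rules are deliberately loose, though: a number such as 3.14 is also classified as a URL.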

pred
May 10th, 1999, 09:42 AM
Yes, we can use this method to detect URLs, but let's think about some special situations. For example,
we know "http://www.codeguru.com:80/i.e c/visual c++" is a URL. If it appears in
a string like "an example string: www.codeguru.com:80/i.e c/visual c++, some other part", it may be quite difficult to detect the URL.

sally
May 10th, 1999, 08:39 PM
You said it:

it may be quite difficult to detect the URL.

And that's the answer to this thread, because you are trying to detect a pattern in a text where there is no pattern.

Force the users to use http:// etc., and once they realise that their URLs aren't detected, they'll start using the correct notation. mrhpf, maybe I have been using Windows and Microsoft programs for too long, hihi

Sally