-
February 17th, 2004, 06:26 PM
#1
Split/Parse House Number From Address String
Hi Everyone,
I would like to split and parse the house number out of the address field.
Could anyone give me some advice on how to split them?
Here are some examples:
=============================================
INPUT
=============================================
301-A BROADWAY DR
3009 BROADWAY DR
31-11-06 BROADWAY DR
33A-27-1 BROADWAY DR
33B-1 BROADWAY DR
4/5 BROADWAY DR
BLOK L32 205 BROADWAY DR 4
BT 19 3429 BROADWAY DR
BT 21-3/4 BROADWAY DR
BT 3 1/2 BROADWAY DR
C1 LOT 1995 BROADWAY DR
LOT 2353-10 BROADWAY DR
=============================================
OUTPUT
=============================================
HOUSE_NO : 301-A
STREET_NAME: BROADWAY DR
HOUSE_NO : 3009
STREET_NAME: BROADWAY DR
HOUSE_NO : 31-11-06
STREET_NAME: BROADWAY DR
HOUSE_NO : 33A-27-1
STREET_NAME: BROADWAY DR
HOUSE_NO : 33B-1
STREET_NAME: BROADWAY DR
HOUSE_NO : 4/5
STREET_NAME: BROADWAY DR
HOUSE_NO : BLOK L32 205
STREET_NAME: BROADWAY DR
HOUSE_NO : BT 19 3429
STREET_NAME: BROADWAY DR
HOUSE_NO : BT 21-3/4
STREET_NAME: BROADWAY DR
HOUSE_NO : BT 3 1/2
STREET_NAME: BROADWAY DR
HOUSE_NO : C1 LOT 1995
STREET_NAME: BROADWAY DR
HOUSE_NO : LOT 2353-10
STREET_NAME: BROADWAY DR
==============================================
Thanks in advance for any help.
William
Last edited by wsee; February 17th, 2004 at 06:56 PM.
-
February 18th, 2004, 04:04 PM
#2
If you're using C++
Try using stringstream (the older strstream is deprecated). It is designed for this kind of string handling: you can extract whitespace-separated fields with just the >> extractor.
Code:
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main() {
    string aa, bb, cc;
    stringstream strtest;
    strtest << "aaa bbb ccc ";
    strtest >> aa >> bb >> cc;  // each >> reads one whitespace-delimited field
    cout << "# " << aa << "# " << bb << "# " << cc;
}
If you're using C, you can use fscanf with some additional work. Look at the fscanf threads and you will find out how to use it.
Regards
Rabelo
-
February 18th, 2004, 04:29 PM
#3
All of your strings contain BROADWAY DR. If I were you, I would think about taking those two words out first.
Use the find() function to locate them, and keep the part in front of them together in one place.
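A rough sketch of that idea, assuming the street name is known in advance. The names SplitAddress and split_at_street are made up for illustration; only std::string::find, substr, find_last_not_of and erase come from the standard library.

```cpp
#include <string>

// Hypothetical result type: `found` is false when the street is not present.
struct SplitAddress {
    bool found;
    std::string house_no;
    std::string street_name;
};

// Splits `address` at the first occurrence of `street`.
// Everything in front of the street name is treated as the house number.
SplitAddress split_at_street(const std::string& address,
                             const std::string& street) {
    SplitAddress result{false, "", ""};
    const std::string::size_type pos = address.find(street);
    if (pos == std::string::npos) {
        return result;  // street name not found, nothing to split
    }
    std::string house_no = address.substr(0, pos);
    // Drop the spaces left between the house number and the street name.
    const std::string::size_type last = house_no.find_last_not_of(' ');
    house_no.erase(last == std::string::npos ? 0 : last + 1);
    result.found = true;
    result.house_no = house_no;
    result.street_name = street;
    return result;
}
```

For "BT 3 1/2 BROADWAY DR" and the street "BROADWAY DR" this gives the house number "BT 3 1/2". It only works while the street name is fixed and known, which will not be the case for a real address file.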
-
February 19th, 2004, 09:22 AM
#4
I once wrote a parser to extract tools and materials from manufacturing operations written by manufacturing engineers for building aircraft such as the ATF. I don't know for sure whether an early prototype of the ATF was among those processed by my program, but it probably was. My program ran in the classified area, so I did not know what data it processed, only that the data was written by people for use by people. I wrote my programs from specifications written by a manufacturing engineer who analyzed the data. He provided me with rules specifying which keywords to search for and which formats to parse. We ran the programs several times so he could analyze the successes and failures, and so that the specifications and my implementation of them could give more accurate results. The project took about a year to complete.
Since then I have written programs that parse name and address data and other data such as reports.
So you will probably need to write your program so that it recognizes various keywords in specific formats. If it were me, I would make the processing conservative, in the sense that it is relatively strict about the formats. Reject anything that does not conform to the rules, then run the program and look at the successes and failures. Analyze the failures to determine whether there are further rules that can be used to process more of the data. There will of course be some data that is not worth processing by your program and that will have to be processed by a person.
Some keywords you probably need to recognize are BLOK, BT and LOT.
I assume this is a one-time conversion and that in the future there will be better control of the data. The rules that apply to the existing data probably depend on the data you are processing; that is, what works for data from one company or organization probably does not work for data from some other company or organization. The data must be analyzed for the patterns it uses, which are probably different from those in other name and address data.
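Here is a minimal sketch of what such a conservative parser could look like for the sample data. The rules are my assumptions, not a specification: a token belongs to the house number if it is one of the keywords BLOK, BT or LOT or if it contains a digit, the house number must come first, and anything that does not fit is rejected for manual review. ParsedAddress and parse_address are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct ParsedAddress {
    bool accepted;  // false means: leave this record for a person
    std::string house_no;
    std::string street_name;
};

static bool has_digit(const std::string& token) {
    for (unsigned char c : token) {
        if (std::isdigit(c)) return true;
    }
    return false;
}

// Joins tokens[first, last) with single spaces.
static std::string join(const std::vector<std::string>& tokens,
                        std::size_t first, std::size_t last) {
    std::string out;
    for (std::size_t i = first; i < last; ++i) {
        if (!out.empty()) out += ' ';
        out += tokens[i];
    }
    return out;
}

ParsedAddress parse_address(const std::string& address) {
    // Assumed keyword list, taken from the sample data only.
    static const std::set<std::string> keywords = {"BLOK", "BT", "LOT"};

    std::vector<std::string> tokens;
    std::istringstream in(address);
    for (std::string t; in >> t;) tokens.push_back(t);

    // The house number is the leading run of keyword or digit-bearing tokens.
    std::size_t split = 0;
    while (split < tokens.size() &&
           (keywords.count(tokens[split]) > 0 || has_digit(tokens[split]))) {
        ++split;
    }

    ParsedAddress result{false, "", ""};
    // Strict: both parts must be present.
    if (split == 0 || split == tokens.size()) return result;
    // Strict: a digit after the street name started is ambiguous, so reject.
    for (std::size_t i = split; i < tokens.size(); ++i) {
        if (has_digit(tokens[i])) return result;
    }
    result.accepted = true;
    result.house_no = join(tokens, 0, split);
    result.street_name = join(tokens, split, tokens.size());
    return result;
}
```

With these rules "C1 LOT 1995 BROADWAY DR" is accepted as "C1 LOT 1995" plus "BROADWAY DR", while "BLOK L32 205 BROADWAY DR 4" is rejected because of the trailing 4. A street such as "5TH AVE" would be misread as part of the house number, which is exactly the kind of failure you would find when analyzing the results and then handle with another rule.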
-
February 19th, 2004, 10:08 PM
#5
Depending on the amount of data, it might not be worthwhile to write a program to parse it. If you have many thousands of addresses to parse, then you might want to begin by writing an analysis program. You could write out each word, one word per record, with each record carrying identification data such as an account number or record number, so that each word can be linked back to the original address. When I say word, I mean any data separated by spaces. That data could then be put into a database that a non-programmer could analyze. You could sort by word so that, for example, all occurrences of Broadway would be sorted together. You could summarize by the count of occurrences of each word. That could help you to identify keywords and probably spelling errors. If spelling errors are sufficiently common, they could be corrected as part of the conversion. If errors are not common, then perhaps they could be corrected manually prior to the conversion.
Regardless of what you do, it is likely to require much more time than most people expect, especially non-programmers. Getting non-programmers involved will likely help them understand how much work is actually involved.
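The analysis step could be sketched roughly like this. WordRecord, explode_words and count_words are names I made up, and the record number here is simply the position of the address in the input, which is an assumption; a real file would use its own account or record key.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One word per record, with a number that links back to the source address.
struct WordRecord {
    std::size_t record_no;  // 1-based position of the address in the input
    std::string word;
};

// Writes out every space-separated word of every address as its own record.
std::vector<WordRecord> explode_words(const std::vector<std::string>& addresses) {
    std::vector<WordRecord> records;
    for (std::size_t i = 0; i < addresses.size(); ++i) {
        std::istringstream in(addresses[i]);
        for (std::string word; in >> word;) {
            records.push_back(WordRecord{i + 1, word});
        }
    }
    return records;
}

// Summarizes the records by word. std::map keeps its keys sorted, so all
// occurrences of the same word are counted together in alphabetical order.
std::map<std::string, std::size_t> count_words(
        const std::vector<WordRecord>& records) {
    std::map<std::string, std::size_t> counts;
    for (const WordRecord& r : records) {
        ++counts[r.word];
    }
    return counts;
}
```

Words with a very high count are keyword candidates, and words with a count of one or two that look almost like a common word are candidate spelling errors. In practice you would probably load the records into a database or spreadsheet rather than keep them in memory.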
-
February 24th, 2004, 02:25 PM
#6
I once worked on a system for the Coast Guard that parsed hand-typed messages and had to locate many different fields.
We used a "blackboard system", which is basically two-phase parsing. The first phase extracts the information in a mechanical parsing sort of way, and the second uses a rule-based knowledge engine to go through the parsed tidbits and decide that this number is a GPS location, that number is the third waypoint on a journey, etc.
It's a bunch of work, but it's much easier to develop and maintain in the long run than basically hard coding a bunch of special cases in C code.
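To give a feel for the two phases applied to the address problem, here is a very reduced sketch. It is not a real blackboard system or rule engine: the "board" is just a vector, and the rules are hard-wired assumptions based on the sample addresses in this thread. All names (Kind, Role, Entry, tokenize, apply_rules, collect) are made up for illustration.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

enum class Kind { Keyword, Numeric, Word };                 // set by phase 1
enum class Role { Unassigned, HouseNo, Street, Trailing };  // set by phase 2

struct Entry {
    std::string text;
    Kind kind;
    Role role;
};

// Phase 1: mechanical parsing. Split on spaces and record what each token
// looks like, without deciding what it means.
std::vector<Entry> tokenize(const std::string& address) {
    std::vector<Entry> board;
    std::istringstream in(address);
    for (std::string t; in >> t;) {
        Kind kind = Kind::Word;
        if (t == "BLOK" || t == "BT" || t == "LOT") {
            kind = Kind::Keyword;
        } else {
            for (unsigned char c : t) {
                if (std::isdigit(c)) { kind = Kind::Numeric; break; }
            }
        }
        board.push_back(Entry{t, kind, Role::Unassigned});
    }
    return board;
}

// Phase 2: rules decide what each token means from its kind and position.
void apply_rules(std::vector<Entry>& board) {
    bool seen_street = false;
    for (Entry& e : board) {
        if (e.kind == Kind::Word) {
            e.role = Role::Street;    // plain words form the street name
            seen_street = true;
        } else if (!seen_street) {
            e.role = Role::HouseNo;   // keywords and numbers before the street
        } else {
            e.role = Role::Trailing;  // numbers after the street, e.g. a unit
        }
    }
}

// Collects the text of all entries that were given `role`.
std::string collect(const std::vector<Entry>& board, Role role) {
    std::string out;
    for (const Entry& e : board) {
        if (e.role != role) continue;
        if (!out.empty()) out += ' ';
        out += e.text;
    }
    return out;
}
```

For "BLOK L32 205 BROADWAY DR 4" the house number comes out as "BLOK L32 205", the street as "BROADWAY DR" and the 4 is set aside as trailing. The point of the split is that new rules go into the second phase without touching the tokenizer.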
His problem is not a trivial one to solve.