naffets77 Posted January 30, 2008

So I've created a regex for addresses, but I'm pretty new to this and would like to know if there's anything I can do to improve it, or any pitfalls anyone sees. It needs to be able to handle just about any formatting.

$states = "Alabama|AL|Alaska|AK|Arizona|AZ|Arkansas|AR|California|CA"; // ... etc
$streetSuffix = "ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN"; // ... etc

/*
([0-9]{1,6}) : location number
[ ]+
([\w ]+) : any number of words for the street name
\s+
($streetSuffix)\.? : street suffix, may be followed by a .
\s+?
([\w\s]+),? : any secondary info (apt, PO box) and city, parse later
\s+
(".$states.").? : state
\s+
([0-9]{5}-[0-9]{4}|[0-9]{5}) : zip
*/

$regEx = "/([0-9]{1,6})[ ]+([\w ]+)\s+($streetSuffix)\.?\s+?([\w\s]+),?\s+(".$states.").?\s+([0-9]{5}-[0-9]{4}|[0-9]{5})/i";

Feel free to grab this if you need an address parser, and if you find any bugs let me know. Thanks for any suggestions!
naffets77 Posted January 30, 2008

I'm testing out different cases, and I figure there's a possibility that a phone number might be listed before an address:

619.299.8996 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108

The result I'm getting is:

8996 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108

Basically it's being broken down as

addressNumber: 8996
StreetName: 2828 CAMINO DEL RIO SOUTH

and the rest is parsed correctly. Any ideas?
effigy Posted January 30, 2008

* This is only for the US?
* You don't need to repeat the pattern to comment it. Use the /x modifier.
* You need anchors if you're only validating an address string.
* You should combine your alternations for efficiency; for example: (?:A[KLRZ]|Ala(?:ska|bama))
* \w includes an underscore--I doubt any addresses have this.
* What about hyphenated street names? Apartments specified with a #? Apartments with letters in them, e.g., 1R?
* You can reduce ([0-9]{5}-[0-9]{4}|[0-9]{5}) to \d{5}(?:-\d{4})?
* If you know Perl, Geo::PostalAddress may be of interest.
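Several of these suggestions can be tried together. The following is only a minimal sketch, assuming deliberately short, hypothetical `$states` and `$suffixes` lists rather than the full ones: it uses the /x modifier so the comments live inside the pattern, replaces \w with explicit classes that allow hyphens and # but not underscores, and uses the shorter ZIP form.

```php
<?php
// Short, hypothetical lists for illustration only; longer names come first.
$states   = 'Ala(?:ska|bama)|Arizona|Arkansas|California|A[KLRZ]|CA';
$suffixes = 'AVENUE|AVEN|AVE|AV|ALLEY|ALY';

// With the x modifier, whitespace and #-comments inside the pattern are ignored.
$regex = "/
    (\d{1,6})                      # house number
    [ ]+
    ([A-Z0-9][A-Z0-9 -]*?)         # street name: letters, digits, spaces, hyphens
    [ ]+
    ($suffixes)\.?                 # street suffix, optionally followed by a period
    \s+
    ([A-Z0-9\#][A-Z0-9\#\s-]*?)    # secondary info and city, to be split later
    ,?\s+
    ($states)\.?                   # state
    \s+
    (\d{5}(?:-\d{4})?)             # ZIP or ZIP+4
/xi";

$input = '19481 San Jose Ave. Apt 21 City of Industry, CA 91748';
preg_match($regex, $input, $m);
print_r(array_slice($m, 1));
```

One thing to watch with /x: the comments must not contain the delimiter character (here `/`), because PHP looks for the closing delimiter before the pattern is compiled.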
naffets77 Posted January 30, 2008

* This is only for the US?
For now I think US only.

* You don't need to repeat the pattern to comment it. Use the /x modifier.
Didn't know this, thanks.

* You need anchors if you're only validating an address string.
So I'm assuming you mean a ^ at the beginning somewhere? How will this help exactly? Wouldn't the problem with the phone number before the address still occur? I tried ^([0-9]{1,6}) but that just kills the regex altogether and doesn't give me any results. Not sure why.

* You should combine your alternations for efficiency; for example: (?:A[KLRZ]|Ala(?:ska|bama))
This one I know about. I'll probably go through and do that at the very end, once I get the actual expression working.

* \w includes an underscore--I doubt any addresses have this.
What should I use instead, just [a-zA-Z0-9]?

* What about hyphenated street names? Apartments specified with a #? Apartments with letters in them, e.g., 1R?
So basically I can't use \w for this either, and need to explicitly list all the characters?

* You can reduce ([0-9]{5}-[0-9]{4}|[0-9]{5}) to \d{5}(?:-\d{4})?
Will do, thanks.

* If you know Perl, Geo::PostalAddress may be of interest.
Checking it out.

Thanks for the response!
effigy Posted January 30, 2008

If you don't use the ^ and $ anchors, any string will be valid as long as it contains a valid address string, such as "!#%$Axc the address is actually here +~%^."

Have you considered any address products?
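The effect of the anchors is easy to see on a small case. This sketch uses only the ZIP sub-pattern rather than the whole address regex, and the noisy test string is made up for the example.

```php
<?php
// Simplified ZIP-only pattern so the anchoring behaviour is easy to see.
$zip   = '\d{5}(?:-\d{4})?';
$noisy = '!#%$Axc 92108 +~%^';

// Unanchored: succeeds because a valid ZIP appears somewhere in the string.
$unanchored = preg_match('/' . $zip . '/', $noisy);

// Anchored: the whole string must be a ZIP, so the noisy input is rejected.
$anchoredNoisy = preg_match('/^' . $zip . '$/', $noisy);
$anchoredClean = preg_match('/^' . $zip . '$/', '92108-1234');

var_dump($unanchored, $anchoredNoisy, $anchoredClean);
```

Anchors suit validation of a single field. When pulling addresses out of a larger block of text, the unanchored form is the one that finds them.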
naffets77 Posted January 30, 2008

Using the anchors makes the regex stop matching. Maybe I'm not using them correctly:

$regEx = "/^([0-9]{1,6})[ ]+([\w ]+)\s+($streetSuffix)\.?\s+?([\w\s]+),?\s+(".$states.").?\s+([0-9]{5}-[0-9]{4}|[0-9]{5})$/i";

That's saying that the first thing needs to be a digit 0-9, right, and the last thing needs to be either form of the zip? Using that, though, I don't get any results with this as my test data:

$data = " 19481 San Jose Ave. Apt 21 City of Industry, CA 91748 619.299.8996 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108 uh 4001 West Pacific Coast Hwy. Newport Beach, CA 92663";

I checked out the postal products. They looked interesting, but at this point I think I've almost got exactly what I need. Thanks.
effigy Posted January 30, 2008

In your case you don't need the anchors, because you're extracting addresses from a larger string rather than verifying a single address, which is what I thought you were doing.
naffets77 Posted January 31, 2008

Oh, yeah, sorry I didn't make that clear. So is it even possible to handle stuff that comes before the address, like a phone number? I can't ever know for sure what will come before the address, but I would think there's a way to say: okay, there are several numbers here, grab the last one first, and then continue on with the street name and so on?

I've actually come across another difficult problem. I'm storing the apt and number (if there is one) and the city as a single string, and I want to parse those apart as the next step. I'm using

$regEx = "/(\w*[ ]*\w*)[\s]+(.+)/i";

but my problem is that I'm assuming that if there is an apt/PO box etc., it's only going to be two words, e.g. apt 3, apt B. I don't know how long a city name might be, so I want to grab anything after the two words as the city. This works fine if there is an apt and number/letter. However, if it's just a city name it doesn't work, because it tries to break up the city name. Not sure how to handle that sort of thing.
effigy Posted January 31, 2008

"So is it even possible to handle stuff that comes before the address, like a phone number? I can't ever know for sure what will come before the address, but I would think there's a way to say: okay, there are several numbers here, grab the last one first, and then continue on with the street name and so on?"

You can search for it and make it optional in case it doesn't appear: (?:pattern)?

"I've actually come across another difficult problem."

Can you supply some data samples for this?
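As a rough illustration of the (?:pattern)? idea, the sketch below puts an optional, non-capturing phone-number group in front of the house number. The `$phone` pattern and the loose `$tail` are assumptions made for this example and are not the pattern from the thread; the point is only that letting the regex consume the phone number stops its last digits from being taken as the address number.

```php
<?php
$data = '619.299.8996 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108';

// Hypothetical phone pattern: three digit groups separated by dots or dashes.
$phone = '\d{3}[.\-]\d{3}[.\-]\d{4}';

// Loose tail for the sketch: number, street line, two-letter state, ZIP.
$tail = '\b(\d{1,6}) \s+ (.+?) ,\s+ ([A-Z]{2})\.? \s+ (\d{5}(?:-\d{4})?)';

// Without the optional prefix, the first usable number is the end of the phone number.
preg_match('/' . $tail . '/xi', $data, $without);

// With it, the phone number is consumed (not captured) before the house number.
preg_match('/(?:' . $phone . '\s+)?' . $tail . '/xi', $data, $with);

echo "without: {$without[1]} | {$without[2]}\n";
echo "with:    {$with[1]} | {$with[2]}\n";
```

Note that the full match (`$with[0]`) still includes the phone number; only the captured groups exclude it.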
naffets77 Posted February 1, 2008

Using the addresses:

"19482 San Jose Ave. Apt 21 City of Industry, CA 91748 and 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108"

parsing results in

[4] => Array ( [0] => Apt 21 City of Industry [1] => SAN DIEGO [2] => Newport Beach )

and then I'm trying to use another regular expression to separate the "Apt 21" from the city name:

$regEx = "/(\w*[ ]*\w*)[\s]+(.+)/i";

which results in

Array ( [0] => Array ( [0] => Apt 21 City of Industry ) [1] => Array ( [0] => Apt 21 ) [2] => Array ( [0] => City of Industry ) )
Array ( [0] => Array ( [0] => SAN DIEGO ) [1] => Array ( [0] => SAN ) [2] => Array ( [0] => DIEGO ) )

Not sure how to handle this, because I don't want to put a limit on the word count of a city. I am assuming that there will only be two elements for the apt, i.e. "apt 21" etc. It works fine when there is an apt; otherwise it breaks up the city incorrectly.
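One way to avoid splitting a plain city name is to match the unit only when it starts with a known designator, and to make that whole group optional. The sketch below assumes a short, hypothetical list of designators and a helper named `split_unit_city`, neither of which comes from the thread.

```php
<?php
// Hypothetical helper: splits "Apt 21 City of Industry" into unit and city.
function split_unit_city(string $text): array
{
    // Abbreviated, made-up list of unit designators for illustration.
    $unit = 'Apartment|Apt|Suite|Ste|Unit|Bldg';

    $regex = '/^
        (?: ( (?:' . $unit . ') \.? \s+ \#? [A-Z\d]+ ) \s+ )?   # optional unit, e.g. Apt 21
        (.+)                                                     # whatever remains is the city
    $/xi';

    if (!preg_match($regex, $text, $m)) {
        return ['unit' => '', 'city' => $text];
    }
    // An unmatched optional group comes back as an empty string.
    return ['unit' => $m[1], 'city' => $m[2]];
}

foreach (['Apt 21 City of Industry', 'SAN DIEGO', 'Newport Beach'] as $input) {
    $parts = split_unit_city($input);
    printf("unit=[%s] city=[%s]\n", $parts['unit'], $parts['city']);
}
```

Because the unit is recognised by its leading keyword rather than by word count, the city can be any number of words long.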
effigy Posted February 1, 2008 Share Posted February 1, 2008 <pre> <?php ### Data was parsed from http://www.usps.com/ncsc/lookups/abbreviations.html ### and patterns built with Perl's Regexp::Assemble. $states = <<<STATES (??:N(?:[CDHJMVY]|E(?:W (?:HAMPSHIRE|JERSEY|MEXICO|YORK)|(?:BRASK|VAD)A)?|ORTH(?: (?:CAROLIN|DAKOT)A|ERN MARIANA ISLANDS))|M(?:[DEHNPST]|A(?:R(?:SHALL ISLANDS|YLAND)|SSACHUSETTS|INE)?|I(?:SS(?:ISSIPP|OUR)I|NNESOTA|CHIGAN)?|O(?:NTANA)?)|A(?:[KSZ]|L(?:A(?:BAM|SK)A)?|R(?:KANSAS|IZONA)?|MERICAN SAMOA)|F(?:EDERATED STATES OF MICRONESIA|L(?:ORIDA)?|M)|I(?:L(?:LINOIS)?|N(?:DIANA)?|D(?:AHO)?|(?:OW)?A)|C(?(?:NNECTICUT|LORADO)?|(?:ALIFORNI)?A|T)|V(?:I(?:RGIN(?: ISLANDS|IA))?|(?:ERMON)?T|A)|P(?:[RW]|ENNSYLVANIA|UERTO RICO|A(?:LAU)?)|D(?:ISTRICT OF COLUMBIA|(?:ELAWAR)?E|C)|O(?:K(?:LAHOMA)?|R(?:EGON)?|H(?:IO)?)|S(?:[CD]|OUTH (?:CAROLIN|DAKOT)A)|K(??:ENTUCK)?Y|(?:ANSA)?S)|T(?:[NX]|E(?:NNESSEE|XAS))|G(??:EORGI)?A|U(?:AM)?)|R(?:HODE ISLAND|I)|L(?:OUISIAN)?A|H(?:AWAI)?I|UT(?:AH)?) |W(??:A(?:SHINGTON)?|I(?:SCONSIN)?|EST VIRGINIA|V) |Y(?:OMING )?)) STATES; $street_suffixes = <<<SUFFIXES 
(?:C(?:R(?:[KT]|E(?:S(??:C?EN)?T)?|CENT|EK)|S(??:C?N)?T|E(?:NT)?|SI?NG)|OSS(?:ROAD|ING)|CLE?)?|O(?:R(?:NERS?|S)?|UR(?:TS?|SE)|MMON|VES?)|A(?:USE?WAY|NYO?N|MP|PE)|IR(?:C(?:L(?:ES?)?)?|S)?|EN(?:T(?:ERS?|RE?)?)?|L(?:IFFS?|FS?|U?B)|N(?:TE?R|YN)|T(?:RS?|S)?|M[NP]|URVE?|PE?|SWY|VS?|YN|K)|S(?:T(?:[NS]|R(?:[MT]|A(?:V(?:E(?:N(?:UE)?)?|N)?)?|E(?:ETS?|AM|ME)|VN(?:UE)?)?|A(?:T(?:IO)?N)?)?|H(?(?:A(?:LS?|RS?)|RES?)|LS?|RS?)|P(?:R(?:INGS?|NGS?)|NGS?|URS?|GS?)|Q(?:U(?:ARES?)?|R[ES]?|S)?|(?:UM(?:IT?|MI)|M)T|K(?:YWA|W)Y)|P(?:A(?:RK(?:W(?:AYS?|Y)|S)?|SS(?:AGE)?|THS?)|L(?:A(?:IN(?:E?S)?|CE|ZA)|NS?|ZA?)?|R(?:[KR]|AI?RIE|TS?)?|K(?:W(?:YS?|AY)|Y)?|O(?:INTS?|RTS?)|I(?:KES?|NES?)|NES?|SGE|TS?)|B(?(?:UL(?:EVARD|V)?|T(?:TO?M)?)|R(?:A?NCH|I?DGE|OOKS?|KS?|G)?|Y(?(?:A(?:S?S)?|S)?|U)|L(?:UF(?:FS?)?|FS?|VD)|E(?:ACH|ND)|AYO[OU]|URGS?|GS?|CH|ND|TM)|M(?(?:UNT(?:AINS?|IN)?|TORWAY)|N(?:T(?:AIN|NS?)?|RS?)|E(??:DO)?WS|ADOWS?)|I(?:SS(?:IO)?N|LLS?)|T(?:NS?|IN|WY)?|A(?:NORS?|LL)|DWS?|S?SN|LS?)|T(?:R(?:A(?:C(?:ES?|KS?)|FFICWAY|ILS?|K)|[FW]Y|N?PK|KS?|LS?|CE)?|U(?:N(?:N(?:ELS?|L)|LS?|EL)|RNP(?:IKE|K))|ER(?:R(?:ACE)?)?|HROUGHWAY|PKE?)|R(?:A(?(??:I[AE])?L)?|NCH(?:ES)?|PIDS?|MP)|I(?:V(?:E?R)?|DGES?)|O(?:ADS?|UTE|W)|D(?:G[ES]?|S)?|NCHS?|U[EN]|E?ST|PDS?|TE|VR)|F(?:R(??:(?:EE)?WA?|R)?Y|DS?|GS?|KS?|S?T)|OR(?:G(?:ES?)?|ESTS?|DS?|KS?|T)|L(?:ATS?|DS?|TS?|S)|(?:ERR|W)Y|IELDS?|ALLS?|T)|H(?:A(?:RB(?:ORS?|R)?|VE?N)|I(??:GH)?WA?Y|LLS?)|OL(?:LOWS?|WS?)|L(?:LW|S)?|EIGHTS?|BRS?|RBOR|WA?Y|GTS|TS?|VN)|V(?:I(?:LL(?:AG(?:ES?)?|(?:IAG)?E|G)?|A(?:DU?CT)?|S(?:TA?)?|EWS?)|L(?:GS?|YS?|LY)?|ALL(?:EYS?|Y)|STA?|DCT|WS?)|G(?:R(?(?:NS?|EN)|OV(?:ES?)?|EENS?|NS?|VS?)|A(?:T(?:EWA?|WA)Y|RD(?:ENS?|N))|L(?:ENS?|NS?)|TWA?Y|DNS?)|L(?:A(?:N(?(?:ING)?|ES?)|KES?)?|O(?:CKS?|DGE?|OPS?|AF)|N(?:DN?G)?|IGHTS?|CKS?|DGE?|GTS?|KS?|F)|E(?:X(?(?:[WY]|R(?:ESS(?:WAY)?)?)?|T(??:NS)?N|ENSIONS?|S)?)|ST(?:ATES?|S)?)|A(?:V(?:E(?:N(?:UE?)?)?|N(?:UE)?)?|L(?:L(?:E[EY]|Y)|Y)|RC(?:ADE)?|NN?E?X)|D(?:[LM]|R(?:[sV]|IV(?:ES?)?)?|IV(?:IDE)?|A(?:LE|M)|VD?)|J(?:UNCT(?:IONS?|O?N)|C
T(?:ION|NS?|S)?)|I(?:S(?:L(?:ANDS?|NDS?|ES?)|S)?|NLE?T)|O(?:V(?:ERPASS|A?L)|RCH(?:A?RD)?|PAS)|W(?:A(?:L(?:KS?|L)|YS?)|ELLS?|LS?|Y)|K(?:N(?:OL(?:LS?)?|LS?)|EYS?|YS?)|U(?:N(??:DERPAS)?S|IONS?)?|PAS)|X(?:ING|RD)|NE?CK) SUFFIXES; $unit_desig = <<<UNIT_DESIG (?:S(??:UIT|ID)E|P(?:ACE|C)|T(?:OP|E)|LIP)|B(??:ASEMEN|SM)T|(?:UILDIN|LD)G)|(?:H(?:ANGA|NG)|TR(?:AILE|L))R|L(?(?:WE?R|BBY|T)|BBY)|(?:DE|A)P(?:ARTMEN)?T|F(?:L(?:OOR)?|RO?NT)|P(?:ENTHOUSE|IER|H)|R(??:OO)?M|EAR)|U(?:PPE?R|NIT)|OF(?:FICE|C)) UNIT_DESIG; $data = '19482 San Jose Ave. Apt 21 City of Industry, CA 91748 and 2828 CAMINO DEL RIO SOUTH SAN DIEGO, CA. 92108'; $regex= "/ (\d{1,6}) ### Number. \s+ ([A-Z\s]+?) ### Street. \s+ ($street_suffixes\.?) ### Suffix. \s+ (?: ### Optional Unit. ( $unit_desig \.? \s+ \#?[A-Z\d]+ ) ,? )? \s+ ([A-Z\s]+) ### City. ,? \s+ ($states) \.? \s+ (\d{5}(?:-\d{4})?) ### Zip Code. /xi"; preg_match_all($regex, $data, $matches); print_r($matches); ?> </pre> Returns: Array ( [0] => Array ( [0] => 19482 San Jose Ave. Apt 21 City of Industry, CA 91748 ) [1] => Array ( [0] => 19482 ) [2] => Array ( [0] => San Jose ) [3] => Array ( [0] => Ave. ) [4] => Array ( [0] => Apt 21 ) [5] => Array ( [0] => City of Industry ) [6] => Array ( [0] => CA ) [7] => Array ( [0] => 91748 ) ) Quote Link to comment Share on other sites More sharing options...
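For experimenting with the structure of this pattern without the full assembled lists, here is a condensed sketch. The short `$states`, `$suffixes` and `$units` lists are hypothetical stand-ins, and the test data is adapted from the thread. It also makes two changes that are assumptions of this sketch and differ from the pattern as posted: multi-word names use `[ ]`, since the x modifier ignores bare spaces in the pattern, and the whitespace before the unit is folded into the optional group so that an address without a unit needs only one space after the suffix.

```php
<?php
// Hypothetical short lists; [ ] is used because the x modifier ignores bare spaces.
$states   = 'CA|California|NY|New[ ]York';
$suffixes = 'AVENUE|AVE|STREET|ST|HIGHWAY|HWY';
$units    = 'APARTMENT|APT|SUITE|STE|UNIT';

$regex = "/
    (\d{1,6})                                       # number
    \s+ ([A-Z\s]+?)                                 # street
    \s+ ((?:$suffixes)\.?)                          # suffix
    (?: \s+ ((?:$units)\.? \s+ \#?[A-Z\d]+) ,? )?   # optional unit
    \s+ ([A-Z\s]+)                                  # city
    ,? \s+ ($states) \.?
    \s+ (\d{5}(?:-\d{4})?)                          # ZIP
/xi";

$data = '19482 San Jose Ave. Apt 21 City of Industry, CA 91748 and '
      . '4001 West Pacific Coast Hwy. Newport Beach, CA 92663';

preg_match_all($regex, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    printf("%s | %s | %s | %s | %s | %s | %s\n",
        $m[1], $m[2], $m[3], $m[4] === '' ? '(no unit)' : $m[4], $m[5], $m[6], $m[7]);
}
```

Addresses that lack a recognised suffix, such as "2828 CAMINO DEL RIO SOUTH", would still not match a pattern of this shape.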
naffets77 Posted February 1, 2008

Wow, that's really awesome. I'm going to test it out. Thanks!