
address regex, more or less done


naffets77


So I've created a regex for addresses, but I'm pretty new to this and would like to know if there's anything I can do to improve it, or any pitfalls anyone sees. It needs to be able to handle just about any formatting.

 


$states =     "Alabama|AL|Alaska|AK|Arizona|AZ|Arkansas|AR|California|CA"; // ... etc
$streetSuffix = "ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN"; // ...etc

/*
	([0-9]{1,6})			: location #'s
	[ ]+
	([\w ]+)			: any number of words for the street name
	\s+
	($streetSuffix)\.?		: street suffix, may be followed by a .
	\s+?
	([\w\s]+),?			 : any secondary info (apt, po box) and City, parse later
	\s+
	(".$states.").?			  : state
	\s+
	([0-9]{5}-[0-9]{4}|[0-9]{5})	: zip

*/



$regEx = "/([0-9]{1,6})[ ]+([\w ]+)\s+($streetSuffix)\.?\s+?([\w\s]+),?\s+(".$states.").?\s+([0-9]{5}-[0-9]{4}|[0-9]{5})/i";

 

 

Feel free to grab this if you need an address parser, and if you find any bugs let me know.

 

Thanks for any suggestions!

 


So I'm testing out different cases, and I figure there's a possibility a phone number might be listed before an address :

 

619.299.8996 2828 CAMINO DEL RIO SOUTH  SAN DIEGO, CA. 92108

 

and I'm getting this as a result:

 

8996 2828 CAMINO DEL RIO SOUTH  SAN DIEGO, CA. 92108

 

basically it's being broken down as

 

addressNumber: 8996

StreetName    : 2828 CAMINO DEL RIO SOUTH

 

and the rest is parsed correctly..

 

Any ideas???
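
One possible way to keep the tail of a phone number from being picked up as the house number is a negative lookbehind in front of the number group. The sketch below is only an illustration: `$core` is a simplified stand-in for the full pattern, with a single hard-coded state, and the variable names are made up for this example.

```php
<?php
// Simplified stand-in for the full pattern: number, street and city, state, ZIP.
// The single hard-coded state is an assumption made to keep the sketch short.
$core = '(\d{1,6})[ ]+([\w ]+?),?\s+(CA)\.?\s+(\d{5}(?:-\d{4})?)';

$data = '619.299.8996 2828 CAMINO DEL RIO SOUTH  SAN DIEGO, CA. 92108';

// Without a guard, the last block of the phone number is taken as the house number.
preg_match('/' . $core . '/i', $data, $plain);

// (?<![\d.-]) is a negative lookbehind: the number may not be preceded by
// a digit, a period, or a hyphen, so it cannot start inside "619.299.8996".
preg_match('/(?<![\d.-])' . $core . '/i', $data, $guarded);

echo $plain[1], "\n";    // 8996
echo $guarded[1], "\n";  // 2828
```

The lookbehind only rules out numbers glued to a preceding digit, period, or hyphen; other kinds of junk before the address would need their own handling.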

 

 


  • This is only for the US?
  • You don't need to repeat the pattern to comment it. Use the /x modifier.
  • You need anchors if you're only validating an address string.
  • You should combine your alternations for efficiency; for example: (?:A[KLRZ]|Ala(?:ska|bama))
  • \w includes an underscore--I doubt any addresses have this.
  • What about hyphenated street names? Apartments specified with a #? Apartments with letters in them, e.g., 1R?
  • You can reduce ([0-9]{5}-[0-9]{4}|[0-9]{5}) to \d{5}(?:-\d{4})?
  • If you know Perl, Geo::PostalAddress may be of interest.
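
A minimal sketch of how several of these suggestions might look together in PHP, assuming the pattern is used to validate a single address string. The shortened `$states` and `$streetSuffix` lists and the character classes are placeholders chosen for illustration, not the full data from the original post.

```php
<?php
// Hypothetical, shortened lists; the real ones would cover every state and suffix.
$states = 'CA|NY|TX';
$streetSuffix = 'AVE|ST|HWY';

// With the x modifier, whitespace in the pattern is ignored and "#" starts
// a comment, so the pattern can document itself.
$regEx = "/
    ^\s*
    (\d{1,6})               # house number
    \s+
    ([A-Z0-9\s-]+?)         # street name: letters, digits, whitespace, hyphens (no underscore)
    \s+
    ($streetSuffix)\.?      # street suffix, optionally followed by a period
    \s+
    ([A-Z0-9\s\#-]+?),?     # secondary info and city, split later
    \s+
    ($states)\.?            # state
    \s+
    (\d{5}(?:-\d{4})?)      # ZIP or ZIP+4
    \s*$
/xi";

$ok = preg_match($regEx, '19482 San Jose Ave. Apt 21 City of Industry, CA 91748', $m);
print_r($m);
```

Because of the `^` and `$` anchors, this version only accepts a string that is one address and nothing else; it is not meant for pulling addresses out of a longer text.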


    * This is only for the US?

 

      For now I think US only.

 

    * You don't need to repeat the pattern to comment it. Use the /x modifier.

 

      Didn't know this, thanks.

 

    * You need anchors if you're only validating an address string.

 

      So I'm assuming you mean a ^ at the beginning somewhere? How will this help exactly? Wouldn't the problem with the phone number before the address still occur?

      I tried

 ^([0-9]{1,6})

but that just kills the regex altogether and doesn't give me any results.. not sure why.

 

    * You should combine your alternations for efficiency; for example:

      (?:A[KLRZ]|Ala(?:ska|bama))

 

      This one I know about. I'll probably go through and do that at the very end, once I get the actual expression working.

 

    * \w includes an underscore--I doubt any addresses have this.

 

      What should I use instead, just the [a-zA-Z0-9] thing?

 

    * What about hyphenated street names? Apartments specified with a #? Apartments with letters in them, e.g., 1R?

 

      So basically I can't use \w for this either, and need to explicitly list all the characters?

 

    * You can reduce ([0-9]{5}-[0-9]{4}|[0-9]{5}) to \d{5}(?:-\d{4})?

 

      Will do, thanks.

 

    * If you know Perl, Geo::PostalAddress may be of interest.

 

      Checking it out.

 

 

Thanks for the response!


Using the anchors makes the regex stop working. Maybe I'm not using them correctly..

 

$regEx = "/^([0-9]{1,6})[ ]+([\w ]+)\s+($streetSuffix)\.?\s+?([\w\s]+),?\s+(".$states.").?\s+([0-9]{5}-[0-9]{4}|[0-9]{5})$/i";

 

That's saying that the first thing needs to be a digit 0-9, right, and the last thing needs to be one of the two zip formats?

Using that, though, I don't get any results with this as my test data:

 

$data = " 19481 San Jose Ave. Apt 21 City of Industry, CA 91748 619.299.8996 2828 CAMINO DEL RIO SOUTH   SAN DIEGO, CA. 92108 uh  4001 West Pacific Coast Hwy.        Newport Beach, CA 92663";

 

I checked out the postal products, they looked interesting but at this point I think I've almost got exactly what I need.

 

Thanks
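
For what it's worth, `^` and `$` tie the match to the start and end of the whole string, so an anchored pattern can only succeed when the string is exactly one address. A rough sketch of the difference, using a simplified stand-in pattern (`$core`, with one hard-coded state) and a shortened sample string, both of which are assumptions for illustration:

```php
<?php
// Simplified stand-in pattern; the single state is an assumption.
$core = '(\d{1,6})[ ]+([\w ]+?),?\s+(CA)\.?\s+(\d{5}(?:-\d{4})?)';
$data = ' 19481 San Jose Ave Apt 21 City of Industry, CA 91748 uh 2828 CAMINO DEL RIO SOUTH   SAN DIEGO, CA. 92108';

// Anchored: the whole string would have to be exactly one address,
// so the leading space and the second address make it fail.
$anchored = preg_match('/^' . $core . '$/i', $data);

// Unanchored: every address found anywhere in the string.
$found = preg_match_all('/' . $core . '/i', $data, $matches);

echo $anchored, "\n";  // 0
echo $found, "\n";     // 2
print_r($matches[1]);  // the two house numbers
```

Anchors are the right tool for validating one field from a form; for scanning free text, `preg_match_all` without anchors is the closer fit.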


Oooh, yeah, sorry I didn't make that clear. So is it even possible to handle stuff before the address, like a phone number? I can never know for sure what will come before the address, but I would think there's a way to say: okay, there are several 5-digit numbers, but grab the last one first, and then continue on to the street name and so on?

 

I've actually come across another difficult problem. I'm storing the apt and number (if there is one) and the city as a single string, and I want to parse those apart as the next step.

 

I'm using

 

$regEx = "/(\w*[ ]*\w*)[\s]+(.+)/i";

 

but my problem is that I'm assuming that if there is an apt/PO box etc., it's only going to be two words, e.g. apt 3, apt B. I don't know how long a city name might be, so I want to grab anything after those two words as the city. This works fine if there is an apt and a number/letter. However, if it's just a city name it doesn't work, because it tries to break up the city name. Not sure how to handle that sort of thing.


So is it even possible to handle stuff before the address, like a phone number? I can never know for sure what will come before the address, but I would think there's a way to say: okay, there are several 5-digit numbers, but grab the last one first, and then continue on to the street name and so on?

 

You can search for it and make it optional in case it doesn't appear: (?:pattern)?

 

I've actually come across another difficult problem.

 

Can you supply some data samples for this?
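
To illustrate the optional-group idea for the phone number, here is a rough sketch. The phone format (three blocks separated by periods or hyphens) and the simplified address pattern with a single hard-coded state are both assumptions made for this example.

```php
<?php
// Hypothetical phone format: three blocks separated by periods or hyphens.
// The outer (?: ... )? makes the whole prefix optional.
$phone = '(?:(\d{3}[.-]\d{3}[.-]\d{4})\s+)?';
// Simplified stand-in for the address pattern; the single state is an assumption.
$address = '(\d{1,6})[ ]+([\w ]+?),?\s+(CA)\.?\s+(\d{5}(?:-\d{4})?)';
$regEx = '/' . $phone . $address . '/i';

preg_match($regEx, '619.299.8996 2828 CAMINO DEL RIO SOUTH  SAN DIEGO, CA. 92108', $withPhone);
preg_match($regEx, '19482 San Jose Ave Apt 21 City of Industry, CA 91748', $withoutPhone);

echo $withPhone[1], ' / ', $withPhone[2], "\n";        // 619.299.8996 / 2828
echo $withoutPhone[1], ' / ', $withoutPhone[2], "\n";  // (empty) / 19482
```

When the phone number is present it is consumed by the optional group, so the house number group starts at the right place; when it is absent, group 1 simply comes back empty.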


using the addresses : "19482 San Jose Ave. Apt 21 City of Industry, CA 91748 and 2828 CAMINO DEL RIO SOUTH  SAN DIEGO, CA. 92108"

 

parsing this results in

 

    [4] => Array
        (
            [0] => Apt 21 City of Industry
            [1] =>  SAN DIEGO
            [2] =>        Newport Beach
        )

 

and then I'm trying to use another regular expression to separate the Apt 21 from the city name

 

	$regEx = "/(\w*[ ]*\w*)[\s]+(.+)/i";	

 

 

results in

 

Array
(
    [0] => Array
        (
            [0] => Apt 21 City of Industry
        )

    [1] => Array
        (
            [0] => Apt 21
        )

    [2] => Array
        (
            [0] => City of Industry
        )

)
Array
(
    [0] => Array
        (
            [0] =>  SAN DIEGO
        )

    [1] => Array
        (
            [0] =>  SAN
        )

    [2] => Array
        (
            [0] => DIEGO
        )

)

 

Not sure how to handle this, because I don't want to put a limit on the word count of a city. I am assuming that there will only be two elements for the apt, i.e. apt 21 etc. It works fine when there is an apt; otherwise it breaks up the city incorrectly.
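
One way this split could be approached is to make the unit part an optional group that must start with a known designator, and treat whatever is left as the city. This is only a sketch: the `$unit` list is a short hypothetical sample, and `splitUnitAndCity` is a helper name made up for this example.

```php
<?php
// Hypothetical, shortened list of unit designators.
$unit = 'apt|suite|ste|unit|po box';

// The unit group is optional; whatever follows it is the city.
$regEx = '/^\s*(?:((?:' . $unit . ')\.?\s+\#?[A-Z\d]+),?\s+)?(.+?)\s*$/i';

// Hypothetical helper: returns the unit ('' when absent) and the city.
function splitUnitAndCity(string $regEx, string $text): array {
    preg_match($regEx, $text, $m);
    return ['unit' => $m[1], 'city' => $m[2]];
}

print_r(splitUnitAndCity($regEx, 'Apt 21 City of Industry'));
print_r(splitUnitAndCity($regEx, ' SAN DIEGO'));
```

Since the first group only matches when the text begins with one of the listed designators, a plain city name such as "SAN DIEGO" is no longer split in two, and the city can have any number of words.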

 


<pre>
<?php

### Data was parsed from http://www.usps.com/ncsc/lookups/abbreviations.html
### and patterns built with Perl's Regexp::Assemble.
$states = <<<STATES
	(?:(?:N(?:[CDHJMVY]|E(?:W (?:HAMPSHIRE|JERSEY|MEXICO|YORK)|(?:BRASK|VAD)A)?|ORTH(?: (?:CAROLIN|DAKOT)A|ERN MARIANA ISLANDS))|M(?:[DEHNPST]|A(?:R(?:SHALL ISLANDS|YLAND)|SSACHUSETTS|INE)?|I(?:SS(?:ISSIPP|OUR)I|NNESOTA|CHIGAN)?|O(?:NTANA)?)|A(?:[KSZ]|L(?:A(?:BAM|SK)A)?|R(?:KANSAS|IZONA)?|MERICAN SAMOA)|F(?:EDERATED STATES OF MICRONESIA|L(?:ORIDA)?|M)|I(?:L(?:LINOIS)?|N(?:DIANA)?|D(?:AHO)?|(?:OW)?A)|C(?:O(?:NNECTICUT|LORADO)?|(?:ALIFORNI)?A|T)|V(?:I(?:RGIN(?: ISLANDS|IA))?|(?:ERMON)?T|A)|P(?:[RW]|ENNSYLVANIA|UERTO RICO|A(?:LAU)?)|D(?:ISTRICT OF COLUMBIA|(?:ELAWAR)?E|C)|O(?:K(?:LAHOMA)?|R(?:EGON)?|H(?:IO)?)|S(?:[CD]|OUTH (?:CAROLIN|DAKOT)A)|K(?:(?:ENTUCK)?Y|(?:ANSA)?S)|T(?:[NX]|E(?:NNESSEE|XAS))|G(?:(?:EORGI)?A|U(?:AM)?)|R(?:HODE ISLAND|I)|L(?:OUISIAN)?A|H(?:AWAI)?I|UT(?:AH)?)|W(?:(?:A(?:SHINGTON)?|I(?:SCONSIN)?|EST VIRGINIA|V)|Y(?:OMING)?))
STATES;

$street_suffixes = <<<SUFFIXES
	(?:C(?:R(?:[KT]|E(?:S(?:(?:C?EN)?T)?|CENT|EK)|S(?:(?:C?N)?T|E(?:NT)?|SI?NG)|OSS(?:ROAD|ING)|CLE?)?|O(?:R(?:NERS?|S)?|UR(?:TS?|SE)|MMON|VES?)|A(?:USE?WAY|NYO?N|MP|PE)|IR(?:C(?:L(?:ES?)?)?|S)?|EN(?:T(?:ERS?|RE?)?)?|L(?:IFFS?|FS?|U?B)|N(?:TE?R|YN)|T(?:RS?|S)?|M[NP]|URVE?|PE?|SWY|VS?|YN|K)|S(?:T(?:[NS]|R(?:[MT]|A(?:V(?:E(?:N(?:UE)?)?|N)?)?|E(?:ETS?|AM|ME)|VN(?:UE)?)?|A(?:T(?:IO)?N)?)?|H(?:O(?:A(?:LS?|RS?)|RES?)|LS?|RS?)|P(?:R(?:INGS?|NGS?)|NGS?|URS?|GS?)|Q(?:U(?:ARES?)?|R[ES]?|S)?|(?:UM(?:IT?|MI)|M)T|K(?:YWA|W)Y)|P(?:A(?:RK(?:W(?:AYS?|Y)|S)?|SS(?:AGE)?|THS?)|L(?:A(?:IN(?:E?S)?|CE|ZA)|NS?|ZA?)?|R(?:[KR]|AI?RIE|TS?)?|K(?:W(?:YS?|AY)|Y)?|O(?:INTS?|RTS?)|I(?:KES?|NES?)|NES?|SGE|TS?)|B(?:O(?:UL(?:EVARD|V)?|T(?:TO?M)?)|R(?:A?NCH|I?DGE|OOKS?|KS?|G)?|Y(?:P(?:A(?:S?S)?|S)?|U)|L(?:UF(?:FS?)?|FS?|VD)|E(?:ACH|ND)|AYO[OU]|URGS?|GS?|CH|ND|TM)|M(?:O(?:UNT(?:AINS?|IN)?|TORWAY)|N(?:T(?:AIN|NS?)?|RS?)|E(?:(?:DO)?WS|ADOWS?)|I(?:SS(?:IO)?N|LLS?)|T(?:NS?|IN|WY)?|A(?:NORS?|LL)|DWS?|S?SN|LS?)|T(?:R(?:A(?:C(?:ES?|KS?)|FFICWAY|ILS?|K)|[FW]Y|N?PK|KS?|LS?|CE)?|U(?:N(?:N(?:ELS?|L)|LS?|EL)|RNP(?:IKE|K))|ER(?:R(?:ACE)?)?|HROUGHWAY|PKE?)|R(?:A(?:D(?:(?:I[AE])?L)?|NCH(?:ES)?|PIDS?|MP)|I(?:V(?:E?R)?|DGES?)|O(?:ADS?|UTE|W)|D(?:G[ES]?|S)?|NCHS?|U[EN]|E?ST|PDS?|TE|VR)|F(?:R(?:(?:(?:EE)?WA?|R)?Y|DS?|GS?|KS?|S?T)|OR(?:G(?:ES?)?|ESTS?|DS?|KS?|T)|L(?:ATS?|DS?|TS?|S)|(?:ERR|W)Y|IELDS?|ALLS?|T)|H(?:A(?:RB(?:ORS?|R)?|VE?N)|I(?:(?:GH)?WA?Y|LLS?)|OL(?:LOWS?|WS?)|L(?:LW|S)?|EIGHTS?|BRS?|RBOR|WA?Y|GTS|TS?|VN)|V(?:I(?:LL(?:AG(?:ES?)?|(?:IAG)?E|G)?|A(?:DU?CT)?|S(?:TA?)?|EWS?)|L(?:GS?|YS?|LY)?|ALL(?:EYS?|Y)|STA?|DCT|WS?)|G(?:R(?:D(?:NS?|EN)|OV(?:ES?)?|EENS?|NS?|VS?)|A(?:T(?:EWA?|WA)Y|RD(?:ENS?|N))|L(?:ENS?|NS?)|TWA?Y|DNS?)|L(?:A(?:N(?:D(?:ING)?|ES?)|KES?)?|O(?:CKS?|DGE?|OPS?|AF)|N(?:DN?G)?|IGHTS?|CKS?|DGE?|GTS?|KS?|F)|E(?:X(?:P(?:[WY]|R(?:ESS(?:WAY)?)?)?|T(?:(?:NS)?N|ENSIONS?|S)?)|ST(?:ATES?|S)?)|A(?:V(?:E(?:N(?:UE?)?)?|N(?:UE)?)?|L(?:L(?:E[EY]|Y)|Y)|RC(?:ADE)?|NN?E?X)|D(?:[LM]|R(?:[sV]|IV(?:ES?)?)?|IV(?:IDE)?|A(?:LE|M)|VD?)|J(?:UNCT(?:IONS?|O?N)|CT(?:ION|NS?|S)?)|I(?:S(?:L(?:ANDS?|NDS?|ES?)|S)?|NLE?T)|O(?:V(?:ERPASS|A?L)|RCH(?:A?RD)?|PAS)|W(?:A(?:L(?:KS?|L)|YS?)|ELLS?|LS?|Y)|K(?:N(?:OL(?:LS?)?|LS?)|EYS?|YS?)|U(?:N(?:(?:DERPAS)?S|IONS?)?|PAS)|X(?:ING|RD)|NE?CK)
SUFFIXES;

$unit_desig = <<<UNIT_DESIG
	(?:S(?:(?:UIT|ID)E|P(?:ACE|C)|T(?:OP|E)|LIP)|B(?:(?:ASEMEN|SM)T|(?:UILDIN|LD)G)|(?:H(?:ANGA|NG)|TR(?:AILE|L))R|L(?:O(?:WE?R|BBY|T)|BBY)|(?:DE|A)P(?:ARTMEN)?T|F(?:L(?:OOR)?|RO?NT)|P(?:ENTHOUSE|IER|H)|R(?:(?:OO)?M|EAR)|U(?:PPE?R|NIT)|OF(?:FICE|C))
UNIT_DESIG;

$data = '19482 San Jose Ave. Apt 21 City of Industry, CA 91748 and 2828 CAMINO DEL RIO SOUTH   SAN DIEGO, CA. 92108';

$regex= "/
	(\d{1,6}) ### Number.
	\s+
	([A-Z\s]+?) ### Street.
	\s+
	($street_suffixes\.?) ### Suffix.
	\s+
	(?: ### Optional Unit.
		(
			$unit_desig
			\.?
			\s+
			\#?[A-Z\d]+
		)
		,?
	)?
	\s+
	([A-Z\s]+) ### City.
	,?
	\s+
	($states)
	\.?
	\s+
	(\d{5}(?:-\d{4})?) ### Zip Code.
/xi";

preg_match_all($regex, $data, $matches);
print_r($matches);
?>
</pre>

 

Returns:

Array
(
    [0] => Array
        (
            [0] => 19482 San Jose Ave. Apt 21 City of Industry, CA 91748
        )

    [1] => Array
        (
            [0] => 19482
        )

    [2] => Array
        (
            [0] => San Jose
        )

    [3] => Array
        (
            [0] => Ave.
        )

    [4] => Array
        (
            [0] => Apt 21
        )

    [5] => Array
        (
            [0] => City of Industry
        )

    [6] => Array
        (
            [0] => CA
        )

    [7] => Array
        (
            [0] => 91748
        )

)
