pocobueno1388 Posted June 23, 2010

For some reason I'm having a hard time finding a regex to match a street address and a phone number. I'm crawling websites and trying to extract this information:

<?php preg_match_all('REGEX', $this->markup, $phone);

I fetch each page with file_get_contents, then search through the markup to extract the information.

For the phone number, I want to match as many US formats as I can:

(xxx) xxx-xxxx
xxx xxx xxxx
xxxxxxxxxx

For the address, I just want to extract everything from the street address to the zip code, e.g.

7492 Street Name, city, state zip

I would greatly appreciate any help with this!
ZachMEdwards Posted June 24, 2010

$phone = '/\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b/';

That matches all the major phone number formats, including the three you listed.
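As a quick check, the pattern above can be run against a sample string. This is only a sketch: the $markup text and the digit-normalising step are assumptions added for illustration and are not part of the original post. In the question, the input would be $this->markup.

```php
<?php
// Pattern from the post above, unchanged.
$pattern = '/\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b/';

// Hypothetical sample text standing in for the crawled markup.
$markup = 'Call (555) 123-4567, 555 123 4567 or 5551234567. '
        . 'Order #12345 is not a phone number.';

// $matches[0] holds every full match found in the text.
preg_match_all($pattern, $markup, $matches);
$phones = $matches[0];

// Strip everything except digits so the different formats compare equal.
$digits = array_map(function ($p) {
    return preg_replace('/\D/', '', $p);
}, $phones);

print_r($phones);
print_r($digits);
```

Note that both parentheses are optional independently, so the pattern would also accept an unbalanced form such as "555) 123-4567". Whether that matters depends on how clean the crawled pages are.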
Adam Posted June 24, 2010

The problem with addresses is the format can vary too much, and they're not that different to any other text.
Psycho Posted June 24, 2010

"The problem with addresses is the format can vary too much, and they're not that different to any other text."

Absolutely. Humans are far superior to computers at making "judgements" when interpreting that type of input. An address, as entered by a person in a forum post, can be very dynamic. A human can identify the address because we do more than just identify the individual characters on the page. We can make judgements based upon position and capitalization, and we can look at the context in which the data is displayed. A computer can only make those determinations based upon the information you give it. Or you can build a program that learns, but that might require you to go back to school and get a PhD in artificial intelligence.

You have a couple of options.

1) Scour the internet to see if anyone has something that will do an adequate job, although I would expect most solutions to require payment.

2) Build your own solution. This will require some time and effort on your part. You would need to come up with the rules and write the code to apply them.

The most difficult part will be coming up with rules that find the most addresses without any false positives. Here is a common way to write an address:

123 Main St., Anywhere, CA 12345

Now, you could create a rule to find instances in the text where a number exists and then capture everything up to the next number. But that would return too many false positives. Example:

I hit 3 home runs in a game in 2009 at my school

So you have to keep making the rule more specific, but the more specific you make it, the greater the chance that real addresses won't be found.
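To make the trade-off above concrete, here is a minimal sketch of one stricter rule for the "123 Main St., Anywhere, CA 12345" layout. The function name findAddresses and the pattern itself are hypothetical and were not posted in the thread. The sketch assumes a street number, one to five letters-only street words, a city, a two-letter uppercase state, and a five-digit zip, separated by commas as in the example.

```php
<?php
// Hypothetical helper, not from the thread: one deliberately strict rule
// for "number street, city, ST zip". Assumes a US-style, comma-separated layout.
function findAddresses(string $text): array
{
    $pattern = '/\b\d{1,6}\s+'                          // street number
             . '[A-Za-z]+\.?(?:\s+[A-Za-z]+\.?){0,4}'   // 1 to 5 street-name words
             . ',\s*[A-Za-z]+(?:\s+[A-Za-z]+)*'         // city (letters only)
             . ',\s*[A-Z]{2}'                           // two-letter state
             . '\s+\d{5}\b/';                           // five-digit zip
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// The false-positive sentence from the post, the sample address,
// and a real-looking address that this rule is too strict to find.
print_r(findAddresses('I hit 3 home runs in a game in 2009 at my school'));
print_r(findAddresses('Visit us at 123 Main St., Anywhere, CA 12345 today.'));
print_r(findAddresses('Send it to 7492 5th Ave, Springfield, IL 62704'));
```

The first call finds nothing and the second returns the sample address, but the third also finds nothing, because "5th" is not a letters-only street word. Loosening the rule to allow digits in street names would recover that address at the cost of more false positives, which is exactly the tension described above.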