For some reason I'm having a hard time finding a regex to check for a street address and phone number. What I'm doing is crawling websites and trying to extract all this information.

 

<?php

preg_match_all('REGEX', $this->markup, $phone);

 

I'm using file_get_contents, then searching through that to extract the information.

 

For the phone number I want to extract as many US formats as I can.

(xxx) xxx-xxxx

xxx xxx xxxx

xxxxxxxxxx

 

And for the address, I just want to extract everything from the street address to the zip code.

 

EX.

7492 Street Name, city, state zip

 

I would greatly appreciate any help with this!

The problem with addresses is the format can vary too much, and they're not that different to any other text.
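Phone numbers are a lot more regular, so a pattern is realistic there. Here is a minimal sketch covering the three formats listed in the question. The function name `extractUsPhones` and the sample string are made up for illustration, and the pattern will still pick up any bare 10-digit number (an order ID, for instance), so treat the results as candidates rather than confirmed phone numbers.

```php
<?php
// Hypothetical helper (not from the original post): returns every substring
// that looks like a 10-digit US phone number in one of these shapes:
//   (xxx) xxx-xxxx, xxx xxx xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, xxxxxxxxxx
function extractUsPhones(string $text): array
{
    $pattern = '/(?<!\d)'                  // not preceded by another digit
             . '(?:\(\d{3}\)\s?'           // area code in parentheses, optional space
             . '|\d{3}[\s.-]?)'            // or bare area code, optional separator
             . '\d{3}[\s.-]?'              // exchange, optional separator
             . '\d{4}'                     // line number
             . '(?!\d)/';                  // not followed by another digit

    preg_match_all($pattern, $text, $matches);

    return $matches[0];                    // full matches only
}

$sample = 'Call (555) 123-4567 or 555 123 4567, fax 5551234567.';
print_r(extractUsPhones($sample));
```

With the `preg_match_all` call from the question, you would pass `$this->markup` in place of `$sample`. The digit lookarounds at both ends are there so the pattern does not grab ten digits out of the middle of a longer number.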

 

Absolutely. Humans are far superior to computers at making judgments when interpreting that type of input. An address, as entered by a person in a forum post, can vary a great deal. A human can identify the address because we do more than just identify the individual characters on the page. We can make judgments based on position and capitalization, and look at the context in which the data is displayed.

 

A computer can only make those determinations based upon the information you give it. Or, you can build a program that learns, but that might require you to go back to school and get a PhD in artificial intelligence.

 

You have a couple of options.

 

1) Scour the internet to see if anyone has something that will do an adequate job, although I would expect most solutions to require a payment.

 

2) Build your own solution. This will require some time and effort on your part. You would need to come up with the rules and write the code to apply them. The most difficult part will be coming up with rules that find as many addresses as possible without producing false positives.

 

Here is a common way to write an address:

 

123 Main St., Anywhere, CA 12345

 

Now, you could create a rule to find instances in the text where a number exists and then capture everything up to the next number. But that would return too many false positives. Example:

I hit 3 home runs in a game in 2009 at my school
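To make that concrete, here is a rough sketch of the "number, then everything up to the next number" rule. The function name `naiveAddressMatch` is just something I made up for the example. It matches the address, but it matches the home run sentence too.

```php
<?php
// Hypothetical helper: returns the first span that starts with a run of
// digits, continues through non-digit characters, and ends with the next
// run of digits. Returns null when there is no such span.
function naiveAddressMatch(string $text): ?string
{
    return preg_match('/\d+\D+\d+/', $text, $m) === 1 ? $m[0] : null;
}

// The real address is captured in full...
echo naiveAddressMatch('123 Main St., Anywhere, CA 12345'), "\n";
// prints: 123 Main St., Anywhere, CA 12345

// ...but so is ordinary text that happens to contain two numbers.
echo naiveAddressMatch('I hit 3 home runs in a game in 2009 at my school'), "\n";
// prints: 3 home runs in a game in 2009
```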

 

So, you have to keep making the rule more specific, but the more specific you make it, the greater the chance that real addresses won't be found.
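As an illustration of that trade-off, here is a sketch of a stricter rule: street number, street name, comma, city, comma, two-letter state, five-digit zip (optionally ZIP+4). The function name `strictAddressMatch` and the exact pattern are my own assumptions about what "more specific" could look like, not a recommended final answer.

```php
<?php
// Hypothetical helper: matches only "NUMBER STREET, CITY, ST ZIP" written on
// one line with commas and a two-letter uppercase state abbreviation.
// Returns the matched address, or null when nothing fits.
function strictAddressMatch(string $text): ?string
{
    $pattern = '/\b\d{1,6}\s+'             // street number
             . '[A-Za-z0-9 .]+?,\s*'       // street name up to the first comma
             . '[A-Za-z .]+?,\s*'          // city up to the second comma
             . '[A-Z]{2}\s+'               // two-letter state abbreviation
             . '\d{5}(?:-\d{4})?\b/';      // zip, optionally ZIP+4

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// The common format is found.
var_dump(strictAddressMatch('123 Main St., Anywhere, CA 12345'));

// The false positive from the looser rule is gone (null).
var_dump(strictAddressMatch('I hit 3 home runs in a game in 2009 at my school'));

// But a real address with the state spelled out is now missed (null).
var_dump(strictAddressMatch('123 Main St., Anywhere, California 12345'));
```

Each variation you want to support (spelled-out states, missing commas, suite numbers, addresses split across lines or HTML tags) means loosening the pattern again, which brings back some of the false positives.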
