
Print preg_match_all result


drizac


Hi

First time poster, long time reader.

 

Background:

 

I want to extract different URLs from websites by using file_get_contents() + regex.

The number of URLs differs from website to website.

 

Status Quo:

 

I have the result in a variable, but when I use print_r I get the result as an array inside another array. This is my code:

 

$html = file_get_contents($html);

preg_match_all('/'http:\/\/website.com\/.+#1/', $html , $strng);

print_r ($strng);

 

Result:

 Array ( [0] => Array ( [0] => 'http://website.com/first#1 [1] => 'http://slideshop.com/website-second#1  ) )

 

Question:

 

How can I print the result as a list when the array is inside another array?

Thank you in advance

 

:D

To print out each individual result, you would have something like:

 

$html = file_get_contents($html);

preg_match_all('/\'http:\/\/website\.com\/.+#1/', $html, $strng);

print $strng[0][0];
print $strng[0][1]; //etc...

 

Or you can use a foreach loop:

 

foreach ($strng as $val) {
    echo $val[0] . "\n";
    echo $val[1] . "\n";
    echo $val[2] . "\n";
    echo $val[3] . "\n";
    echo $val[4] . "\n";
}

 

The top-level array has one entry per capture group, with index 0 holding the matches of the entire pattern, and each inner array contains the strings matched for that group. Since you don't use a capture group in your pattern, you only get the matches of the entire pattern. Therefore, if you wish to loop through all of the strings that matched the entire pattern, you can use ...

 

foreach($strng[0] as $url) {
    echo $url;
}
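
As a quick aside, here is a sketch of how that nesting changes once a capture group is added (hypothetical pattern and URLs, not taken from the original post):

preg_match_all('~http://website\.com/(\w+)#1~', $html, $strng);

// $strng[0] holds the full-pattern matches, e.g. 'http://website.com/first#1'
// $strng[1] holds what capture group 1 matched, e.g. 'first'
print_r($strng);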

I'm not entirely sure what you're really trying to do here, because your preg_match_all as-is produces a syntax error, and you also say you are trying to grab URLs from a website but have a specific URL named in the regex.  Are you trying to match URLs that belong to a specific domain?

 

In any case, whether it's all links or a specific domain, your regex will match the URL wherever it appears in the page source.  So if it sits, for instance, in a JavaScript variable or is displayed on the page as plain text (not a link), your regex is going to match it.  You should restrict your pattern so it only matches when the URL appears in an anchor tag's href attribute, along the lines of the sketch below.
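
For instance, a rough sketch (the domain, the "#1" suffix, and the quoting style are assumptions carried over from your pattern; the DOM approach further down is still the more robust route):

preg_match_all('~<a\s[^>]*href=["\'](http://website\.com/[^"\']+#1)["\']~i', $html, $strng);
print_r($strng[1]); // capture group 1 holds just the URLs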

 

Also, it looks like you're attempting an exact "head" match... is that what you really want? For instance, your regex will match http://www.somesite.com but not https://www.somesite.com or www.somesite.com.
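
If you do want to allow those variations, something like this (a sketch only, keeping your literal domain and "#1" as assumptions) makes the scheme and the www. prefix optional:

preg_match_all('~(?:https?://)?(?:www\.)?website\.com/\S+#1~i', $html, $strng);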

 

And overall, you shouldn't really be using regex to parse links or other HTML to begin with.  You should use DOM instead.  Here is a function to grab link URLs on a page:

 

function get_links($page, $url = false) {
  $xml = new DOMDocument();
  $xml->loadHTML($page);
  $links = array();
  foreach ($xml->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if ( $url ) {
      // a domain was given: keep the link only if its host matches it
      if ( strcasecmp($url, parse_url($href, PHP_URL_HOST)) == 0 )
        $links[] = $href;
    } else {
      // no domain given: keep every link on the page
      $links[] = $href;
    }
  }
  return $links;
} // end get_links

 

 

get all links

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get all the links
$links = get_links($page);

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.somelink.com
    [1] => /
    [2] => /some/path/to/file.html
    [3] => http://www.someotherlink.com/a/b/c/d.html
)

 

get only links from www.xyz.com

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get all the links
$links = get_links($page,'www.xyz.com');

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.xyz.com
    [1] => http://www.xyz.com/some/file.html
)

 

Archived

This topic is now archived and is closed to further replies.
