
Print preg_match_all result


drizac


Hi

First time poster, long time reader.

 

Background:

 

I want to extract different URLs from websites by using file_get_contents() + regex.

The number of URLs differs from website to website.

 

Status Quo:

 

I have the result in a variable, but when I use print_r I get the result as an array inside another array. This is my code:

 

$html = file_get_contents($html);

preg_match_all('/'http:\/\/website.com\/.+#1/', $html , $strng);

print_r ($strng);

 

Result:

 Array ( [0] => Array ( [0] => 'http://website.com/first#1 [1] => 'http://slideshop.com/website-second#1  ) )

 

Question:

 

How can I print the result as a list when the array is inside another array?

Thank you in advance

 

:D

To print out each individual result, you would have something like:

 

$html = file_get_contents($html);

preg_match_all('/\'http:\/\/website\.com\/.+#1/', $html, $strng);

print $strng[0][0];
print $strng[0][1]; //etc...

 

Or you can use a foreach loop:

 

foreach ($strng as $val) {
    echo $val[0] . "\n";
    echo $val[1] . "\n";
    echo $val[2] . "\n";
    echo $val[3] . "\n";
    echo $val[4] . "\n";
}

 

The top-level array has one entry per capture group, with index 0 holding the matches of the entire pattern, and each inner array contains the strings matched for that group. Since you don't use a capture group in your pattern, you only get the matches of the entire pattern. Therefore, if you wish to loop through all of the strings that matched the entire pattern, you can use ...

 

foreach($strng[0] as $url) {
    echo $url;
}
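
As a quick aside, here is a sketch of how that nesting changes once a capture group is added (hypothetical pattern and URLs, not taken from the original post):

preg_match_all('~http://website\.com/(\w+)#1~', $html, $strng);

// $strng[0] holds the full-pattern matches, e.g. 'http://website.com/first#1'
// $strng[1] holds what capture group 1 matched, e.g. 'first'
print_r($strng);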

I'm not entirely sure what you're really trying to do here, because your preg_match_all as-is produces a syntax error, and you also say you are trying to grab URLs from a website but have a specific URL named in the regex.  Are you trying to match URLs that belong to a specific domain?

 

In any case, whether it's all links or a specific domain, your regex will match the URL wherever it appears in the page source.  So if it sits, for instance, in a JavaScript variable or is displayed on the page as plain text (not a link), your regex is going to match it.  You should restrict your pattern so it only matches when the URL appears in an anchor tag's href attribute, along the lines of the sketch below.
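
For instance, a rough sketch (the domain, the "#1" suffix, and the quoting style are assumptions carried over from your pattern; the DOM approach further down is still the more robust route):

preg_match_all('~<a\s[^>]*href=["\'](http://website\.com/[^"\']+#1)["\']~i', $html, $strng);
print_r($strng[1]); // capture group 1 holds just the URLs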

 

Also, it looks like you're attempting an exact "head" match... is that what you really want? For instance, your regex will match http://www.somesite.com but not https://www.somesite.com or www.somesite.com.
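
If you do want to allow those variations, something like this (a sketch only, keeping your literal domain and "#1" as assumptions) makes the scheme and the www. prefix optional:

preg_match_all('~(?:https?://)?(?:www\.)?website\.com/\S+#1~i', $html, $strng);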

 

And overall, you shouldn't really be using regex to parse links or other HTML to begin with.  You should use DOM instead.  Here is a function to grab link URLs on a page:

 

function get_links($page, $url = false) {
  $xml = new DOMDocument();
  $xml->loadHTML($page);
  $links = array();
  foreach ($xml->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if ( $url ) {
      // a domain was given: keep the link only if its host matches it
      if ( strcasecmp($url, parse_url($href, PHP_URL_HOST)) == 0 )
        $links[] = $href;
    } else {
      // no domain given: keep every link on the page
      $links[] = $href;
    }
  }
  return $links;
} // end get_links

 

 

get all links

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get all the links
$links = get_links($page);

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.somelink.com
    [1] => /
    [2] => /some/path/to/file.html
    [3] => http://www.someotherlink.com/a/b/c/d.html
)

 

get only links from www.xyz.com

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get all the links
$links = get_links($page,'www.xyz.com');

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.xyz.com
    [1] => http://www.xyz.com/some/file.html
)

 

Archived

This topic is now archived and is closed to further replies.
