Hi

First time poster, long time reader.

 

Background:

 

I want to extract different URLs from websites by using file_get_contents() + regex.

The number of URLs differs from website to website.

 

Status Quo:

 

I have the result in a variable. But when I use print_r I get the result as an array inside another array, this is my code:

 

$html = file_get_contents($html);

preg_match_all('/'http:\/\/website.com\/.+#1/', $html , $strng);

print_r ($strng);

 

Result:

 Array ( [0] => Array ( [0] => 'http://website.com/first#1 [1] => 'http://slideshop.com/website-second#1  ) )

 

Question:

 

How can I print the result as a list when the array is inside another array?

Thank you in advance :D

https://forums.phpfreaks.com/topic/241887-print-preg_match_all-result/

To print out each individual result, you would have something like:

 

$html = file_get_contents($html);

// Double-quote the PHP string so the literal ' in the pattern doesn't end it
preg_match_all("/'http:\/\/website\.com\/.+#1/", $html, $strng);

print $strng[0][0];
print $strng[0][1]; //etc...

 

Or you can use a foreach loop:

foreach ($strng as $group) {
    // $group holds every match for one capture group (index 0 = the whole pattern)
    foreach ($group as $val) {
        echo $val . "\n";
    }
}

 

The top level array represents the capture groups, and the inner arrays are the items in each capture group. Since you don't use a capture group in your pattern you only get the resulting match of the entire pattern. Therefore if you wish to loop through all of the strings that matched the entire pattern you can use ...

 

foreach ($strng[0] as $url) {
    // one URL per line
    echo $url . "\n";
}
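
To make the nesting a bit more concrete, here is a rough sketch with one capture group added to the pattern. The sample string and the variable names $byGroup and $byMatch are made up for illustration, and the first call assumes the default PREG_PATTERN_ORDER flag. The second call shows PREG_SET_ORDER, which groups the result per match instead of per capture group.

```php
<?php
// Hypothetical sample input; the real $html would come from file_get_contents().
$html = "var a = 'http://website.com/first#1';\nvar b = 'http://website.com/second#1';";

// One capture group around the path part. The default flag is PREG_PATTERN_ORDER.
preg_match_all("~'http://website\.com/(.+)#1~", $html, $byGroup);

// $byGroup[0] = full matches, $byGroup[1] = contents of capture group 1
foreach ($byGroup[0] as $i => $full) {
    echo $full . ' -> ' . $byGroup[1][$i] . "\n";
}

// With PREG_SET_ORDER the outer array has one entry per match instead.
preg_match_all("~'http://website\.com/(.+)#1~", $html, $byMatch, PREG_SET_ORDER);
foreach ($byMatch as $match) {
    echo $match[0] . ' -> ' . $match[1] . "\n";
}
```

Which ordering is nicer depends on how you loop: PREG_SET_ORDER tends to read better once there is more than one capture group.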

I'm not entirely sure what you're really trying to do here, because your preg_match_all as-is produces a syntax error, and you also say you are trying to grab URLs from a website but have a specific URL named in the regex. Are you trying to match only URLs on a specific domain?

 

In any case, whether it's all links or a specific domain, your regex will match the URL wherever it appears in the page source. So if it is, for instance, in a JavaScript variable or displayed on the page as plain text (not a link), your regex is going to match it. You should write your pattern so that it matches only when the URL appears in an anchor tag's href attribute.
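
If you do stay with regex, something along these lines would limit matches to href attributes of anchor tags. This is only a sketch: the sample page source is invented, and the pattern assumes the href value is quoted, so it will miss unquoted attributes and other unusual markup.

```php
<?php
// Hypothetical page source: the URL appears in a script, as plain text, and in a link.
$html = <<<'HTML'
<script>var u = "http://website.com/in-script";</script>
<p>Visit http://website.com/plain-text</p>
<a class="x" href="http://website.com/in-link">link</a>
HTML;

// Match only inside the href attribute of an <a> tag.
// Group 1 is the opening quote, group 2 is the URL, \1 requires the same closing quote.
preg_match_all('~<a\s[^>]*?href\s*=\s*(["\'])(http://website\.com/.*?)\1~i', $html, $m);

foreach ($m[2] as $url) {
    echo $url . "\n";
}
```

Only the URL inside the anchor ends up in $m[2]; the script and plain-text occurrences are skipped.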

 

Also, it looks like you're attempting an exact "head" match. Is that what you really want? For instance, your regex will match http://www.somesite.com but not https://www.somesite.com or www.somesite.com
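
If the looser match is what you want, one possible way to write it is to make the scheme optional. The candidate list and the $pattern variable below are just for illustration, and the pattern assumes each string being tested is a single URL on its own.

```php
<?php
// Hypothetical candidates to test against.
$candidates = array(
    'http://www.somesite.com',
    'https://www.somesite.com',
    'www.somesite.com',
    'http://www.othersite.com',
);

// Optional http:// or https://, then the host, case-insensitive.
// The host must be followed by a slash or the end of the string.
$pattern = '~^(?:https?://)?www\.somesite\.com(?:/|$)~i';

foreach ($candidates as $candidate) {
    echo $candidate . ' => ' . (preg_match($pattern, $candidate) ? 'match' : 'no match') . "\n";
}
```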

 

And overall, you shouldn't really be using regex to parse links or other HTML to begin with. You should instead use DOM. Here is a function to grab link URLs on a page:

 

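function get_links($page, $url = false) {
  $xml = new DOMDocument();
  // Real-world HTML is rarely valid; collect parser warnings instead of printing them
  $previous = libxml_use_internal_errors(true);
  $xml->loadHTML($page);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  $links = array();
  foreach ($xml->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if ($url) {
      // Keep the link only if its host matches $url (case-insensitive);
      // relative links have no host, so they are skipped here
      if (strcasecmp($url, (string) parse_url($href, PHP_URL_HOST)) == 0)
        $links[] = $href;
    } else {
      $links[] = $href;
    }
  }
  return $links;
} // end get_links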

 

 

Get all links:

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get all the links
$links = get_links($page);

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.somelink.com
    [1] => /
    [2] => /some/path/to/file.html
    [3] => http://www.someotherlink.com/a/b/c/d.html
)

 

Get only links from www.xyz.com:

// get the page content
$page = file_get_contents('http://www.somesite.com');

// get only the links whose host is www.xyz.com
$links = get_links($page,'www.xyz.com');

echo "<pre>";print_r($links); echo "</pre>";

 

output:

Array
(
    [0] => http://www.xyz.com
    [1] => http://www.xyz.com/some/file.html
)
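
One thing to be aware of with get_links: an anchor without an href (a named anchor, for example) comes back as an empty string in the "all links" case. A possible variant, sketched here with DOMXPath, selects only anchors that actually carry an href. The inline $page string is made up so the snippet runs without a network request.

```php
<?php
// Hypothetical inline page so the sketch runs without a network request.
$page = '<html><body><a href="http://www.xyz.com/some/file.html">a</a>'
      . '<a name="top">no href</a><a href="/relative">b</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
// //a[@href] selects only anchors that have an href attribute
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

The host filter from get_links could be applied inside the same loop if you only want one domain.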

 
