rafaelm Posted July 12, 2010

Hello,

I am having problems with an array I am getting from a web page scrape. This is the code of the page I'm scraping. Nothing out of the norm here:

    <br /><br />
    <h2>Links</h2>
    <h3>
    <a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />
    <a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />
    <a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a><br />
    <a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a><br />
    <br />

And this is the code I'm using to get the links off the site:

    $data = file_get_contents('http://www.example.com/index.html');
    preg_match_all('/>http:\/\/www.link\.com\/.*?</', $data, $mulinks);
    print_r($mulinks);

For some reason I am getting this in the array:

    Array
    (
        [0] => Array
            (
                [0] => >http://www.link.com/?q=E8ON0ESE<
                [1] => >http://www.link.com/?q=LVZTY0TE<
                [2] => >http://www.link.com/?q=8ZYJY3MC<
                [3] => >http://www.link.com/?q=FJ3W6QAB<
            )
    )

I am still learning PHP, but I'm confused as to why it would create the array in this way. It seems to me the code creates an array and then nests another array inside it.

The problem comes up when I try to access a particular index inside the array:

    echo $mulinks[2];

should print the third value from the array, but I get this error:

    Notice: Undefined offset: 2 in C:\Program Files\EasyPHP5.2.10\www\script.php on line 58

I am guessing that the problem is that there is no actual index key [2] in the outer array. Instead (the way I see it), the [2] is inside the first nested array. Why is the array created this way? I know there has to be a way to access the values in the array as it is currently, but I want to create the array correctly.

Thanks,
Rafael
Psycho Posted July 12, 2010

You should read the manual for preg_match_all() - specifically the part on how the matches are assigned to the $matches parameter. There are a number of optional flags you can set to determine the format of the results.

http://us3.php.net/manual/en/function.preg-match-all.php

Also, you can improve your pattern so you don't get the leading > and trailing < characters (the dot after www is escaped here so it matches only a literal dot):

    preg_match_all('/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/', $data, $mulinks);

Output:

    Array
    (
        [0] => Array
            (
                [0] => <a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a>
                [1] => <a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a>
                [2] => <a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a>
                [3] => <a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a>
            )
        [1] => Array
            (
                [0] => http://www.link.com/?q=E8ON0ESE
                [1] => http://www.link.com/?q=LVZTY0TE
                [2] => http://www.link.com/?q=8ZYJY3MC
                [3] => http://www.link.com/?q=FJ3W6QAB
            )
    )
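As a rough illustration of the "optional flags" mentioned above, the sketch below runs the same pattern with the two ordering flags that preg_match_all() accepts, PREG_PATTERN_ORDER (the default) and PREG_SET_ORDER. The $data string is an assumption: it is a shortened copy of the markup from the first post rather than the real scraped page, and the variable names $byPattern and $bySet are made up for this example.

```php
<?php
// Assumption: a two-link excerpt of the markup from the first post,
// standing in for the page fetched with file_get_contents().
$data = '<a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />'
      . '<a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />';

$pattern = '/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/';

// PREG_PATTERN_ORDER (default): $byPattern[0] lists the full matches,
// $byPattern[1] lists what the first parenthesized subpattern captured.
$countPattern = preg_match_all($pattern, $data, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one sub-array per match; within it, [0] is the full
// match and [1] is the first parenthesized subpattern.
$countSet = preg_match_all($pattern, $data, $bySet, PREG_SET_ORDER);

// Both lines refer to the same captured URL (the first link).
echo $byPattern[1][0], "\n";
echo $bySet[0][1], "\n";
```

Either way the result is a nested array; the flag only decides whether the outer level is indexed by subpattern or by match.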
rafaelm Posted July 12, 2010

Thanks for your help mjdamato!

I read the php.net manual page but I'm still not getting the array this way:

    Array
    (
        [0] => >http://www.link.com/?q=E8ON0ESE<
        [1] => >http://www.link.com/?q=LVZTY0TE<
        [2] => >http://www.link.com/?q=8ZYJY3MC<
        [3] => >http://www.link.com/?q=FJ3W6QAB<
    )

I've been searching and I can't seem to find a specific reason why the array isn't created this way. I've worked with arrays before and it's the first time I've seen this.

For the regex, that's the best I could get it to do hehe... I wanted to match what was inside the > and < so it would get the links. Your example matches both the entire "<a href... " link and the URLs and generates two separate arrays. I am still trying to learn regex, but it's not so easy :-\
Psycho Posted July 12, 2010

Again, you need to read the manual for preg_match_all(). You won't get the results in the format you are wanting because it doesn't have any option for doing so. The default behavior of preg_match_all() is the same as if you used the PREG_PATTERN_ORDER flag. In that format, the results are returned in a multi-dimensional array where the first index contains an array of the "full" pattern matches and the second index contains an array of the values matched by the first parenthesized subpattern.

That is what I did with the example I provided. I included a full pattern that matches everything from the opening A tag to the ending A tag. But I included a subpattern in parentheses that matches ONLY the exact data you are wanting - the URL. The results are contained in a multi-dimensional array as I showed above. So, if you want to reference just the URLs you would do so like this: first URL = $mulinks[1][0], second URL = $mulinks[1][1], third URL = $mulinks[1][2], etc.

If you want a single-dimensional array of just the URLs, then you can define a new variable using:

    $justURLs = $mulinks[1];

This assumes you are using a parenthesized subpattern such as the one I provided.
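The indexing described above can be sketched as follows. This is only an illustration under stated assumptions: $data is a hard-coded copy of the four links from the first post (not the live page), and $thirdUrl is a name invented for this example.

```php
<?php
// Assumption: the four links from the first post, hard-coded so the
// example runs without fetching a page.
$data = '<a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />'
      . '<a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />'
      . '<a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a><br />'
      . '<a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a><br />';

preg_match_all('/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/', $data, $mulinks);

// With one parenthesized subpattern the outer array has exactly two keys,
// 0 (full matches) and 1 (captured URLs), so $mulinks[2] does not exist.
$outerKeys = array_keys($mulinks);

// Flat, single-dimensional list of just the URLs.
$justURLs = $mulinks[1];

// Third URL: same value as $mulinks[1][2]; null if fewer than three matched.
$thirdUrl = isset($justURLs[2]) ? $justURLs[2] : null;

echo $thirdUrl, "\n";
```

Checking with isset() before reading an index avoids the "Undefined offset" notice when the page yields fewer matches than expected.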
rafaelm Posted July 12, 2010

Ah yes, now that I re-read the php.net manual page I see how preg_match_all works. Your explanation is much clearer though.

Thanks, you've been a great help!