Hello,

I am having problems with an array I am getting from a web page scrape.

 

This is the code of the page I'm scraping. Nothing out of the norm here:

<br /><br />
<h2>Links</h2>

<h3>
<a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />
<a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />
<a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a><br />
<a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a><br />

<br />

 

And this is the code I'm using to get the links off the site:

 

$data = file_get_contents('http://www.example.com/index.html');

preg_match_all('/>http:\/\/www.link\.com\/.*?</', $data, $mulinks);

print_r($mulinks);

 

For some reason I am getting this in the array:

 

Array
(
    [0] => Array
        (
            [0] => >http://www.link.com/?q=E8ON0ESE<
            [1] => >http://www.link.com/?q=LVZTY0TE<
            [2] => >http://www.link.com/?q=8ZYJY3MC<
            [3] => >http://www.link.com/?q=FJ3W6QAB<
        )

)

 

I am still learning PHP, and I'm confused as to why it would create the array this way. It looks like the code creates an array and then nests a second array inside it?

 

The problem comes up when I try to access a particular index inside the array:

 

echo $mulinks[2];

 

should print the third value from the array, but I get this error:

 

Notice: Undefined offset: 2 in C:\Program Files\EasyPHP5.2.10\www\script.php on line 58

 

I am guessing the problem is that there is no actual index key [2] in the outer array. Instead (the way I see it), the [2] is inside the first nested array.

 

I'm confused as to why the array is created this way. I know there has to be a way to access the values in the array as it is currently, but I want to create the array correctly.

 

Thanks,

Rafael

 

https://forums.phpfreaks.com/topic/207516-problem-with-an-array/

You should read the manual for preg_match_all() - specifically on how the matches are assigned to the $matches parameter. There are a number of optional flags you can set to determine the format of the results.

 

http://us3.php.net/manual/en/function.preg-match-all.php

 

Also, you can improve your pattern so you don't get the leading and ending <> characters:

 

preg_match_all('/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/', $data, $mulinks);

 

Output:

Array
(
    [0] => Array
        (
            [0] => <a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a>
            [1] => <a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a>
            [2] => <a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a>
            [3] => <a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a>
        )

    [1] => Array
        (
            [0] => http://www.link.com/?q=E8ON0ESE
            [1] => http://www.link.com/?q=LVZTY0TE
            [2] => http://www.link.com/?q=8ZYJY3MC
            [3] => http://www.link.com/?q=FJ3W6QAB
        )

)

Thanks for your help mjdamato!

 

I read the php.net manual page, but I still can't get the array in this format:

 

Array
(
            [0] => >http://www.link.com/?q=E8ON0ESE<
            [1] => >http://www.link.com/?q=LVZTY0TE<
            [2] => >http://www.link.com/?q=8ZYJY3MC<
            [3] => >http://www.link.com/?q=FJ3W6QAB<
)

 

I've been searching and I can't seem to find a specific reason why the array isn't created this way. I've worked with arrays before and it's the first time I've seen this.

 

For the regex, that's the best I could get it to do hehe... I wanted to match what was inside the > and < so it would get the links. Your example matches both the entire "<a href... " link and the URLs and generates two separate arrays.

I am still trying to learn regex, but it's not so easy  :-\

Again, you need to read the manual for preg_match_all(). You won't get the results in the format you are wanting because it doesn't have any option for doing so.

 

The default behavior of preg_match_all() is the same as if you used the PREG_PATTERN_ORDER flag. In that format, the results are returned in a multi-dimensional array where the first index contains an array of the "full" pattern matches and the second index contains an array of the values matched by the first parenthesized subpattern.
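
To make the two result layouts concrete, here is a minimal sketch that runs the same pattern with the default ordering and with PREG_SET_ORDER. The sample markup is copied from the first post (shortened to two links); the variable names $byPattern and $bySet are only illustrative.

```php
<?php
// Sample markup copied from the first post (two of the four links).
$data = '<a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />' . "\n"
      . '<a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />';

$pattern = '/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/';

// Default (PREG_PATTERN_ORDER):
//   $byPattern[0] = all full matches, $byPattern[1] = all first-subpattern matches.
preg_match_all($pattern, $data, $byPattern);

// PREG_SET_ORDER:
//   one sub-array per match; within each, [0] = full match, [1] = first subpattern.
preg_match_all($pattern, $data, $bySet, PREG_SET_ORDER);

// Both lines print the second URL; only the order of the two indexes differs.
echo $byPattern[1][1], "\n";
echo $bySet[1][1], "\n";
```

Neither flag produces a flat, single-level array, which is why the result always has to be indexed twice or copied out of the nested array.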

 

That is what I did in the example I provided. The full pattern matches everything from the opening A tag to the closing A tag, but I included a sub-pattern in parentheses that matches ONLY the exact data you want: the URL. The results are contained in a multi-dimensional array, as I showed above.

 

So, if you want to reference just the URLs you would do so like this:

 

first URL = $mulinks[1][0],

second URL = $mulinks[1][1],

third URL = $mulinks[1][2],

etc.

 

If you want a single-dimensional array of just the URLs, you can define a new variable:

$justURLs = $mulinks[1];

This assumes you are using a parenthesized sub-pattern such as the one I provided.
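
Putting the pieces together, a rough end-to-end sketch might look like the following. It uses the sample markup from the first post as an inline string instead of file_get_contents(), so it runs on its own; the fallback to an empty array when nothing matches is an added assumption, not something from the thread.

```php
<?php
// Inline copy of the scraped markup from the first post.
$data = <<<HTML
<h3>
<a href="http://www.link.com/?d=E8ON0ESE" target="_blank">http://www.link.com/?q=E8ON0ESE</a><br />
<a href="http://www.link.com/?d=LVZTY0TE" target="_blank">http://www.link.com/?q=LVZTY0TE</a><br />
<a href="http://www.link.com/?d=8ZYJY3MC" target="_blank">http://www.link.com/?q=8ZYJY3MC</a><br />
<a href="http://www.link.com/?d=FJ3W6QAB" target="_blank">http://www.link.com/?q=FJ3W6QAB</a><br />
HTML;

$pattern = '/<a[^>]*>(http:\/\/www\.link\.com\/[^<]*)<\/a>/';

// preg_match_all() returns the number of full matches (0 if none, false on error).
if (preg_match_all($pattern, $data, $mulinks)) {
    // Index 1 holds the values captured by the first parenthesized sub-pattern.
    $justURLs = $mulinks[1];
} else {
    // Assumption: treat "no matches" as an empty list rather than an error.
    $justURLs = array();
}

// The flat array can now be indexed the way the original post expected.
echo $justURLs[2], "\n";

foreach ($justURLs as $i => $url) {
    echo $i, ': ', $url, "\n";
}
```

With the original pattern from the first post, which has no parentheses, the same flat list is available as $mulinks[0], so $mulinks[0][2] would reach the third match without any change to the regex.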
