
Say I have a foreach loop that depends on $arr. If I add elements to $arr while still inside the foreach loop, will foreach process the newly created elements, or will it stop at whatever the upper boundary of $arr was before entering the loop?

 

If it doesn't process elements added while inside the foreach, what would be a workaround?


Nope, it will stop at the boundary the array had before the foreach loop started, so the following will only output 1-10:

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

foreach ($arr as $ass) {
    $arr[] = $ass + 1;
    echo $ass;
}

Output: 12345678910

 

However, the following will show the full array:

 

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

foreach ($arr as $ass) {
    $arr[] = $ass + 1;
}

foreach ($arr as $ass) {
    echo $ass . "<br />";
}

 

Output (one number per line): 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11

 

However, rhodesa's example does keep going if you add more and more to the array, but the following example

 

for ($n = 0; $n < count($arr); $n++) {
    $arr[] = $n + 1;
    echo $arr[$n] . "<br />";
}

 

results in an endless loop, because count($arr) is re-evaluated on every pass while the array grows by one element each pass. So make sure you don't add something to the array on every pass.
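If you want a for loop that, like the foreach above, ignores elements appended during iteration, one workaround is to cache the count before the loop; a minimal sketch:

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

// Cache the length once, so appends inside the loop
// cannot extend the number of iterations.
$len = count($arr);
for ($n = 0; $n < $len; $n++) {
    $arr[] = $n + 1;
    echo $arr[$n] . "<br />";
}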

My code is a PHP web spider, so it may or may not add an element (or more than one) per loop...

 

E.g.:

 

http://yoursite.com may produce 300 extra elements

http://yoursite.com/map.php may produce (say) 10 elements

http://yoursite.com/about.php may produce none

 

All variables are unknown...

 

How do I avoid an infinite loop, yet keep processing the ever-changing upper boundary of the array until completion? I hate recursion, lol!

Recursion is awesome, don't be a hater. =P

 

But how exactly does your script work? I've never used a web spider before, so I would probably have to see the inner code to give any advice. Depending on how the loop works, it might just prevent an infinite loop by itself. You could set a cap on how many times it can loop (e.g. if $i > 5000, break;).

It's function-based: I pass it a URL (e.g. http://www.something.com ) and it returns an array containing all the links on that page; each of those links will also need to be spidered for other links.

 

So what I need to do is pass it a link, get back all the links contained within it, and then begin "dynamic looping": if it returns index.html, hello.html, and whatever.html, it'll need to spider those three for links, then whatever links it finds within those, and so on. The number of links returned is arbitrary and unknown, hence the loop needs to pay attention to the ever-growing array rather than just the upper boundary the array had when execution began (a worklist-style loop, sketched after this post, is one way to do that).

 

If you would like to see the code, I can provide it. I'm not a hater, I have just never got on with recursion... I tend to try to visualise the whole thing, which is pretty much impossible and ends up bogging my brain down.
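For reference, a worklist-style loop would look something like this. Treat the array as a queue, keep a separate list of URLs already seen, and loop while the queue is non-empty; get_links($url) is a hypothetical stand-in for the spider function described above:

$queue   = array('http://www.something.com');
$visited = array();

while (!empty($queue)) {
    $url = array_shift($queue);   // take the next URL off the queue

    if (isset($visited[$url])) {
        continue;                 // already spidered, skip it
    }
    $visited[$url] = true;

    // get_links() is hypothetical -- substitute your own function here.
    foreach (get_links($url) as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;     // the worklist grows as new links appear
        }
    }
}

Because the while condition is re-checked on every pass, the loop naturally follows the growing array, and the $visited check is what guarantees it terminates on a finite site.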

Well, recursion would be the obvious solution to that, but you would have to set a limit telling the script how deep it should go.

 

Not just recursion; you also have to check for repeated links. Say you spider a website that uses a menu for its links: every page will have that same menu, which means without going deep at all you're effectively looping infinitely.
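A sketch combining the depth limit and the repeated-link check, again assuming a hypothetical get_links($url):

function spider($url, &$visited, $depth = 0, $maxDepth = 10) {
    // Stop on repeated links or once the depth cap is hit.
    if (isset($visited[$url]) || $depth > $maxDepth) {
        return;
    }
    $visited[$url] = true;

    // get_links() is hypothetical -- your link-extracting function goes here.
    foreach (get_links($url) as $link) {
        spider($link, $visited, $depth + 1, $maxDepth);
    }
}

$visited = array();
spider('http://www.something.com', $visited);
$allLinks = array_keys($visited);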

It should go as deep as possible (i.e. index all links), because I'm going to lock it to the domain, and duplicates will be removed. I think I'm going to have to sit and stare at this one for a while.

 

The idea is to cron-job the script, or at least password-protect it so I can execute it whenever I see fit, and from the output generate a sitemap.xml; one will be generated daily. The domain will be a filler (e.g. <domain>), and I'll use another script to change (str_replace) that filler to whatever domain I see fit, so that wherever my site code is deployed, a sitemap will be readily available.
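The filler idea is straightforward once the link list exists; a rough sketch, where $paths and the file names are just illustrative assumptions:

// $paths is assumed to hold the de-duplicated paths collected by the spider.
$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($paths as $path) {
    $xml .= '  <url><loc><domain>/' . htmlspecialchars(ltrim($path, '/')) . "</loc></url>\n";
}
$xml .= "</urlset>\n";
file_put_contents('sitemap.template.xml', $xml);

// Later, per deployment: swap the <domain> filler for the real domain.
$template = file_get_contents('sitemap.template.xml');
file_put_contents('sitemap.xml',
    str_replace('<domain>', 'http://www.something.com', $template));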

Quote: "Well, recursion would be the obvious solution to that, but you would have to set a limit telling the script how deep it should go. ... Not just recursion; you also have to check for repeated links."

 

Checking for repeated links would be a fine feature, yes, but not strictly necessary: the script would still stop at, e.g., depth 10. You could then just remove repeated links from the array afterwards.
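For instance, assuming the links ended up in a flat array, de-duplicating them afterwards is a one-liner:

$links = array_values(array_unique($links));  // drop repeats, re-index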

 

One thing you will need, however, is a function to convert relative URLs to absolute URLs:

 

<?php
// Adapted from: http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if (!$p) {
        // $relative is a seriously malformed URL
        return false;
    }
    // Already absolute? Return it unchanged.
    if (isset($p['scheme'])) {
        return $relative;
    }

    $parts = parse_url($absolute);

    if (substr($relative, 0, 1) == '/') {
        // Root-relative URL: ignore the base path entirely.
        $cparts = explode('/', $relative);
        array_shift($cparts);
    } else {
        // Document-relative URL: start from the base path's directory.
        if (isset($parts['path'])) {
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);            // drop the file name
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = explode('/', $relative);
        $cparts = array_merge($aparts, $rparts);

        // Resolve "." and ".." segments.
        foreach ($cparts as $i => $part) {
            if ($part == '.') {
                unset($cparts[$i]);
            } else if ($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i - 1]);
            }
        }
    }
    $path = implode('/', $cparts);

    // Reassemble scheme, credentials, host and path.
    $url = '';
    if (isset($parts['scheme'])) {
        $url = "$parts[scheme]://";
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ':' . $parts['pass'];
        }
        $url .= '@';
    }
    if (isset($parts['host'])) {
        $url .= $parts['host'] . '/';
    }
    $url .= $path;

    return $url;
}
?>
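For example, resolving links found on a page against that page's own URL:

echo relative2absolute('http://www.example.com/dir/page.html', '../other.html');
// prints: http://www.example.com/other.html

echo relative2absolute('http://www.example.com/dir/page.html', '/top.html');
// prints: http://www.example.com/top.html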

 

And to be 100% sure, you should also check every page for a base tag and, if one is found, use its value as the first parameter to the above function.
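A quick sketch of that check; a simple regex will do for well-formed pages, though a real HTML parser would be more robust ($html, $pageUrl and $foundLink are hypothetical variables here):

// Look for <base href="..."> in the fetched HTML.
if (preg_match('/<base\s+href=["\']([^"\']+)["\']/i', $html, $m)) {
    $base = $m[1];      // the page declares its own base URL
} else {
    $base = $pageUrl;   // fall back to the URL the page was fetched from
}
$absoluteLink = relative2absolute($base, $foundLink);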

I wonder if you'll include subdomains as well, or just restrict it to either http://www.example.com/ or http://example.com/ (the two are quite different, and may be a pain in the long run when it comes to getting absolute URLs).

 

Perhaps simply spider using relative URLs, and then rewind back up the array making everything absolute.

 

I.e.:

 

Site Root/Public_Html domain

- contains an array of files and folders

-- folders contain an array of the files and folders within them

 

So, as you rewind back to the top level of the array, you build up the relative path...
