
Say I have a foreach loop that depends on $arr. If I add elements to $arr while still inside the foreach loop, will foreach process the newly created elements, or will it stop at whatever the upper boundary of $arr was before entering the loop?

 

If it doesn't process elements added while inside the foreach, what would be a workaround?


Nope, it will stop at the boundary the array had before the foreach loop started, so the following will only output 1-10:

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

foreach ($arr as $ass) {
    $arr[] = $ass + 1;
    echo $ass;
}

Output: 12345678910

 

However, the following will show the full array:

 

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

foreach ($arr as $ass) {
    $arr[] = $ass + 1;
}

foreach ($arr as $ass) {
    echo $ass . "<br />";
}

 

Output (one number per line): 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11

 

However, rhodesa's example does keep going if you add more and more to the array, but the following example

 

for ($n = 0; $n < count($arr); $n++) {
    $arr[] = $n + 1;
    echo $arr[$n] . "<br />";
}

 

results in an endless loop, because count($arr) is re-evaluated on every pass while the array grows by one element each pass. So make sure you don't add something to the array on every pass.
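If you want a for loop that, like the foreach above, ignores elements appended during iteration, one workaround is to cache the count before the loop; a minimal sketch:

$arr = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

// Cache the length once, so appends inside the loop
// cannot extend the number of iterations.
$len = count($arr);
for ($n = 0; $n < $len; $n++) {
    $arr[] = $n + 1;
    echo $arr[$n] . "<br />";
}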

My code is a PHP web spider, so it may or may not add an element (or more than one) per loop...

 

E.g.:

 

http://yoursite.com may produce 300 extra elements

http://yoursite.com/map.php may produce (say) 10 elements

http://yoursite.com/about.php may produce none

 

All variables are unknown...

 

How do I avoid an infinite loop, yet keep processing the ever-changing upper boundary of the array until completion? I hate recursion, lol!

Recursion is awesome, don't be a hater. =P

 

But how exactly does your script work? I've never used a web spider before, so I would probably have to see the inner code to give any advice. Depending on how the loop works, it might just prevent an infinite loop by itself. You could set a cap on how many times it can loop (e.g. if $i > 5000, break;).

It's function-based: I pass it a URL (e.g. http://www.something.com ) and it returns an array containing all the links on that page; each of those links will also need to be spidered for other links.

 

So what I need to do is pass it a link, get back all the links contained within it, and then begin "dynamic looping": if it returns index.html, hello.html, and whatever.html, it'll need to spider those three for links, then whatever links it finds within those, and so on. The number of links returned is arbitrary and unknown, hence the loop needs to pay attention to the ever-growing array rather than just the upper boundary the array had when execution began (a worklist-style loop, sketched after this post, is one way to do that).

 

If you would like to see the code, I can provide it. I'm not a hater, I have just never got on with recursion... I tend to try to visualise the whole thing, which is pretty much impossible and ends up bogging my brain down.
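For reference, a worklist-style loop would look something like this. Treat the array as a queue, keep a separate list of URLs already seen, and loop while the queue is non-empty; get_links($url) is a hypothetical stand-in for the spider function described above:

$queue   = array('http://www.something.com');
$visited = array();

while (!empty($queue)) {
    $url = array_shift($queue);   // take the next URL off the queue

    if (isset($visited[$url])) {
        continue;                 // already spidered, skip it
    }
    $visited[$url] = true;

    // get_links() is hypothetical -- substitute your own function here.
    foreach (get_links($url) as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;     // the worklist grows as new links appear
        }
    }
}

Because the while condition is re-checked on every pass, the loop naturally follows the growing array, and the $visited check is what guarantees it terminates on a finite site.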

Well, recursion would be the obvious solution to that, but you would have to set a limit telling the script how deep it should go.

 

Not just recursion; you also have to check for repeated links. Say you spider a website that uses a menu for its links: every page will have that same menu, which means without going deep at all you're effectively looping infinitely.
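A sketch combining the depth limit and the repeated-link check, again assuming a hypothetical get_links($url):

function spider($url, &$visited, $depth = 0, $maxDepth = 10) {
    // Stop on repeated links or once the depth cap is hit.
    if (isset($visited[$url]) || $depth > $maxDepth) {
        return;
    }
    $visited[$url] = true;

    // get_links() is hypothetical -- your link-extracting function goes here.
    foreach (get_links($url) as $link) {
        spider($link, $visited, $depth + 1, $maxDepth);
    }
}

$visited = array();
spider('http://www.something.com', $visited);
$allLinks = array_keys($visited);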

It should go as deep as possible (i.e. index all links), because I'm going to lock it to the domain, and duplicates will be removed. I think I'm going to have to sit and stare at this one for a while.

 

The idea is to cron-job the script, or at least password-protect it so I can execute it whenever I see fit, and from the output generate a sitemap.xml; one will be generated daily. The domain will be a filler (e.g. <domain>), and I'll use another script to change (str_replace) that filler to whatever domain I see fit, so that wherever my site code is deployed, a sitemap will be readily available.
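The filler idea is straightforward once the link list exists; a rough sketch, where $paths and the file names are just illustrative assumptions:

// $paths is assumed to hold the de-duplicated paths collected by the spider.
$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($paths as $path) {
    $xml .= '  <url><loc><domain>/' . htmlspecialchars(ltrim($path, '/')) . "</loc></url>\n";
}
$xml .= "</urlset>\n";
file_put_contents('sitemap.template.xml', $xml);

// Later, per deployment: swap the <domain> filler for the real domain.
$template = file_get_contents('sitemap.template.xml');
file_put_contents('sitemap.xml',
    str_replace('<domain>', 'http://www.something.com', $template));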

Quote: "Well, recursion would be the obvious solution to that, but you would have to set a limit telling the script how deep it should go. ... Not just recursion; you also have to check for repeated links."

 

Checking for repeated links would be a fine feature, yes, but not strictly necessary: the script would still stop at, e.g., depth 10. You could then just remove repeated links from the array afterwards.
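For instance, assuming the links ended up in a flat array, de-duplicating them afterwards is a one-liner:

$links = array_values(array_unique($links));  // drop repeats, re-index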

 

One thing you will need, however, is a function to convert relative URLs to absolute URLs:

 

<?php
// Adapted from: http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if (!$p) {
        // $relative is a seriously malformed URL
        return false;
    }
    // Already absolute? Return it unchanged.
    if (isset($p['scheme'])) {
        return $relative;
    }

    $parts = parse_url($absolute);

    if (substr($relative, 0, 1) == '/') {
        // Root-relative URL: ignore the base path entirely.
        $cparts = explode('/', $relative);
        array_shift($cparts);
    } else {
        // Document-relative URL: start from the base path's directory.
        if (isset($parts['path'])) {
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);            // drop the file name
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = explode('/', $relative);
        $cparts = array_merge($aparts, $rparts);

        // Resolve "." and ".." segments.
        foreach ($cparts as $i => $part) {
            if ($part == '.') {
                unset($cparts[$i]);
            } else if ($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i - 1]);
            }
        }
    }
    $path = implode('/', $cparts);

    // Reassemble scheme, credentials, host and path.
    $url = '';
    if (isset($parts['scheme'])) {
        $url = "$parts[scheme]://";
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ':' . $parts['pass'];
        }
        $url .= '@';
    }
    if (isset($parts['host'])) {
        $url .= $parts['host'] . '/';
    }
    $url .= $path;

    return $url;
}
?>
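For example, resolving links found on a page against that page's own URL:

echo relative2absolute('http://www.example.com/dir/page.html', '../other.html');
// prints: http://www.example.com/other.html

echo relative2absolute('http://www.example.com/dir/page.html', '/top.html');
// prints: http://www.example.com/top.html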

 

And to be 100% sure, you should also check every page for a base tag and, if one is found, use its value as the first parameter to the above function.
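A quick sketch of that check; a simple regex will do for well-formed pages, though a real HTML parser would be more robust ($html, $pageUrl and $foundLink are hypothetical variables here):

// Look for <base href="..."> in the fetched HTML.
if (preg_match('/<base\s+href=["\']([^"\']+)["\']/i', $html, $m)) {
    $base = $m[1];      // the page declares its own base URL
} else {
    $base = $pageUrl;   // fall back to the URL the page was fetched from
}
$absoluteLink = relative2absolute($base, $foundLink);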

I wonder if you'll include subdomains as well, or just restrict it to either http://www.example.com/ or http://example.com/ (the two are quite different, and may be a pain in the long run when it comes to getting absolute URLs).

 

Perhaps simply spider using relative URLs, and then rewind back up the array making everything absolute.

 

I.e.:

 

Site Root/Public_Html domain

- contains an array of files and folders

-- folders contain an array of the files and folders within them

 

So, as you rewind back to the top level of the array, you build up the relative path...
