Crawler returning odd links

sharpxs · August 28, 2012

Hey there,

I am about to publish my PHP website and just had a friend run a crawler over my staging server.

The crawler returned many odd links where folders were duplicated multiple times.

My folder structure looks like this:

~/index.htm

~/service/<several service-related html files>

~/img/*.png

~/hardware/<several hardware-related html files>

The crawler returned hundreds of links (as both source and destination URLs) that look like this:

http://www.mydomain.com/service/service/img/img/img/img/hardware/file.htm

http://www.mydomain.com/service/service/service/img/img/impressum.htm

http://www.mydomain.com/service/service/service/img/img/img/img/img/img/logos_small/logo.png

(According to the crawler, this last link, which is an image file, refers to facebook.com. None of the logos I have linked are supposed to point to Facebook.)

Oddly, these links all work (the crawler reports status code 200)! I can paste them into the address bar of my browser and the file actually shows up. It just doesn't load the CSS.

Does anybody have an idea what might cause this odd behaviour? I have never seen this before:-/.

Thanks much in advance.

Cheers,

Lars

requinix · August 28, 2012

What framework does the site run on? Did you make it yourself? Can you find out where the spider is getting those URLs from (i.e., the referring page)?
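To narrow down the referring page, one option is to filter the crawler's report for the first hop where a normal-looking page links to a URL with a duplicated folder. The sketch below assumes the report can be exported as (source, destination) pairs; that format and the function names `has_repeated_segment` and `first_bad_links` are made up for illustration and are not part of any crawler's API.

```python
from urllib.parse import urlparse


def has_repeated_segment(url):
    """Return True if a path segment is immediately followed by an identical one."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(a == b for a, b in zip(segments, segments[1:]))


def first_bad_links(pairs):
    """Keep (source, destination) pairs where a clean page links to a suspicious URL.

    `pairs` is a hypothetical export format: an iterable of
    (source_url, destination_url) tuples taken from the crawler report.
    """
    return [
        (src, dst)
        for src, dst in pairs
        if has_repeated_segment(dst) and not has_repeated_segment(src)
    ]


if __name__ == "__main__":
    # Illustrative data only; these pairs are not from the actual crawl.
    report = [
        ("http://www.mydomain.com/index.htm",
         "http://www.mydomain.com/service/a.htm"),
        ("http://www.mydomain.com/service/a.htm",
         "http://www.mydomain.com/service/service/b.htm"),
        ("http://www.mydomain.com/service/service/b.htm",
         "http://www.mydomain.com/service/service/service/c.htm"),
    ]
    for src, dst in first_bad_links(report):
        print(src, "->", dst)
```

The pages that come out as sources are the ones worth opening first, since the link that starts the chain should be in their markup.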

sharpxs · August 28, 2012

Hey there,

I built it myself, no framework used. The spider lists those odd links as both source and destination. To date, the site isn't linked anywhere. They're all internal references.

Interestingly, I just ran a spider myself (probably a different one from the one used before) and it found no odd links :-/.

Let me check what software my friend was using.

Cheers,

Lars

requinix · August 28, 2012

If it's your framework... well, even if it wasn't... then debug through it as it tries to serve one of those URLs. Figure out why it thinks they're valid and fix it.
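One mechanism that can produce URLs like these, offered only as a guess and not confirmed for this site, is a relative link that gets resolved against a page which already sits inside the folder it names. If the server also answers 200 for the resulting path, a crawler keeps following the same link and the folder stacks up. The sketch below shows just the URL arithmetic with Python's `urljoin`; the page URL and the `href` value are hypothetical.

```python
from urllib.parse import urljoin


def follow(start_url, href, hops):
    """Resolve the same relative href repeatedly, as a crawler would if every
    resulting URL returned a page containing that link again."""
    url = start_url
    visited = []
    for _ in range(hops):
        url = urljoin(url, href)
        visited.append(url)
    return visited


if __name__ == "__main__":
    # Hypothetical: a page in /service/ links to "service/index.htm"
    # (relative, no leading slash) where "/service/index.htm" was intended.
    for url in follow("http://www.mydomain.com/service/index.htm",
                      "service/index.htm", 3):
        print(url)
```

A root-relative link such as `/service/index.htm` resolves to the same URL from any page, so it does not compound this way. The same resolution rule would also be consistent with the CSS not loading on the odd URLs, if the stylesheet is referenced by a relative path.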

sharpxs · August 28, 2012

Not sure I get what you mean. The strange thing is that my debugger doesn't find any issues. I also ran the spider against my localhost (an exact copy of the website) and it didn't find any such links at all. Totally weird.

It seems that this link checker (it's called Xenu) gets into some kind of loop. I am redesigning an existing website, and the new version is currently published to staging. Most of the existing news articles (replicated to staging) contain references to the live site (the one in the old design). Somehow the spider gets to the live site and then back to staging, which doesn't make any sense, because staging isn't linked from live :(.

Am still investigating, running tests in isolation. I'll let you know when I've figured out what's going on...

Christian F. · August 28, 2012

Sounds like the issue is more with your production server than with your code, especially since you cannot replicate the issue locally.

I'd have a look at the server, if I were you. Chances are you'll find the reason for this issue there.
