
Crawler returning odd links


sharpxs


Hey there,

 

I am about to publish my PHP website and just had a friend run a crawler over my staging server.

The crawler returned many odd links where folders were duplicated multiple times.

 

My folder structure looks like this:

~/index.htm

~/service/<several service-related html files>

~/img/*.png

~/hardware/<several hardware-related html files>

 

The crawler returned hundreds of links (as both source and destination URLs) that look like this:

 

http://www.mydomain.com/service/service/img/img/img/img/hardware/file.htm

http://www.mydomain.com/service/service/service/img/img/impressum.htm

http://www.mydomain.com/service/service/service/img/img/img/img/img/img/logos_small/logo.png

(The crawler reports this last link, which is an image file, as pointing to facebook.com. None of the logos I have linked are supposed to point to Facebook :( )

 

Oddly, these links all work (the crawler reports status code 200)! I can paste them into the address bar of my browser and the file actually shows up. It just doesn't load the CSS.
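
One possible explanation for this pattern, offered as an assumption since nothing in the thread confirms it: a relative link without a leading slash is resolved against the folder of the page it sits on. If the server also answers 200 for the deeper path, each followed link adds another folder level, and a relative stylesheet reference would break in the same way. The Python sketch below shows the resolution rule with the standard library's `urljoin`; the file names (`page.htm`, `other.htm`, `logo.png`) are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page on the staging site (file names are made up for illustration).
page = "http://www.mydomain.com/service/page.htm"

# A relative href without a leading slash resolves against the page's folder.
relative = urljoin(page, "service/other.htm")

# A root-relative href always resolves from the site root.
root_relative = urljoin(page, "/service/other.htm")

# If the server answers 200 for the deeper URL, links on that response nest further.
nested = urljoin(relative, "img/logo.png")

print(relative)
print(root_relative)
print(nested)
```

If this turns out to be the cause, root-relative links (starting with `/`) would stop the nesting, although a server that returns 200 for paths that don't exist would still be worth a look.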

 

Does anybody have an idea what might cause this odd behaviour? I have never seen this before:-/.

 

Thanks very much in advance.

Cheers,

Lars


Hey there,

 

I built it myself, no framework used. The spider lists those odd links as both source and destination. To date, the site isn't linked anywhere. They're all internal references.

Interestingly, I just ran a spider myself (probably a different one from the one used before) and it found no odd links :-/.

Let me check what software my friend was using.

 

Cheers,

Lars


I'm not sure I understand what you mean. The strange thing, however, is that my debugger doesn't find any issues. I also ran the spider on my localhost (an exact copy of the website) and didn't find any such issues at all. Totally weird.

It seems that this link checker (it's called Xenu) gets into some kind of loop. I am redesigning an existing website, and the redesign is currently published to staging. Most of the existing news articles (replicated to staging) contain references to the live site (the one in the old design). Somehow the spider gets to the live site and then back to staging, which doesn't make any sense, because staging isn't linked on live :(

I'm still investigating and running tests in isolation. I'll let you know when I've figured out what's going on...
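
For tests like these, a small filter can separate the suspicious URLs from the rest of a crawler report. This is only a minimal Python sketch, assuming the report can be exported as a plain list of URLs; the function name `has_repeated_folder` is made up here and is not a Xenu feature.

```python
from urllib.parse import urlsplit

def has_repeated_folder(url):
    """Return True if any path segment is immediately followed by itself."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(a == b for a, b in zip(segments, segments[1:]))

# Sample report: two URLs in the odd style from the first post, one normal URL.
report = [
    "http://www.mydomain.com/service/service/img/img/img/img/hardware/file.htm",
    "http://www.mydomain.com/service/service/service/img/img/impressum.htm",
    "http://www.mydomain.com/index.htm",
]

suspicious = [url for url in report if has_repeated_folder(url)]
for url in suspicious:
    print(url)
```

Checking which source page each flagged URL was found on should show where the first doubled folder enters the crawl.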

