Again, proof please..

I can by example equally counter claim that "a crawler bot is indeed 87% regex, and 13% additional trade secret algorithms which not many people know of". Doesn't make my claim correct, despite me stating that indeed it is this or that.

Provide reputable links..evidence.

 

Okay, I obviously don't know what Google's crawler bot code looks like.  For all I know, they could indeed have their crawler bot do some ranking evaluation right there on the spot.  But I wouldn't bet on that.  It certainly wouldn't adhere to what we like to call "good programming practice."

 

I'm just saying, in general, that's what crawler bots do.  They have a very specific function: go in and pull out info based on specific regex(es).  I mean, it's like going to a form on a site and me saying that 99% of this form's job is to take information, and you saying "nuh-uh, prove it, it could be doing other things."  Well sure, it could, but it's not the so-called "standard."

 

Perhaps we are still miscommunicating.  Look at what I've been talking to Daniel about.  When I say a crawler is 99% regex, I mean that it is 99% about pattern recognition.  Yes, it probably does have "trade secret" algorithms, actual code with conditions and loops, not just a simple preg_match(pattern), but it's all for filtering the data, so that when it's done, it reports back to the script that sent it out. Those "trade secret" algorithms are part of the regex in the sense that they are pattern recognition.  That's why I was trying to explain it to Dan from a broader PoV.
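To make the "pattern plus conditions and loops" idea concrete, here is a minimal sketch in Python (the `re` module plays the role of preg_match here). It is only an illustration of the general shape being described, not anything taken from a real crawler: the pattern `LINK_RE`, the function `extract_links`, and the sample page are all made up for this example.

```python
import re

# Hypothetical pattern: captures the href value of anchor tags.
LINK_RE = re.compile(r'<a\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html, allowed_prefix="http"):
    """Return unique links, in first-seen order, that start with allowed_prefix."""
    seen = set()
    links = []
    for url in LINK_RE.findall(html):          # the raw pattern match
        if not url.startswith(allowed_prefix):  # extra filtering condition
            continue
        if url in seen:                         # drop duplicates
            continue
        seen.add(url)
        links.append(url)
    return links

# Made-up sample page for illustration only.
page = '''
<a href="http://example.com/a">A</a>
<a class="x" href="http://example.com/b">B</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="http://example.com/a">A again</a>
'''

print(extract_links(page))
```

The regex does the initial grab, and the loop and conditions around it do the filtering before anything gets "reported back". Whether you count that surrounding logic as "regex" or not is exactly what is being argued about in this thread.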

So yes, a crawler bot is indeed 99% regex

 

Again, proof please..

I can by example equally counter claim that "a crawler bot is indeed 87% regex, and 13% additional trade secret algorithms which not many people know of". Doesn't make my claim correct, despite me stating that indeed it is this or that.

Provide reputable links..evidence.

 

He has his own proprietary definition of regex, so in a sense he can easily claim that it is, with absolute certainty, X% regex.

On a small side note with regard to spiders / SEO algorithms... this is a constantly moving target. Stuff that once worked, like Meta Keyword tags, which got abused to no end, is now apparently null and void, so spiders had to be revamped to take this into consideration, as a small example. HTML 5, as another example, will (I am assuming here) alter the way spiders operate and the content they report back. I still can't get over the quote from the HTML 5 editor that full recommendation will only come around the year 2022!

 

"It is estimated, again by the editor, that HTML5 will reach a W3C recommendation in the year 2022 or later. This will be approximately 18-20 years of development, since beginning in mid-2004. That's actually not that crazy, though. Work on HTML4 started in the mid 90s, and HTML4 still, more than ten years later, hasn't reached the level that we want to reach with HTML5"

 

So if I understand this correctly, we as developers will be able to start using HTML 5 much sooner, but it will take quite some time for this system to fully mature, which in turn leads me to believe that Google and friends will constantly have to keep altering their algorithms to keep up as newer specs start emerging. Good grief, I would hate to have to be the one to keep tabs on this stuff. On a more depressing note, by the time HTML 6 or 7 is out at this rate, I'll be dead :(

Yes, crawlers, scrapers, etc.. constantly have to be updated.  Call it regex, call it patterns, call it regex and patterns, call it whatever; things are constantly changing.  Standards constantly changing.  People constantly interpreting them in their own way.  People constantly changing site layout, so the regex no longer works as expected.  That's why it's not really ideal to come to a site like this, asking for a regex to grab xyz from some page.  Tomorrow or next week, it may no longer be applicable.  Better to take the time to learn it yourself.
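As a rough illustration of how a layout change breaks a scraping regex, here is a small Python sketch. Everything in it is hypothetical: the pattern `PRICE_RE`, the helper `grab_price`, and both HTML snippets are invented for the example and don't come from any real site.

```python
import re

# Hypothetical pattern written against one specific page layout.
PRICE_RE = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')

# Made-up markup: the same data before and after a site redesign.
old_layout = '<div><span class="price">$19.99</span></div>'
new_layout = '<div><span class="product-price" data-currency="USD">19.99</span></div>'

def grab_price(html):
    """Return the captured price string, or None if the pattern no longer matches."""
    m = PRICE_RE.search(html)
    return m.group(1) if m else None

print(grab_price(old_layout))  # matches the layout the pattern was written for
print(grab_price(new_layout))  # class name and "$" changed, so no match
```

The data is still on the page in the second case, but the pattern was tied to the old class name and the dollar sign, so it silently returns nothing. That is why a regex handed out today can stop working next week.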

Yes, crawlers, scrapers, etc.. constantly have to be updated.  Call it regex, call it patterns, call it regex and patterns, call it whatever; things are constantly changing.  Standards constantly changing.  People constantly interpreting them in their own way.  People constantly changing site layout, so the regex no longer works as expected.  That's why it's not really ideal to come to a site like this, asking for a regex to grab xyz from some page.  Tomorrow or next week, it may no longer be applicable.  Better to take the time to learn it yourself.

 

Agreed.
