Tables for layout

.josh · December 29, 2008

Again, proof please..

I can by example equally counter claim that "a crawler bot is indeed 87% regex, and 13% additional trade secret algorithms which not many people know of". Doesn't make my claim correct, despite me stating that indeed it is this or that.

Provide reputable links..evidence.

Okay, I obviously don't know what Google's crawler bot code looks like. For all I know, they could indeed have their crawler bot do some ranking evaluation right there on the spot. But I wouldn't bet on that. It certainly wouldn't adhere to what we like to call "good programming practice."

I'm just saying, in general, that's what crawler bots do. They have a very specific function: go in and pull out info based on specific regex(es). I mean, it's like going to a form on a site and me saying that 99% of this form's job is to take information, and you saying "nuh-uh, prove it, it could be doing other things." Well sure, it could, but it's not the so-called "standard."

Perhaps we are still miscommunicating. Look at what I've been talking to Daniel about. When I say a crawler is 99% regex, I mean that it is 99% about pattern recognition. Yes, it probably does have "trade secret" algorithms, actual code with conditions and loops, not just a simple preg_match(pattern), but it's all for filtering the data, so that when it's done, it reports back to the script that sent it out. Those "trade secret" algorithms are part of the regex in the sense that they are pattern recognition too. That's why I was trying to explain it to Dan from a broader PoV.
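To make the "pattern plus conditions and loops" idea concrete, here is a minimal sketch. It is written in Python with the `re` module rather than PHP's preg_match, and the pattern, the function name `extract_links`, and the sample HTML are all made up for illustration. It says nothing about how any real search engine's crawler works.

```python
import re

# Hypothetical pattern: captures the href value of anchor tags.
# A real crawler would need far more than this.
LINK_PATTERN = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return absolute http(s) links found in html, in order, without duplicates."""
    seen = set()
    links = []
    for url in LINK_PATTERN.findall(html):
        # The "conditions and loops" part: filter what the pattern matched.
        if not url.startswith(("http://", "https://")):
            continue  # skip relative links in this sketch
        if url in seen:
            continue  # skip links already collected
        seen.add(url)
        links.append(url)
    return links


sample = (
    '<p><a href="http://example.com/a">A</a> '
    '<a class="x" href="/local">L</a> '
    '<A HREF=\'http://example.com/a\'>dup</A></p>'
)
print(extract_links(sample))
```

The regex does the raw matching, and the loop around it does the filtering before anything is "reported back" to the caller.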

Daniel0 · December 29, 2008

So yes, a crawler bot is indeed 99% regex

Again, proof please..

I can by example equally counter claim that "a crawler bot is indeed 87% regex, and 13% additional trade secret algorithms which not many people know of". Doesn't make my claim correct, despite me stating that indeed it is this or that.

Provide reputable links..evidence.

He has his own proprietary definition of regex, so in a sense he can easily claim that it is, with absolute certainty, X% regex.

nrg_alpha · December 30, 2008

On a small side note with regard to spiders / SEO algorithms... this is a constantly moving target. Stuff that once worked, like meta keyword tags that got abused to no end, is now apparently null and void, so spiders had to be revamped to take this into consideration, as a small example. HTML 5, as another example (I am assuming here), will alter the way spiders operate and the content they report back. I still can't get over the quote from the HTML 5 editor that the full recommendation will only arrive around the year 2022!

"It is estimated, again by the editor, that HTML5 will reach a W3C recommendation in the year 2022 or later. This will be approximately 18-20 years of development, since beginning in mid-2004. That's actually not that crazy, though. Work on HTML4 started in the mid 90s, and HTML4 still, more than ten years later, hasn't reached the level that we want to reach with HTML5"

So if I understand this correctly, we as developers will be able to start using HTML 5 much sooner, but it will take quite some time for this system to fully mature, which in turn leads me to believe that Google and friends will constantly have to keep altering their algorithms to keep up as newer specs start emerging. Good grief, I would hate to have to be the one to keep tabs on this stuff. On a more depressing note, by the time HTML 6 or 7 is out at this rate, I'll be dead.

.josh · December 30, 2008

Yes, crawlers, scrapers, etc.. constantly have to be updated. Call it regex, call it patterns, call it regex and patterns, call it whatever; things are constantly changing. Standards constantly changing. People constantly interpreting them in their own way. People constantly changing site layout, so the regex no longer works as expected. That's why it's not really ideal to come to a site like this, asking for a regex to grab xyz from some page. Tomorrow or next week, it may no longer be applicable. Better to take the time to learn it yourself.
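As a rough illustration of how a layout change breaks a scraping regex, here is a small Python sketch. The pattern, the function name `grab_price`, and both HTML snippets are hypothetical, invented only to show the failure mode.

```python
import re

# Hypothetical pattern written against one specific page layout.
PRICE_PATTERN = re.compile(r'<span class="price">\$([0-9]+\.[0-9]{2})</span>')


def grab_price(html):
    """Return the price string if the page still matches the expected layout, else None."""
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


# The layout the pattern was written for.
old_layout = '<div><span class="price">$19.99</span></div>'
# Same information after a redesign: different tag, extra class.
new_layout = '<div><strong class="price amount">$19.99</strong></div>'

print(grab_price(old_layout))  # 19.99
print(grab_price(new_layout))  # None
```

The price is still on the page after the redesign, but the pattern silently stops matching, which is why a regex handed out today may be useless next week.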

nrg_alpha · December 30, 2008

Agreed.
