ignace Posted October 3, 2012

I am creating a job website specifically for web developers (me!) as a fun learning side project. It crawls other job websites, analyzes the content and tags each job post (e.g. Drupal, WordPress, Zend Framework, Symfony, ...). It's the tagging that I am unsure how to approach.

As an experiment I wrote this:

    // remove noise: keep only letters, digits, whitespace, commas and dots
    $text = preg_replace('/[^a-z\d\s,.]/i', '', $job->getTextDescription());

    // alternations of the terms to tag, grouped by category
    $frontend    = 'html\d?|css\d?|javascript|js|jquery|xml|xpath|mobile|seo';
    $backend     = 'php|(my|ms)?sql(ite)?';
    $formats     = 'json|xml';
    $webservices = 'soap|rest';
    $frameworks  = 'drupal|facebook|wordpress|zend\s?framework\s?\d?|symfony\s?(\d|components)?';

    $tags = array($frontend, $backend, $formats, $webservices, $frameworks);

    // one case-insensitive pattern with word boundaries on both sides
    preg_match_all(
        '/\b(' . implode('|', $tags) . ')\b/i',
        $text,
        $matches
    );

    // turn every distinct match into a Tag object
    $tags = array();
    foreach (array_unique($matches[0]) as $tagName) {
        $tag = new Tag();
        $tag->setName($tagName);
        $tags[] = $tag;
    }

    return $tags;

As you can see, I am trying to match as widely as possible. For example, I would match: zendframework, zendframework2, zend framework2, zend framework 2

I am not asking for help with code, just the concept of tagging content in the best way possible. Does anyone know of a better way to tag content than using weird-ass regexes like:

    zend\s?framework\s?\d?

Link to comment https://forums.phpfreaks.com/topic/269032-analyzing-and-auto-tagging-content/
Christian F. Posted October 3, 2012

That's not weird, it's a rather plain and simple RegExp. That is, unless you consider all regular expressions "weird", in which case I'm afraid you just have to bite the bullet.

The only other way is to set up an array with every single possible permutation of the words you're looking for, which will quickly become an unwieldy mammoth of an array. That is exactly the situation regular expressions were made to prevent.

Thus my answer is: not really, you're already using the best (only) method.
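To make that trade-off concrete, here is a minimal sketch that checks the single pattern from the first post against a hand-written list of spellings. The variable names and the extra sample strings are assumptions made for illustration; they are not part of ignace's code.

```php
<?php
// The pattern from the first post, anchored so it must match the whole string.
$pattern = '/^zend\s?framework\s?\d?$/i';

// The spellings the first post wants to catch, written out as an explicit list.
$variants = array(
    'zendframework',
    'zendframework2',
    'zend framework2',
    'zend framework 2',
    'Zend Framework',
);

// Collect every variant the single pattern accepts.
$matched = array();
foreach ($variants as $variant) {
    if (preg_match($pattern, $variant) === 1) {
        $matched[] = $variant;
    }
}

// A spelling the pattern was not written for (two spaces) is not accepted.
$missed = preg_match($pattern, 'zend  framework') === 1;
```

One short pattern stands in for the whole list, but it only covers the spellings it was written for, so unusual wordings still slip through.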
Adam Posted October 3, 2012

From my experience, agencies try to use web terminology but show they have no idea what it really means by saying things like "the PHP is essential" in the advert. If you try to match every possible wording they come up with, you'll be at it forever.

I think you should search for smaller, simple string matches like "zend" and "symfony" that translate into tags like "zend-framework" and "symfony2". Plus, if you overcomplicate the patterns you're going to end up with multiple different versions of the same tag, which kind of defeats the point of tagging content together.
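A minimal sketch of that idea, assuming a hand-maintained map from canonical tag to a simple pattern. The function name `extractTags`, the tag names other than "zend-framework" and "symfony2", and the sample sentence are all hypothetical.

```php
<?php
/**
 * Return the canonical tags whose simple pattern occurs in $text.
 * The map below is an illustrative assumption, not a complete tag set.
 */
function extractTags($text)
{
    // canonical tag => short pattern that signals it
    $tagPatterns = array(
        'zend-framework' => 'zend',
        'symfony2'       => 'symfony',
        'php'            => 'php',
        'mysql'          => 'mysql',
        // "js" needs a trailing boundary so it does not fire on "json"
        'javascript'     => 'javascript|js\b',
    );

    $tags = array();
    foreach ($tagPatterns as $tag => $pattern) {
        // Only a leading word boundary, so "zendframework2" still counts as "zend"
        if (preg_match('/\b(?:' . $pattern . ')/i', $text) === 1) {
            $tags[] = $tag;
        }
    }

    return $tags;
}

$tags = extractTags('Must know Zend Framework 2 (zendframework), plus the PHP and JS.');
```

Because every spelling collapses onto one canonical name, "Zend Framework 2" and "zendframework" end up under the same tag instead of producing two.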
ignace Posted October 3, 2012

Ok. Thank you.
xylex Posted October 3, 2012

I'm not sure that regex matching for the mere presence of the terms is going to be very valuable. For example, a listing that refers to a URL ending in /contact.php would match just as well as one that says "looking for a developer with a strong PHP background."

Have you thought about using the Solr extension and Lucene to analyze for relevancy and tagging? This seems like an ideal use case for them.
ignace Posted October 4, 2012 (edited)

The \b on both sides of my regex stops matches inside longer words, though on its own it would not exclude .php, because the position between the dot and the p still counts as a word boundary. The real reason .php is not matched is that it simply is not in there: as you can see, I use $job->getTextDescription(), which returns only text, not HTML. I do store a stripped version of the HTML too, though.

Solr and Lucene are both full-text search engines and don't really fit my purpose, since every word then becomes a tag, though I may use them for something else. I would like to constrain it to a fixed set of tags.

Edited October 4, 2012 by ignace
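One caveat worth checking on the word-boundary point: in PCRE, \b also matches between a dot and a letter, and the noise-removal step in the first post keeps dots. So if a file name such as contact.php does appear in the text, `\bphp\b` will still find it. The sketch below shows this, along with one possible stricter variant that uses a lookbehind. The sample strings, the example.com URL and the variable names are assumptions for illustration.

```php
<?php
// Loose pattern: word boundaries on both sides, as in the original regex.
$loose = '/\bphp\b/i';

// Stricter sketch: additionally refuse a preceding dot, slash, hyphen or
// word character, so file extensions such as "contact.php" are skipped.
$strict = '/(?<![\w.\/-])php\b/i';

$url      = 'Apply via example.com/contact.php today';
$sentence = 'Looking for a developer with a strong PHP background.';

$looseOnUrl       = preg_match($loose, $url);        // boundary between "." and "p"
$strictOnUrl      = preg_match($strict, $url);       // lookbehind rejects the dot
$strictOnSentence = preg_match($strict, $sentence);  // preceded by a space
```

The lookbehind only guards the left side, so a sentence ending in "PHP." is still tagged.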