Analyzing And Auto Tagging Content


ignace

I am creating a job website specifically for web developers (me!) as a fun side project for learning. It crawls other job websites, analyzes the content, and tags each job post (e.g. Drupal, WordPress, Zend Framework, Symfony, ...).

It's the tagging that I'm unsure how to approach.

As an experiment I wrote this:

// Remove noise: keep only letters, digits, whitespace, commas and periods.
$text = preg_replace('/[^a-z\\d\\s,.]/i', '', $job->getTextDescription());

// One alternation per category of terms to look for.
$frontend    = 'html\\d?|css\\d?|javascript|js|jquery|xml|xpath|mobile|seo';
$backend     = 'php|(my|ms)?sql(ite)?';
$formats     = 'json|xml';
$webservices = 'soap|rest';
$frameworks  = 'drupal|facebook|wordpress|zend\\s?framework\\s?\\d?|symfony\\s?(\\d|components)?';

$patterns = array($frontend, $backend, $formats, $webservices, $frameworks);

// Match any listed term as a whole word, case-insensitively.
preg_match_all(
 '/\\b('. implode('|', $patterns) .')\\b/i',
 $text, $matches
);

$tags = array();

// Lowercase first so that "PHP" and "php" do not become two separate tags.
foreach (array_unique(array_map('strtolower', $matches[0])) as $tagName) {
 $tag = new Tag();
 $tag->setName($tagName);
 $tags[] = $tag;
}

return $tags;

As you can see, I am trying to match as broadly as possible, e.g. I would match: zendframework, zendframework2, zend framework2, zend framework 2.
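
A quick way to check that claim is to run the Zend pattern on its own against the four spellings listed. This is only a sketch: the pattern is lifted out of the larger alternation, and the `$variants` and `$results` names are made up for the example.

```php
<?php
// The Zend Framework alternative from the snippet, isolated for testing.
$pattern = '/\bzend\s?framework\s?\d?\b/i';

// The four spellings mentioned in the post.
$variants = array(
    'zendframework',
    'zendframework2',
    'zend framework2',
    'zend framework 2',
);

$results = array();
foreach ($variants as $variant) {
    // preg_match() returns 1 on a match; $m[0] then holds the matched text.
    $results[$variant] = preg_match($pattern, $variant, $m) === 1 ? $m[0] : null;
}

print_r($results);
```

All four spellings match in full. An abbreviation such as "ZF2" would not, so any shorthand has to be added to the alternation explicitly.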

I am not asking for help with the code, just with the concept of tagging content in the best way possible. Does anyone know of a better way to tag content than using weird-ass regexes like: zend\s?framework\s?\d?

That's not weird; it's a rather plain and simple regular expression. That is, unless you consider all regular expressions "weird", in which case I'm afraid you'll just have to bite the bullet.

The only other way is to set up an array with every possible spelling variant of the words you're looking for, which will quickly become an unwieldy mammoth of an array. That is exactly the situation regular expressions were made to prevent.

So my answer is: not really. You're already using the best (and only) method.

In my experience, agencies try to use web terminology but reveal that they have no idea what it really means by writing things like "the PHP is essential" in the advert. If you try to match every possible wording they come up with, you'll be at it forever. I think you should search for smaller, simpler string matches like "zend" and "symfony" that translate into tags like "zend-framework" and "symfony2". Plus, if you overcomplicate the patterns you're going to end up with multiple versions of the same tag, which kind of defeats the point of tagging content together.
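
To make that suggestion concrete, here is a minimal sketch of a keyword-to-tag map. The `$canonicalTags` entries and the `extractCanonicalTags()` helper are hypothetical names invented for this example, and anchoring only the start of the keyword with `\b` is one possible choice, not the only one.

```php
<?php
// Hypothetical keyword => canonical tag map; the entries are illustrative only.
$canonicalTags = array(
    'zend'      => 'zend-framework',
    'symfony'   => 'symfony2',
    'wordpress' => 'wordpress',
    'drupal'    => 'drupal',
);

/**
 * Return the canonical tags whose keyword starts a word somewhere in $text.
 */
function extractCanonicalTags($text, array $canonicalTags)
{
    $found = array();
    foreach ($canonicalTags as $keyword => $tag) {
        // A leading \b only: "zend" matches "Zend Framework" and "zendframework".
        if (preg_match('/\b' . preg_quote($keyword, '/') . '/i', $text)) {
            $found[$tag] = true; // array keys de-duplicate the tags
        }
    }
    return array_keys($found);
}

$tags = extractCanonicalTags(
    'Experience with Zend Framework 2 or zendframework is a plus; Symfony components welcome.',
    $canonicalTags
);
print_r($tags);
```

However the advert spells the framework, each keyword yields exactly one tag, so the set of tags stays fixed. The price is that very short keywords such as "js" would need stricter boundaries to avoid false positives.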


I'm not sure that matching on the mere presence of the terms is going to be very valuable. For example, a listing that refers to a URL ending in /contact.php would match just as well as one that says "looking for a developer with a strong PHP background."

Have you thought about using the Solr extension and Lucene to analyze for relevancy and tagging? This seems like an ideal use case for that.

.php should not be matched: that is why I have \b on both sides in my regex, and, more to the point, it simply is not in there. As you can see, I use $job->getTextDescription(), which returns only text, not HTML. I do store a stripped version of the HTML too, though.
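
One caveat worth checking here: `\b` treats a dot as a word boundary, so if a URL does survive in the plain text, `\bphp\b` would still match the extension. The sketch below uses a made-up sample sentence and the same clean-up as the snippet above. The negative lookbehind is a hypothetical stricter variant, not something from the original code.

```php
<?php
// Made-up sample text containing a URL that ends in .php.
$text = 'Apply at example.com/contact.php today';

// Same clean-up as in the snippet above: dots survive, slashes do not.
$clean = preg_replace('/[^a-z\d\s,.]/i', '', $text);

// "." is a non-word character, so \b sits between "." and "p".
$plain = preg_match('/\bphp\b/i', $clean);

// Stricter variant: the lookbehind rejects "php" when a dot comes right before it.
$strict = preg_match('/(?<!\.)\bphp\b/i', $clean);

// A genuine mention still matches the stricter pattern.
$genuine = preg_match('/(?<!\.)\bphp\b/i', 'a strong PHP background');

var_dump($clean, $plain, $strict, $genuine);
```

So it is the text-only description that keeps URLs out, rather than `\b`. Where URLs can appear in the text, a lookbehind like this is one way to skip file extensions.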

Solr and Lucene are both full-text search engines and don't really fit my purpose, since every word then effectively becomes a tag, though I may use them for something else. I would like to constrain the output to a fixed set of tags.
