



PHP Problem


Hey everyone, I've got a challenge for anyone out there reading. It's not extremely hard, but I can't figure it out.

For this set of assignments, assume the existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task is to index the documents of the library to make them searchable for full-text retrieval.

You will implement the entire retrieval engine as a web application with PHP.

Preprocessing the documents' contents for full-text retrieval

Any retrieval or search engine relies on a document index that stores the occurrences of each term (word). Building it involves removing irrelevant terms (stop-word removal) and reducing the relevant ones to their case-insensitive stem form (stemming). For example, parsing the following text:

"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."

This would produce the following tokens:


Let us implement a tokenizer for splitting full texts into tokens, proceeding step by step. You can put all your functions into one file, Functions.inc, and then include this file in all other scripts that use any of its functions (for details see http://us3.php.net/manual/en/function.include.php).

a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. doc_id, author, title, and year).

b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single lower-case words and returns them as an array. Show how your function performs by writing a dynamic webpage tokenize.php that shows the original text of each document along with an alphabetical list of its words.
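
A minimal sketch of how part b) might look, assuming a "word" means any run of ASCII letters or digits (the assignment does not say how punctuation, hyphens, or non-ASCII characters should be treated, so that rule is a guess):

```php
<?php
// Sketch only: splits text into lower-case word tokens.
// Assumption: a token is any run of ASCII letters or digits.
function text2tokens($some_text)
{
    // Lower-case first so "Data" and "data" become the same token.
    $lower = strtolower($some_text);
    // Split on every run of characters that is not a letter or digit.
    // PREG_SPLIT_NO_EMPTY drops empty strings left at the start or end.
    return preg_split('/[^a-z0-9]+/', $lower, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = text2tokens("Lack of tools and approaches for its evaluation.");
sort($tokens); // alphabetical order, as the task asks for the display
print_r($tokens);
```

Sorting is kept outside the function here so that later steps can still see the tokens in their original order if they need to.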

c) Write a PHP function removeStopWords($tokens) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens that are not in the stop-word list (download the stop-word list). Modify the script tokenize.php from the previous step to display the stop-word-free tokenization of all documents.
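
Part c) could be sketched roughly as follows. The second parameter $stop_words and its short default list are made up for illustration; the assignment's function takes only $tokens and relies on the downloaded list, which is not shown in this post.

```php
<?php
// Sketch only: $stop_words and its default value are hypothetical;
// the real stop-word list is a separate download.
function removeStopWords($tokens, $stop_words = array('of', 'the', 'in', 'an', 'and', 'for', 'its'))
{
    // array_diff keeps the tokens that do not appear in $stop_words;
    // array_values re-indexes the result starting from 0.
    return array_values(array_diff($tokens, $stop_words));
}

$tokens = array('lack', 'of', 'tools', 'and', 'approaches');
print_r(removeStopWords($tokens));
```

If the downloaded list has one word per line, it could be loaded with file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), where the file name is again just a placeholder.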

d) The final step is to implement a function that outputs the frequency of each word occurring in each document. Produce a table such as this:

doc_id word frequency
1 evaluation 1
1 child 3
2 biology 1
2 camera 1
2 egg 2
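
One possible approach to part d), sketched under the assumption that each document's tokens are already available as an array keyed by doc_id. The function name wordFrequencies and the sample data are invented for illustration; PHP's built-in array_count_values does the actual counting.

```php
<?php
// Sketch only: $documents is a hypothetical array of doc_id => token list,
// e.g. the result of text2tokens() followed by removeStopWords().
function wordFrequencies($documents)
{
    $rows = array();
    foreach ($documents as $doc_id => $tokens) {
        // array_count_values maps each distinct token to how often it occurs.
        $counts = array_count_values($tokens);
        ksort($counts); // list the words of each document alphabetically
        foreach ($counts as $word => $frequency) {
            $rows[] = array($doc_id, $word, $frequency);
        }
    }
    return $rows;
}

// Invented sample data, not the real library contents.
$documents = array(
    1 => array('child', 'evaluation', 'child', 'child'),
    2 => array('egg', 'biology', 'camera', 'egg'),
);

echo "doc_id word frequency\n";
foreach (wordFrequencies($documents) as $row) {
    echo implode(' ', $row), "\n";
}
```

Returning rows rather than printing inside the function keeps the counting separate from the HTML, so the same rows could be rendered as a table or stored in an index later.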

I have completed parts A, B, and C, but I am having difficulty figuring out part D. I have code examples if anyone is interested in trying this out and helping me understand how to solve this problem. I am new to PHP, so forgive me if this is a dumb question.
