PHP Problem

kmblackbear06 · March 8, 2006

Hey everyone, I've got a challenge for anyone out there reading. It's probably not extremely hard, but I can't figure it out.

For this set of assignments, assume the existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task is to index the documents of the library to make them searchable by full-text retrieval.

You will implement the entire retrieval engine as a web application with PHP.

Preprocessing the documents' contents for full-text retrieval

Any retrieval or search engine relies on a document index that stores the occurrences of each term (word). Building the index involves removing irrelevant terms (stop-word removal) and reducing the relevant ones to their case-insensitive stem form (stemming). For example, parsing the following text:

"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."

This would produce the following tokens:

frequent
experience
problem
include
unavailable
data
appropriate
form
lack
tool
approach
evaluation

Let us implement a tokenizer for splitting full texts into tokens, proceeding step by step. You can put all your functions into one file, Functions.inc, and then include this file in all other scripts that use any of its functions (for details see http://us3.php.net/manual/en/function.include.php).

a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. their doc_id, author, title, and year).

b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single lower-case words and returns them as an array. Show how your function performs by writing a dynamic webpage tokenize.php that shows the original text of each document along with an alphabetical list of its words.
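One way part (b) could look is sketched below. It is only a minimal sketch, and it assumes the abstracts are plain English (ASCII) text, so that every run of characters other than letters and digits can be treated as a separator. Under that assumption a hyphenated word such as "full-text" becomes two tokens.

```php
<?php
// Sketch of part (b): split a text into lower-case word tokens.
// Assumption: plain ASCII English text; anything that is not a letter
// or a digit is treated as a separator.
function text2tokens(string $some_text): array
{
    // Lower-case first so that "Data" and "data" become the same token.
    $lower = strtolower($some_text);

    // Split on runs of non-alphanumeric characters. PREG_SPLIT_NO_EMPTY
    // drops the empty strings that would otherwise appear at the edges.
    $tokens = preg_split('/[^a-z0-9]+/', $lower, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure; fall back to an empty list.
    return $tokens === false ? [] : $tokens;
}
```

For the alphabetical list on the webpage, the returned array can be passed through sort() before it is printed.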

c) Write a PHP function removeStopWords($tokens) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens that are not in the stop-word list (download the stop-word list). Modify the script tokenize.php from the previous step to display the stop-word-free tokenization of all documents.
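A possible shape for part (c) is sketched below. Note that it takes the stop-word list as a second parameter, $stopWords, which is an assumption made here for illustration; the assignment's signature has only one parameter and does not say how the downloaded list is stored.

```php
<?php
// Sketch of part (c): drop every token that appears in a stop-word list.
// Assumption: $stopWords is the downloaded list, already loaded into an
// array of lower-case words (this second parameter is not in the assignment).
function removeStopWords(array $tokens, array $stopWords): array
{
    // Flip the list so that each stop word becomes an array key. Checking
    // a key with isset() is a hash lookup, which avoids scanning the whole
    // list with in_array() for every token.
    $lookup = array_flip($stopWords);

    $kept = [];
    foreach ($tokens as $token) {
        if (!isset($lookup[$token])) {
            $kept[] = $token;
        }
    }
    return $kept;
}
```

If the list turns out to be a text file with one word per line, it could be loaded with file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), where $path is whatever location the file was saved to.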

d) The final step is to implement a function that outputs how often each word occurs in each document. Develop a table such as this:

doc_id  word        frequency
1       evaluation  1
1       child       3
2       biology     1
2       camera      1
2       egg         2
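A table like the one above could be produced along the following lines. This is a sketch only: the function name wordFrequencies and the shape of its input (an array mapping each doc_id to that document's token array, i.e. the output of the two previous steps) are assumptions, not part of the assignment. Sorting the words alphabetically within each document is also just a choice made here.

```php
<?php
// Sketch of part (d): count how often each token occurs in each document.
// Assumption: $docs maps doc_id => array of tokens for that document.
function wordFrequencies(array $docs): array
{
    $rows = [];
    foreach ($docs as $docId => $tokens) {
        // array_count_values() maps each distinct token to its count.
        $counts = array_count_values($tokens);

        // Order the words alphabetically within one document.
        ksort($counts, SORT_STRING);

        foreach ($counts as $word => $frequency) {
            // Numeric tokens become integer keys, so cast back to string.
            $rows[] = [
                'doc_id'    => $docId,
                'word'      => (string) $word,
                'frequency' => $frequency,
            ];
        }
    }
    return $rows;
}
```

Each element of the returned array corresponds to one table row, so the page can loop over it and print doc_id, word, and frequency, passing the word through htmlspecialchars() on the way out.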

I have completed parts A, B, and C, but am having difficulty figuring out part D. I have code examples if anyone is interested in trying this out and helping me understand how to solve this problem. I am new to PHP, so forgive me if this is a dumb question.
