Jump to content

Recommended Posts

Hey everyone, I've got a challenge for anyone out there reading. It's not extremely hard, but for me I can't figure it out.

For this set of assignment, assume existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task will be to index the documents of the library to make them searchable for full-text retrieval.

You will implement the entire retrieval engine as a web application with PHP.

Preprocessing the documents' contents for full-text retrieval

Any retrieval or search engine relies on a document index which stores the occurrences of each term (words), removing irrelevant terms (stop word removal), and reducing the relevant ones to their case-insentive stem form (stemming). For example, parsing the following text:

"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."

This would produce the following Tokens:

frequent
experience
problem
include
unavailable
data
appropriate
form
lack
tool
approach
evacuation

Let us implement a tokenizer for splitting full texts into tokens, proceeding step -by-step. You can put all your functions into one file Functions.inc and then include this file by all other scripts that use any of its functions (for details see [a href=\"http://us3.php.net/manual/en/function.include.php)\" target=\"_blank\"]http://us3.php.net/manual/en/function.include.php)[/a].

a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. doc_id, author, title, and year)

b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single low-case words and returning them as an array. Show how your function performs by writing a dynamic webpage tokenzied.php that shows the original text of each document along with an alphabetical list of words.

c) Write a PHP function removeStopWords($token) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens, not in the stopword list (download the stop-word list). Modify the script tokenize.php from the previous step to display stopword-free tokenzation of all documents.

d) The final step is to implement a function which outputs the frequency of a word that occurs in each document. Develop a table such as this:

doc_id word frequency
1 evaluation 1
1 child 3
2 biology 1
2 camera 1
2 egg 2

I have completed part A B and C, but am having difficulties figuring out part D. I have code examples if anyone is interested in trying this out, and helping me to understan how to solve this problem. I am new with PHP so forgive me if this is a dumb question.



Link to comment
https://forums.phpfreaks.com/topic/4440-php-problem/
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.