Hey everyone, I've got a challenge for anyone out there reading. It's not extremely hard, but I can't figure it out on my own.
For this set of assignments, assume the existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task will be to index the documents of the library to make them searchable for full-text retrieval.
You will implement the entire retrieval engine as a web application with PHP.
Preprocessing the documents' contents for full-text retrieval
Any retrieval or search engine relies on a document index, which stores the occurrences of each term (word) after removing irrelevant terms (stop-word removal) and reducing the relevant ones to their case-insensitive stem form (stemming). For example, parsing the following text:
"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."
This would produce the following tokens:
Let us implement a tokenizer for splitting full texts into tokens, proceeding step by step. You can put all your functions into one file, Functions.inc, and then include this file in all other scripts that use any of its functions (for details see http://us3.php.net/manual/en/function.include.php).
a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. doc_id, author, title, and year)
b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single lowercase words and returns them as an array. Show how your function performs by writing a dynamic webpage tokenize.php that shows the original text of each document along with an alphabetical list of its words.
c) Write a PHP function removeStopWords($tokens) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens that are not in the stop-word list (download the stop-word list). Modify the script tokenize.php from the previous step to display the stop-word-free tokenization of all documents.
d) The final step is to implement a function that outputs how often each word occurs in each document. Develop a table such as this:
doc_id   word         frequency
1        evaluation   1
1        child        3
2        biology      1
2        camera       1
2        egg          2
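For context, steps b) and c) might look roughly like the sketch below. It is only an illustration under assumptions: the stop-word list is a tiny made-up stand-in for the downloadable one, removeStopWords() takes the list as a second parameter (the assignment only names one parameter), and stemming is left out entirely.

```php
<?php
// Sketch of steps b) and c). $stop_words is a hypothetical stand-in for the
// real stop-word list mentioned in the assignment.

// Split text into lowercase word tokens.
function text2tokens($some_text) {
    $lower = strtolower($some_text);
    // Split on every run of characters that is not a letter or digit,
    // dropping the empty strings that would appear at the edges.
    return preg_split('/[^a-z0-9]+/', $lower, -1, PREG_SPLIT_NO_EMPTY);
}

// Keep only the tokens that are not in the stop-word list.
// The second parameter is an assumption made for this sketch.
function removeStopWords($tokens, $stop_words) {
    // array_diff preserves the original keys, so reindex with array_values.
    return array_values(array_diff($tokens, $stop_words));
}

$stop_words = array('of', 'the', 'in', 'an', 'and', 'for', 'its');
$text = 'Frequently experienced problems include unavailability of the data '
      . 'in an appropriate form and lack of tools and approaches for its evaluation.';

$tokens = removeStopWords(text2tokens($text), $stop_words);
sort($tokens); // alphabetical list, as asked for in step b)
print_r($tokens);
```

The regular expression only treats ASCII letters and digits as word characters, so accented words would be split; that is good enough for a first pass but may need adjusting for real abstracts.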
I have completed parts a), b), and c), but I am having difficulty figuring out part d). I have code examples if anyone is interested in trying this out and helping me understand how to solve this problem. I am new to PHP, so forgive me if this is a dumb question.
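One possible direction for part d), as a minimal sketch rather than a definitive answer: PHP's built-in array_count_values() already counts how often each value occurs in an array, so it can be applied to each document's token list. The $docs array and the function name wordFrequencies() below are hypothetical; in the real script the tokens would come from text2tokens() and removeStopWords().

```php
<?php
// Hypothetical input: doc_id => tokens that are already lowercased and
// stop-word free (the output of steps b and c). The sample words are made up.
$docs = array(
    1 => array('child', 'evaluation', 'child', 'child'),
    2 => array('egg', 'biology', 'camera', 'egg'),
);

// Build one (doc_id, word, frequency) row per distinct word per document.
function wordFrequencies($docs) {
    $rows = array();
    foreach ($docs as $doc_id => $tokens) {
        // array_count_values maps each distinct token to its occurrence count.
        $counts = array_count_values($tokens);
        ksort($counts); // alphabetical order of words within a document
        foreach ($counts as $word => $frequency) {
            $rows[] = array($doc_id, $word, $frequency);
        }
    }
    return $rows;
}

// Print the rows as a tab-separated table.
echo "doc_id\tword\tfrequency\n";
foreach (wordFrequencies($docs) as $row) {
    echo implode("\t", $row), "\n";
}
```

Returning plain rows keeps the counting separate from the output, so the same function could feed an HTML table on a web page or an insert into an index table later.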