kmblackbear06
Posted March 8, 2006

Hey everyone, I've got a challenge for anyone out there reading. It's not extremely hard, but I can't figure it out.

For this set of assignments, assume the existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task will be to index the documents of the library to make them searchable for full-text retrieval. You will implement the entire retrieval engine as a web application with PHP.

Preprocessing the documents' contents for full-text retrieval

Any retrieval or search engine relies on a document index, which stores the occurrences of each term (word). Building it involves removing irrelevant terms (stop word removal) and reducing the relevant ones to their case-insensitive stem form (stemming). For example, parsing the following text:

"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."

would produce the following tokens:

frequent
experience
problem
include
unavailable
data
appropriate
form
lack
tool
approach
evaluation

Let us implement a tokenizer for splitting full texts into tokens, proceeding step-by-step. You can put all your functions into one file, Functions.inc, and then include this file in all other scripts that use any of its functions (for details see http://us3.php.net/manual/en/function.include.php).

a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. doc_id, author, title, and year).

b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single lowercase words and returns them as an array. Show how your function performs by writing a dynamic webpage tokenize.php that shows the original text of each document along with an alphabetical list of its words.
c) Write a PHP function removeStopWords($tokens) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens that are not in the stop-word list (download the stop-word list). Modify the script tokenize.php from the previous step to display the stop-word-free tokenization of all documents.

d) The final step is to implement a function which outputs the frequency of each word that occurs in each document. Develop a table such as this:

doc_id  word        frequency
1       evaluation  1
1       child       3
2       biology     1
2       camera      1
2       egg         2

I have completed parts A, B, and C, but am having difficulty figuring out part D. I have code examples if anyone is interested in trying this out and helping me understand how to solve this problem. I am new to PHP, so forgive me if this is a dumb question.
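For part d, one possible approach is sketched below. It is only a rough sketch, not the assignment's official solution: the function name wordFrequencies() and the $docTokens array are made up for illustration, and the sample tokens are invented stand-ins for what removeStopWords(text2tokens(...)) would return for each document. It leans on PHP's built-in array_count_values(), which turns an array of tokens into a word => count map.

```php
<?php
// Hypothetical helper (name not taken from the assignment): counts how often
// each token occurs in each document and returns one row per
// (doc_id, word, frequency) combination.
// $docTokens maps doc_id => array of tokens, e.g. the result of
// removeStopWords(text2tokens($abstract)) for every document.
function wordFrequencies(array $docTokens): array
{
    $rows = [];
    foreach ($docTokens as $docId => $tokens) {
        // array_count_values() maps each distinct token to its number of occurrences.
        $counts = array_count_values($tokens);
        // Sort words alphabetically within a document.
        ksort($counts, SORT_STRING);
        foreach ($counts as $word => $frequency) {
            $rows[] = [
                'doc_id'    => $docId,
                'word'      => (string) $word, // numeric-looking words become int keys
                'frequency' => $frequency,
            ];
        }
    }
    return $rows;
}

// Invented sample data standing in for real, already tokenized abstracts.
$docTokens = [
    1 => ['child', 'evaluation', 'child', 'child'],
    2 => ['egg', 'biology', 'camera', 'egg'],
];

$rows = wordFrequencies($docTokens);

// Print a simple tab-separated table.
echo "doc_id\tword\tfrequency\n";
foreach ($rows as $row) {
    echo $row['doc_id'], "\t", $row['word'], "\t", $row['frequency'], "\n";
}
```

In the real script, $docTokens would presumably be filled in a loop over the documents fetched in publications.php, and the echo loop could emit HTML table rows instead of tab-separated text.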