kmblackbear06

Hey everyone, I've got a challenge for anyone out there reading. It's not extremely hard, but I can't figure it out.

For this set of assignments, assume the existence of a virtual library containing the abstracts (i.e. short descriptions) of scientific publications. Your task will be to index the documents of the library to make them searchable for full-text retrieval. You will implement the entire retrieval engine as a web application with PHP.

Preprocessing the documents' contents for full-text retrieval

Any retrieval or search engine relies on a document index which stores the occurrences of each term (word), removing irrelevant terms (stop word removal) and reducing the relevant ones to their case-insensitive stem form (stemming). For example, parsing the following text:

"Frequently experienced problems include unavailability of the data in an appropriate form and lack of tools and approaches for its evaluation."

would produce the following tokens:

frequent experience problem include unavailable data appropriate form lack tool approach evaluation

Let us implement a tokenizer for splitting full texts into tokens, proceeding step by step. You can put all your functions into one file, Functions.inc, and then include this file in all other scripts that use any of its functions (for details see http://us3.php.net/manual/en/function.include.php).

a) As a warm-up, write a PHP script publications.php that simply lists all the documents (i.e. doc_id, author, title, and year).

b) Write a PHP function text2tokens($some_text) that splits the input text (variable $some_text) into single lowercase words and returns them as an array. Show how your function performs by writing a dynamic webpage tokenized.php that shows the original text of each document along with an alphabetical list of words.

c) Write a PHP function removeStopWords($tokens) that receives an array of tokens (variable $tokens) and returns an array containing only those tokens that are not in the stop-word list (download the stop-word list). Modify the script tokenized.php from the previous step to display the stop-word-free tokenization of all documents.

d) The final step is to implement a function which outputs the frequency of each word that occurs in each document. Develop a table such as this:

doc_id  word        frequency
1       evaluation  1
1       child       3
2       biology     1
2       camera      1
2       egg         2

I have completed parts A, B, and C, but am having difficulty figuring out part D. I have code examples if anyone is interested in trying this out and helping me understand how to solve this problem. I am new to PHP, so forgive me if this is a dumb question.
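For part d, one possible approach is sketched below. It is only a sketch under assumptions: the function name wordFrequencies() and the $docs array (doc_id mapped to an array of tokens) are made up for illustration and are not part of the assignment, and the sample tokens are invented. It relies on PHP's built-in array_count_values() to count how often each token occurs within a single document.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the assignment).
// $docs maps doc_id => array of tokens that are already lowercased and
// stop-word free. Returns one row per (doc_id, word) pair.
function wordFrequencies(array $docs): array
{
    $rows = [];
    foreach ($docs as $docId => $tokens) {
        // array_count_values() returns token => number of occurrences.
        $counts = array_count_values($tokens);
        // Sort the words alphabetically within each document.
        ksort($counts, SORT_STRING);
        foreach ($counts as $word => $frequency) {
            $rows[] = [
                'doc_id'    => $docId,
                // Numeric tokens become integer keys, so cast back to string.
                'word'      => (string) $word,
                'frequency' => $frequency,
            ];
        }
    }
    return $rows;
}

// Invented sample data standing in for the tokenized documents.
$docs = [
    1 => ['evaluation', 'child', 'child', 'child'],
    2 => ['biology', 'camera', 'egg', 'egg'],
];

// Print a tab-separated doc_id / word / frequency table.
echo "doc_id\tword\tfrequency\n";
foreach (wordFrequencies($docs) as $row) {
    printf("%d\t%s\t%d\n", $row['doc_id'], $row['word'], $row['frequency']);
}
```

In a real script, the token arrays would presumably come from the functions in parts b and c, for example removeStopWords(text2tokens($text)) for each document, instead of the hard-coded sample above.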