jonnyhocks Posted October 25, 2010

Hi all, this is my first time on this forum! I have a background in HTML and CSS but have recently started a Masters in Computer Science, hoping to come out of it with the tools to get a job in PHP development. Our first assignment has somewhat 'thrown me in the deep end', as we have to construct a search engine that indexes the words of a number of documents and ranks them using the TF*IDF algorithm along with the log rule associated with information retrieval. I am completely new to PHP, so the past week has been something of a crash course. This is the code I have so far:

<?php
$filename = 'airlines.txt';
$fp = fopen( $filename, 'r' );
$file_contents = fread( $fp, filesize( $filename ) );
fclose( $fp );

//$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents);

/*$file_contents = trim($file_contents);
$file_contents = preg_replace('/\h+/', ' ', $file_contents);
$file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents); */

$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

$new_contents = preg_replace("/[^A-Za-z0-9\s\s+]/","",$file_contents);
$new_contents = preg_replace($pat,$rep,$new_contents);
//preg_replace('~\s{2,}~', ' ', $text);

$commonWords = array('a','able','about','above'........and another few hundreds cut out of this not to hurt your eyes!);

$lines = explode ( "\n", $new_contents);
$lines2 = implode (" ", $lines);
$words = explode ( " ", $lines2 );
$useful_words = array_diff( $words, $commonWords );

/*for($i = 0; $i < count($lines); $i++)
{
    echo "Piece $i = $lines[$i] <br />";
}*/

for($i = 0; $i < count($useful_words); $i++)
{
    echo "Words $i = $useful_words[$i] <br />";
}

//$arr=array("blah1","blah2","blah3");
file_put_contents("demo2.txt",implode(" ",$useful_words));
//$file_c = file_get_contents("demo.txt");
//$colms = explode(",",trim($file_c));
//print_r($colms);
//echo $lines[2];
?>

I've got to the stage where the code strips out most of the stop words, but when the final array is printed they appear to have been replaced with spaces or something that I have not come across, because, as you may see, I had a bit of trouble originally stripping the punctuation marks.

I'm hoping someone can point me in the right direction as to how to organise the words I have left after stripping the stop words, which are of no use during the search. I need to store those words in another array and index them so that it records how many times each word appears in that document. I've come across the function array array_count_values ( array $input ) in the manual, but I'm not sure about the best way to use it.

I've attached the files I've used if that helps. Any help would be greatly appreciated!

[attachment deleted by admin]
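One possible way to handle the counting step is sketched below. It is an illustration under stated assumptions, not a definitive answer. array_diff() preserves the original array keys, so after the stop words are removed the numeric keys have gaps, and a for loop running from 0 to count() will hit missing indexes and print blanks. Passing the result through array_values() renumbers it, and array_count_values() then gives a word => count map for the document. The sample text, the shortened stop-word list, the lowercasing step, and the tfidf_weight() helper are all hypothetical. The helper uses one common log-scaled variant of TF*IDF, which may not match the exact log rule the assignment expects.

```php
<?php
// Hypothetical helper: one common TF*IDF variant with log-scaled term frequency.
// $tf = occurrences of the term in the document,
// $df = number of documents containing the term,
// $n  = total number of documents.
// Assumption: the "log rule" required by the assignment may differ from this.
function tfidf_weight($tf, $df, $n)
{
    if ($tf <= 0 || $df <= 0) {
        return 0.0;
    }
    return (1 + log10($tf)) * log10($n / $df);
}

// Hypothetical sample text standing in for the contents of airlines.txt.
$file_contents = "The airline, the AIRLINE!  Flights are   late.\n"
               . "Late flights cost the airline money.";

// Shortened stop-word list for illustration; the real list is much longer.
$commonWords = array('a', 'able', 'about', 'above', 'the', 'are');

// Lowercase so "Airline" and "airline" count as the same term (an assumption).
$text = strtolower($file_contents);

// Replace punctuation with a space instead of deleting it,
// so neighbouring words are never glued together.
$text = preg_replace('/[^a-z0-9\s]/', ' ', $text);

// Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty strings.
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// array_diff() keeps the original keys, which leaves gaps in the numbering.
// array_values() renumbers from 0 so an index-based loop sees every word.
$useful_words = array_values(array_diff($words, $commonWords));

// Map each remaining word to the number of times it occurs in this document.
$term_counts = array_count_values($useful_words);
arsort($term_counts); // highest counts first

foreach ($term_counts as $word => $count) {
    echo "$word = $count <br />\n";
}
```

Looping with foreach rather than an index-based for also avoids the gap problem, since it walks whatever keys the array actually has. For several documents, the same counting could be repeated per file, with the per-document counts feeding $tf and the number of files containing a word feeding $df.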