Hi all, this is my first time on this forum!
I have a background in HTML and CSS but have recently started a Master's in Computer Science, hoping to come out of it with the tools to get a job in PHP development.
Our first assignment has somewhat 'thrown me in the deep end', as we have to construct a search engine that indexes the words of a number of documents and ranks the documents using the TF*IDF algorithm along with the log rule associated with information retrieval.
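For context, here is a rough sketch of one common form of that weighting: log-scaled term frequency, 1 + log10(count), multiplied by inverse document frequency, log10(N / df). Which variant the assignment's "log rule" refers to is an assumption, and the function name tfIdf and its input arrays are hypothetical, made up for illustration only.

```php
<?php
// Hypothetical helper; the name and the input shapes are assumptions.
// $termCounts: term => raw count of that term in one document
// $docFreq:    term => number of documents that contain the term
// $numDocs:    total number of documents in the collection
function tfIdf(array $termCounts, array $docFreq, $numDocs)
{
    $weights = [];
    foreach ($termCounts as $term => $count) {
        // Skip terms with no usable statistics (avoids log(0) and division by zero).
        if ($count <= 0 || empty($docFreq[$term])) {
            continue;
        }
        $tf  = 1 + log10($count);                 // log-scaled term frequency
        $idf = log10($numDocs / $docFreq[$term]); // inverse document frequency
        $weights[$term] = $tf * $idf;
    }
    arsort($weights); // highest weight first, term => weight pairing kept
    return $weights;
}

// Made-up numbers: "airline" occurs in all 10 documents, "delay" in only one.
$weights = tfIdf(
    ['airline' => 10, 'delay' => 1],
    ['airline' => 10, 'delay' => 1],
    10
);
print_r($weights);
```

With these made-up numbers, a term that appears in every document gets an IDF of zero, so a rare term outranks it even with a much lower raw count.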
I am completely new to PHP, so the past week has been something of a crash course. This is the code I have so far:
<?php
$filename = 'airlines.txt';
$fp = fopen( $filename, 'r' );
$file_contents = fread( $fp, filesize( $filename ) );
fclose( $fp );
//$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents);
/*$file_contents = trim($file_contents);
$file_contents = preg_replace('/\h+/', ' ', $file_contents);
$file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents); */
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";
// Strip everything except letters, digits and whitespace
$new_contents = preg_replace("/[^A-Za-z0-9\s]/", "", $file_contents);
$new_contents = preg_replace($pat,$rep,$new_contents);
//preg_replace('~\s{2,}~', ' ', $text);
$commonWords = array('a', 'able', 'about', 'above' /* ........and another few hundred cut out of this so as not to hurt your eyes! */);
$lines = explode ( "\n", $new_contents);
$lines2 = implode (" ", $lines);
$words = explode ( " ", $lines2 );
$useful_words = array_diff( $words, $commonWords );
/*for($i = 0; $i < count($lines); $i++) {
echo "Piece $i = $lines[$i] <br />";
}*/
for($i = 0; $i < count($useful_words); $i++) {
echo "Words $i = $useful_words[$i] <br />";
}
//$arr=array("blah1","blah2","blah3");
file_put_contents("demo2.txt",implode(" ",$useful_words));
//$file_c = file_get_contents("demo.txt");
//$colms = explode(",",trim($file_c));
//print_r($colms);
//echo $lines[2];
?>
I've got to the stage where it strips out most of the stop words when the final array is printed, but they seem to have been replaced with spaces, or something I have not come across. As you may see, I also had a bit of trouble stripping the punctuation marks originally.
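One possible cause of the blank entries, assuming they show up as lines like "Words 3 = " with nothing after the equals sign: array_diff() keeps the original keys, so the result has gaps where the stop words used to be. A for loop counting from 0 to count() then hits keys that no longer exist and stops before reaching the last words. Splitting on a single space can also leave empty strings in the array. A minimal sketch of re-indexing, using made-up sample data in place of the real word lists:

```php
<?php
// Made-up sample data standing in for $words and $commonWords from the post.
$words       = array('the', 'airline', 'a', 'delayed', '', 'flight');
$commonWords = array('a', 'the');

// array_diff() preserves keys, so the result here has keys 1, 3, 4 and 5.
$useful_words = array_diff($words, $commonWords);

// Drop empty strings left over from splitting, then re-index from 0.
$useful_words = array_values(array_filter($useful_words, 'strlen'));

// foreach works whether or not the keys are contiguous.
foreach ($useful_words as $i => $word) {
    echo "Words $i = $word <br />\n";
}
```

Using foreach instead of a counting for loop sidesteps the gap problem even without array_values(); the re-indexing just keeps the printed numbers consecutive.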
I'm hoping someone can point me in the right direction on how to organise the words I have left after stripping out the stop words, which are of no use during the search. I need to store those words in another array and index them so that it records how many times each one appears in that document.
I've come across the function array array_count_values ( array $input ) in the manual, but I'm not sure about the best way to use it.
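For what it's worth, a minimal sketch of how array_count_values() could produce that per-document count. The sample words are made up, and lower-casing first is an assumption about how the index should treat "Airline" versus "airline":

```php
<?php
// Made-up sample standing in for the cleaned $useful_words array.
$useful_words = array('Airline', 'flight', 'airline', 'delay', 'flight', 'airline');

// Lower-case first so "Airline" and "airline" count as one term (an assumption).
$terms = array_map('strtolower', $useful_words);

// array_count_values() returns an array of value => number of occurrences.
$counts = array_count_values($terms);

// Most frequent terms first; arsort() keeps the term => count pairing.
arsort($counts);

foreach ($counts as $term => $count) {
    echo "$term: $count <br />\n";
}
```

To keep the counts for several documents apart, the result could be stored under the file name, for example $index['airlines.txt'] = $counts, where $index is a hypothetical array introduced here only for illustration.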
I've attached the files I've used if that helps.
Any help would be greatly appreciated!