jonnyhocks
Posts posted by jonnyhocks
Hi all, this is my first time on this forum!
I have a background in HTML and CSS, but I have recently started a Master's in Computer Science, hoping to come out of it with the tools to get a job in PHP development.
Our first assignment has somewhat 'thrown me in the deep end', as we have to construct a search engine that indexes the words of a number of documents and ranks the documents using the TF*IDF algorithm along with the log rule associated with information retrieval.
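For reference, here is a minimal sketch of what TF*IDF weighting could look like in PHP. The assignment's exact "log rule" is not stated here, so the formula below, (1 + log10(tf)) * log10(N / df), is an assumption (it is one common variant), and the function name tfidf_weights and the sample documents are made up for illustration.

```php
<?php
// Hypothetical helper: the name and the exact weighting are assumptions,
// not the assignment's specification.
// $docs maps a document name to its list of already-cleaned words.
function tfidf_weights(array $docs): array
{
    $n = count($docs);
    $termCounts = [];   // document => [term => raw count]
    $docFreq = [];      // term => number of documents containing it

    foreach ($docs as $name => $words) {
        $termCounts[$name] = array_count_values($words);
        foreach (array_keys($termCounts[$name]) as $term) {
            $docFreq[$term] = ($docFreq[$term] ?? 0) + 1;
        }
    }

    $weights = [];
    foreach ($termCounts as $name => $counts) {
        foreach ($counts as $term => $tf) {
            // Assumed "log rule": (1 + log10(tf)) * log10(N / df).
            $weights[$name][$term] = (1 + log10($tf)) * log10($n / $docFreq[$term]);
        }
    }
    return $weights;
}

// Made-up sample documents.
$docs = [
    'a.txt' => ['airline', 'flight', 'airline'],
    'b.txt' => ['flight', 'delay'],
];
$weights = tfidf_weights($docs);
print_r($weights);
```

With this variant, a word that appears in every document (here 'flight') gets a weight of zero, because log10(N / df) is log10(1).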
I am completely new to PHP, so the past week has been something of a crash course. This is the code I have so far:
<?php
$filename = 'airlines.txt';
$fp = fopen( $filename, 'r' );
$file_contents = fread( $fp, filesize( $filename ) );
fclose( $fp );

//$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents);
/*$file_contents = trim($file_contents);
$file_contents = preg_replace('/\h+/', ' ', $file_contents);
$file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents);
*/

$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

$new_contents = preg_replace("/[^A-Za-z0-9\s\s+]/","",$file_contents);
$new_contents = preg_replace($pat,$rep,$new_contents);
//preg_replace('~\s{2,}~', ' ', $text);

$commonWords = array('a','able','about','above'........and another few hundreds cut out of this not to hurt your eyes!);

$lines = explode ( "\n", $new_contents);
$lines2 = implode (" ", $lines);
$words = explode ( " ", $lines2 );
$useful_words = array_diff( $words, $commonWords );

/*for($i = 0; $i < count($lines); $i++) {
    echo "Piece $i = $lines[$i] <br />";
}*/

for($i = 0; $i < count($useful_words); $i++) {
    echo "Words $i = $useful_words[$i] <br />";
}

//$arr=array("blah1","blah2","blah3");
file_put_contents("demo2.txt",implode(" ",$useful_words));
//$file_c = file_get_contents("demo.txt");
//$colms = explode(",",trim($file_c));
//print_r($colms);
//echo $lines[2];
?>
I've got to the stage where the code strips out most of the stop words when the final array is printed, but they seem to have been replaced with spaces or something else that I have not come across. As you may see, I also had a bit of trouble originally stripping the punctuation marks.
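One possible cause of the blank entries, offered as a guess rather than a confirmed diagnosis: array_diff() keeps the original array keys, so the result has gaps in its numbering, and a for loop counting from 0 then hits keys that no longer exist. The sketch below uses made-up sample words to show the effect and how array_values() renumbers the result.

```php
<?php
// Sample data is made up for illustration.
$words = ['the', 'airline', 'a', 'flight', 'was', 'late'];
$commonWords = ['the', 'a', 'was'];

// array_diff() keeps the original keys, so this leaves keys 1, 3 and 5.
$useful_words = array_diff($words, $commonWords);

// array_values() renumbers the result 0, 1, 2 so a counting loop works again.
$useful_words = array_values($useful_words);

// foreach also avoids the problem, since it only visits keys that exist.
foreach ($useful_words as $i => $word) {
    echo "Words $i = $word\n";
}
```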
I'm hoping someone can point me in the right direction on how to organise the words I have left after stripping the stop words, which are of no use during the search. I need to store those words in another array and build an index that records how many times each one appears in that document.
I've come across the function array_count_values(array $input) in the manual, but I'm not sure about the best way to use it.
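As a rough sketch of how array_count_values() could be used here: it takes a list of values and returns an array mapping each distinct value to the number of times it occurs. The sample text below is made up, and lowercasing the words before counting is an assumption about what the index should treat as the same word.

```php
<?php
// Made-up sample text standing in for the cleaned file contents.
$text = "Airline delays  flight\nairline flight flight";

// Split on any run of whitespace and drop empty pieces.
$words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// array_count_values() returns word => number of occurrences.
$counts = array_count_values($words);

// Sort by count, most frequent words first, keeping the word keys.
arsort($counts);

foreach ($counts as $word => $count) {
    echo "$word: $count\n";
}
```

The resulting array can serve as the per-document term-frequency index: one such array per document, keyed by word.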
I've attached the files I've used if that helps.
Any help would be greatly appreciated!
[attachment deleted by admin]
Hello all!
in Introductions
My name's Jon, and I've decided to spend some serious time trying to nail down PHP inside my head.
I'm a front-end developer, but I have had to learn how to theme WordPress websites at work, which led to my basic understanding of PHP. Now I want to get further into the language, so I have joined up here to try to work through other people's code, etc.
I'd also like some suggestions for mock-up projects that I could work on, because I find that when you are forced into doing something, that is when it sticks.
Looking forward to it!