jonnyhocks

New Members

Posts: 1
Joined: October 25, 2010
Last visited: November 14, 2020

Everything posted by jonnyhocks

Hello all!

jonnyhocks posted a topic in Introductions

My name's Jon, and I've decided to spend some serious time trying to nail down PHP inside my head. I'm a front-end developer, but I've had to learn how to theme WordPress websites at work, and that led to my basic understanding of PHP. Now I want to get further into the language, so I've joined up here to try to work through other people's code. I'd also like some suggestions for mock-up projects that I could work on, because I find that things stick best when you are forced into doing them. Looking forward to it!
- February 20, 2012
- 2 replies
PHP TF*IDF Search application.

jonnyhocks posted a topic in PHP Coding Help

Hi all, this is my first time on this forum! I have a background in HTML and CSS but have recently started a Masters in Computer Science, hoping to come out of it with the tools to get a job in PHP development. Our first assignment has somewhat 'thrown me in the deep end', as we have to construct a search engine that indexes the words of a number of documents and ranks them using the TF*IDF algorithm along with the log rule associated with information retrieval. I am completely new to PHP, so the past week has been something of a crash course. This is the code I have so far:

<?php
$filename = 'airlines.txt';
$fp = fopen( $filename, 'r' );
$file_contents = fread( $fp, filesize( $filename ) );
fclose( $fp );

//$new_contents = ereg_replace("[^A-Za-z0-9]", "", $file_contents);
/*$file_contents = trim($file_contents);
$file_contents = preg_replace('/\h+/', ' ', $file_contents);
$file_contents = preg_replace('/\v{3,}/', PHP_EOL.PHP_EOL, $file_contents);
*/

$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

$new_contents = preg_replace("/[^A-Za-z0-9\s\s+]/","",$file_contents);
$new_contents = preg_replace($pat,$rep,$new_contents);
//preg_replace('~\s{2,}~', ' ', $text);

$commonWords = array('a','able','about','above'........and another few hundreds cut out of this not to hurt your eyes!);

$lines = explode ( "\n", $new_contents);
$lines2 = implode (" ", $lines);
$words = explode ( " ", $lines2 );
$useful_words = array_diff( $words, $commonWords );

/*for($i = 0; $i < count($lines); $i++)
{
    echo "Piece $i = $lines[$i] <br />";
}*/

for($i = 0; $i < count($useful_words); $i++)
{
    echo "Words $i = $useful_words[$i] <br />";
}

//$arr=array("blah1","blah2","blah3");
file_put_contents("demo2.txt",implode(" ",$useful_words));
//$file_c = file_get_contents("demo.txt");
//$colms = explode(",",trim($file_c));
//print_r($colms);
//echo $lines[2];
?>

I've got to the stage where this strips out most of the stop words when the final array is printed, but they seem to have been replaced with spaces or something else that I have not come across. As you may see, I also had a bit of trouble originally stripping the punctuation marks.

I'm hoping someone can point me in the right direction on how to organise the words I have left after stripping the stop words, which are of no use during the search. I need to store those words in another array and index them, recording how many times each one appears in that document. I've come across the function array_count_values() in the manual, but I'm not sure about the best way to use it.

I've attached the files I've used if that helps. Any help would be greatly appreciated! [attachment deleted by admin]
- October 25, 2010
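
One likely source of the blank entries in the post above: array_diff() preserves the original keys, so after the stop words are removed the array has gaps, and a for loop running from 0 to count() hits indexes that no longer exist. Note also that array_diff() is case-sensitive, so "The" will not match 'the'. Below is a minimal sketch of one way to go from raw text to per-document word counts with array_count_values() and then to TF*IDF weights. The function names tokenize() and tfidf(), the short stop-word list, and the two sample documents are hypothetical, and the weighting (1 + log10(tf)) * log10(N / df) is only an assumption about which "log rule" the assignment means.

```php
<?php
// Hypothetical helpers and sample data for illustration only.

/** Lowercase the text, strip punctuation, split on whitespace, drop stop words. */
function tokenize(string $text, array $stopWords): array
{
    $text  = strtolower($text);
    $text  = preg_replace('/[^a-z0-9\s]/', '', $text);
    // PREG_SPLIT_NO_EMPTY avoids empty strings from repeated whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // array_diff() keeps the original keys, so reindex from 0 with array_values().
    return array_values(array_diff($words, $stopWords));
}

/** Return [docName => [term => weight]] using an assumed log-weighted TF*IDF. */
function tfidf(array $docs, array $stopWords): array
{
    $counts = []; // term frequencies per document
    $df     = []; // number of documents containing each term
    foreach ($docs as $name => $text) {
        // array_count_values() maps each distinct word to its number of occurrences.
        $counts[$name] = array_count_values(tokenize($text, $stopWords));
        foreach (array_keys($counts[$name]) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $n       = count($docs);
    $weights = [];
    foreach ($counts as $name => $termCounts) {
        $weights[$name] = [];
        foreach ($termCounts as $term => $tf) {
            // Assumed weighting: (1 + log10(tf)) * log10(N / df).
            $weights[$name][$term] = (1 + log10($tf)) * log10($n / $df[$term]);
        }
    }
    return $weights;
}

// Usage with made-up documents and a tiny stop-word list.
$stopWords = ['a', 'the', 'to', 'and', 'is'];
$docs = [
    'doc1' => 'The airline flies to Paris, and the airline is cheap!',
    'doc2' => 'A cheap flight to Rome.',
];

$weights = tfidf($docs, $stopWords);
print_r($weights);
```

A term that appears in every document gets a weight of 0 under this scheme, which is the intended effect of the IDF part: words common to the whole collection do not help rank one document above another.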
