Partition , Clustering , Hashing

abhinavsingh89 · November 27, 2010

I have a database of more than 35k words and because of that i am facing speed issues. Now i want to divide my database on partition basis i.e. partition A must have words starting from A only and so on.. I dont know how to apply partition by on varchars.. or if you can suggest a way to improve retrieval speed than that will be also fine...so kindly help

ABhinav

Pikachu2000 · November 27, 2010

35k records is nothing for a database. There has to be another reason for the speed problems.

s0c0 · November 27, 2010

I assume the slow down is in LOOKUPS. Pikachu is correct, 35k or even 10x that amount is nothing for a database. Please post your create table syntax and an example of a typical record in there.

abhinavsingh89 · November 28, 2010

Table structure for table `words`

--

CREATE TABLE IF NOT EXISTS `words` (

`word` varchar(50) NOT NULL,

`tye` enum('0','1') NOT NULL,

PRIMARY KEY (`word`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Now i am adding words into that database from reading thousands of file and i have to analyze those words later also so that again reading from file and checking from the database.. All i want to do is create partitions on the basis of alphabets.. i.e partition a must contain words starting from a and so on..

And obviously the lookup is taking time....

<?php
require_once("connection.php");
if ($handle = opendir('pos')) {
    //echo "Directory handle: $handle\n";
    //echo "Files:\n";

    while (false !== ($file = readdir($handle))) {
        
	if(strcmp($file,".")==0 || strcmp($file,"..")==0)
	continue;

	echo "$file<br>";
$work_file=fopen("pos/$file",'r');
/*$fdata = fread($work, filesize("pos/$file"));
echo $fdata;
*/
while (!feof($work_file))
{

	set_time_limit(20);
	$c=fgets($work_file);
	//echo $c;
	$words = explode(" ",$c);

	//echo "<br><br>" ;
	$insword="";
	foreach ($words as $word)
	{
		if(strcmp($word,".")==0||strcmp($word,",")==0||strcmp($word,"(")==0||strcmp($word,")")==0 || strcmp($word,"\n")==0 || strcmp($word,"\r")==0 || strcmp($word,"\r\n")==0)
		continue;
		$word=addslashes($word);
		$query="Select * from stop_words where stop_word='$word'";
		$result=mysql_query($query) or die($query.mysql_error());
		if(mysql_num_rows($result)>=1) continue;

		$query="Select * from words where word='$word'";
		$result=mysql_query($query) or die($query.mysql_error());
		if(mysql_num_rows($result)>=1) continue;
		$query="insert into words values('$word','0')";
		$result=mysql_query($query);


	}

}



    }

    closedir($handle);
}
?>

PFMaBiSmAd · November 28, 2010

The code you posted can be made at least 3x more efficient (you can do this using a single (one) query) and the strcmp() on the $word variable is probably not doing what you expect.

To make your existing queries more efficient -

1) I recommend putting your stop words into your main words table and adding a column that 'flags' the stop words so that you can skip selecting them under normal processing. This will eliminate the first query.

2) You don't need to select a word to test if it exists, because you can simply attempt to insert the word and the database will prevent duplicates because your word column is a primary key. This will eliminate the second query.

Using items #1 and #2 above will reduce your foreach() loop to the following. The echo/var_dump statements are to show what you are inserting (pertains to what your strcmp() is doing or is not doing) and what the result of executing the query is. -

			foreach ($words as $word)
		{
			if(strcmp($word,".")==0||strcmp($word,",")==0||strcmp($word,"(")==0||strcmp($word,")")==0 || strcmp($word,"\n")==0 || strcmp($word,"\r")==0 || strcmp($word,"\r\n")==0)
				continue;
			echo "Insert: [",var_dump($word),"]";
			$word=mysql_real_escape_string($word);
			$query="insert into words values('$word','0')";
			var_dump(mysql_query($query));
			echo "<br />";
		}

Next, your strcmp() logic will only remove those characters you are testing for when those characters are separated by a space from any other characters or any word. If your intent is to filter out and remove those characters when they are immediately before/after a word, that logic won't do it. The dot, comma, (), and any \r or \n will be inserted with the word into the table when they are immediately before or after the word. Your existing logic also allows empty strings to be treated as a word because two or more spaces when exploded by the space character results in empty strings in the resulting array. See what the echo/var_dump in the code I posted shows for a bunch of test cases.

To fix the strcmp(), I recommend either using trim() with the second parameter set to a list of characters you want to remove before/after each word and make sure you skip inserting empty strings or more simply use str_word_count instead of explode() to extract just the words from each line.

If you are looping over thousands of files, your overall speed problem is more likely due to the disk access needed to read that number of files and the time it takes to read each line from each file. What is the maximum expected size of one of these files, because it would be more efficient to read one whole file at a time, get the unique words out of it, and then insert them all at once into your database table.

PFMaBiSmAd · November 29, 2010

Assuming that the maximum size of any of your files is small enough so that it can be read into memory all at once, that the maximum number of unique words in any file is small enough so that a multi-value insert query does not exceed the max_allowed_packet setting (default 1Mbyte), and that your tye column has a default value set to zero in your table definition, see the following sample code -

<?php
require_once("connection.php");
if ($handle = opendir('pos')){
    while (false !== ($file = readdir($handle))){
	if(strcmp($file,".")==0 || strcmp($file,"..")==0) continue; // skip the . and .. entries
	echo "$file<br />";
	$words = file_get_contents("pos/$file"); // read the whole file
	$words = str_word_count($words, 1); // get array of words
	$words = array_map('strtolower',$words); // convert all to lower case (so that duplicates can be found ignoring letter case)
	$words = array_unique($words); // remove any duplicates
	$words = array_map('mysql_real_escape_string',$words); // escape the data
	// form a multi-value query - INSERT INTO words (word) values (''),(''),('')
	$query = "INSERT IGNORE INTO words (word) values ('" . implode("'),('",$words) . "')";
	set_time_limit(20); // probably not needed
	echo $query; // show the finished query
	mysql_query($query); // execuite the query
	echo mysql_error() . mysql_affected_rows(); // show any errors and the number of words that were inserted
	echo "<br />";
}
closedir($handle);
}
?>

abhinavsingh89 · November 29, 2010

omg you are a genius.. great solution... the second one was best.. none of my file is more than 1mb and whole 36k records was inserted within few seconds only..... the strcmp problem was also spot on... now i am using str_replace to replace those \n \r immediately after words... also i have added my stop words in my main table only.... so thnx a lot for all the solutions...

Regards

Abhinav

PFMaBiSmAd · November 29, 2010

If you are using the version of the code with str_word_count(), it removes \r \n and you don't need to do anything else.

abhinavsingh89 · November 29, 2010

i need help again..

These earlier things were used to create database..

Now the main thing is this

<?php
require_once("connection.php");
$msg=$_POST['textarea'];
$words=explode(" ",$msg);
$hpos=0;
$pos=0;
$nf=0;
$stop=0;
foreach ($words as $word)
{

$query="Select * from words where word='$word'";
$result=mysql_query($query);
if($result)
{
	$row=mysql_fetch_array($result);
	if($row[1]==1 && $row[2]==0)
	$hpos++;
	else if($row[1]==0 && $row[2]==0)
	$pos++;
	else if($row[2]==1)
	$stop++;
	else $nf++;
}

}
echo "No of positives :$pos <br> No of highly positives : $hpos <br> No of Stop Words : $stop <br> No of Not Found : $nf";


?>

like i am taking a text as an input and verifying its words with the database.. now the lookup is again taking time.. can you suggest anything better??? i know you are a genius and will come up with something great

PFMaBiSmAd · November 29, 2010

It would help if you defined what columns $row[1] and $row[2] are, or better yet, use the associative index NAME so they have meaning (they don't seem to match your table definition, because your word column would be $row[0] or $row['word'] and your tye column would be $row[1] or $row['tye'].)

Also, an example of what you are entering in the textfield and what the expected results after executing the code should be would be worth a 1000 words.

abhinavsingh89 · November 29, 2010

my new table looks like this.... i changed it as per your suggestion and this one is better...

CREATE TABLE IF NOT EXISTS `words` (

`word` varchar(50) NOT NULL,

`type` enum('0','1') NOT NULL,

`stop_word` enum('0','1') NOT NULL,

PRIMARY KEY (`word`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

so the row[0] is for words, row[1] is for type and row[2] is for stop_words... script is error free...

now the problem is with speed.. like suppose the text area input is

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .

for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .

to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .

the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .

in other words , don't dismiss this film because of its source .

if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .

getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?

the ghetto in question is , of course , whitechapel in 1888 london's east end .

it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .

when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .

abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .

upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .

i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .

in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .

it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .

and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .

don't worry - it'll all make sense when you see it .

now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .

the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .

oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .

even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .

ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .

i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .

the film , however , is all good .

2 : 00 - r for strong violence/gore , sexuality , language and drug content "

The output is

No of positives :443

No of highly positives : 0

No of Stop Words : 335

No of Not Found : 0

but it is taking some time.. i want to make it more efficient so that it must not take much time.. can you suggest anything for this?? i mean mainly the lookup is taking time....

DavidAM · November 29, 2010

You are still exploding on space, where str_word_count() might give a better answer.

You are looking up several words multiple times. This is a waste of resources. You can use array_unique() to get the unique words or use array_count() to get the unique words AND the number of times each word is found. Then just lookup each word once:

$words=explode(" ",$msg);
// $words = str_word_count($msg, 1); // seems like a better way
$uWords = array_count($words);
$hpos=0;
$pos=0;
$nf=0;
$stop=0;
foreach ($uWords as $word => $count)
{
	// do your database lookup, but instead of incrementing the counters, add $count

	if($row[1]==1 && $row[2]==0)
	$hpos += $count;
	else if($row[1]==0 && $row[2]==0)
	$pos += $count;
	else if($row[2]==1)
	$stop += $count;
	else $nf += $count;
}

Also, after loading a large amount of data into the database, you might need to rebuild the indexes (I'm not sure if this is required with mySql Primary Keys or not, but it shouldn't hurt) see ANALYZE TABLE or OPTIMIZE TABLE in the mySql Documentation

Sign In

Partition , Clustering , Hashing

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Important Information