Jump to content

Filtering Script Finished! BUT very slow other ways of doing this?


Recommended Posts

Hello Everyone,

 

I am looking for advice on how to be able to filter out over 3milion records a which will happen once a month.  right now I have a scripte that does the following:

 

gets total record count in table,

gets first record from table,

checks to see if its in the block table

  if yes it delete it from the rawdata table

  if no then it checks to see if meets a number of WHERE cluase statements (this will grow over time)

    if no match it adds it to the block table

    if match and is already in the query table it needs to update / replace the record in the query table

    if match it adds it to the query table

does this into the table is finished

 

ok now here is the file I have that does this:

<?php session_start(); ?>

<head>
<style type="text/css">
body {background-color:black;color:yellow;}
</style>
</head>
<body>

<?php

ini_set('display_errors', 'ON');
ini_set('error_reporting', E_ALL | E_STRICT );

if(isset($_SESSION[ "user_id" ]) & isset($_SESSION[ "user_access_level" ])) 
{
include '/home/dev/www/lib/db_config_cr-dev.php';
include '/home/dev/www/lib/db_conn-select.php';

// Counts record in table
$query_s = "SELECT COUNT(*) FROM irs_rawdata";
$result_s = mysql_query($query_s) OR die("Sorry, unable to count the records in the table!");
$result_c = mysql_result($result_s, 0);
$record_count = $result_c;

echo "There is ". $record_count ." records in the irs_rawdata table<br /><br />";

	// checks to see if sesult_c is greater then zero
	// if so then it runs script
	// otherwise it says there is no more records to process
	for($i=1; $i <= $result_c; $i++)
	{
	//used to change $i to a meanful var name
	$record_id = $i;

	// selects and limits it to the first row only and retrives the EIN then return a 1 or 0
	$query_s = "SELECT EIN FROM irs_rawdata LIMIT 0,1";
	$result_s = mysql_query($query_s) OR die("Sorry, unable to to select EIN for the first record! <br /> ". mysql_error());
	$record = mysql_fetch_assoc($result_s);
	$count = mysql_num_rows($result_s);

		if($count)
		{
		// changes long var to short var
		$fin = $record['EIN'];

		// gets record if there is one in block returns 1 or 0
		echo "We are now processing record ". $record_id. " with Fed ID #". $fin ."!<br />";
		$query_s = "SELECT fin FROM block WHERE fin = '$fin'";
		$result_s = mysql_query($query_s) OR die("Sorry, unable to select record: " . mysql_error());
		$record = mysql_fetch_assoc($result_s);
		$count = mysql_num_rows($result_s);

			if ($count == 0)
			{
			echo "There was no record found in the \"block\" table!<br />";
			echo "We are now checking if record ". $record_id ." matchs the required criteria!<br />";
			$query_is = "INSERT INTO query SELECT * FROM irs_rawdata 
			WHERE EIN = '$fin' AND 
			(
			PNO LIKE '%blind%' OR 
			PNO LIKE '%deaf%' OR 
			PNO LIKE '%braille%' OR 
			PNO LIKE '%sight%' OR
			PNO LIKE '%vision%' OR
			PNO LIKE '%sign%' OR 
			PNO LIKE '%ear%' OR
			PNO LIKE '%eye%' OR
			PNO LIKE '%dog%' OR
			Activity_Code LIKE '%031%' OR
			NTEE_Code LIKE '%G41%' OR
			NTEE_Code LIKE '%G42%' OR
			NTEE_Code LIKE '%H41%' OR
			NTEE_Code LIKE '%H42%' OR
			NTEE_Code LIKE '%P86%' OR
			NTEE_Code LIKE '%P87%'

			)";
			$result_is = mysql_query($query_is);
			$count = mysql_affected_rows();

				if($count)
				{
				echo "We have successfully added the record to the \"query\" table, it match the criteria!<br />";
				echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
				$query_d = "DELETE FROM irs_rawdata WHERE EIN = '$fin'";
				$result_d = mysql_query($query_d);
				$count = mysql_affected_rows();

					if($count)
					{
					echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
					}
				}
				else
				{
				echo "We was unable to match the record ". $record_id ." to the criteria!<br />";
				echo "We are now adding record ". $record_id . " to \"block\" table!<br />";
				$query_i = "INSERT INTO block (fin, who_blocked) VALUES ('$fin', 'S-123456789')";
				$result_i = mysql_query($query_i) OR die("Sorry was unable to insert ".$fin." into the database table allowed! <br />" . mysql_error());
				$count = mysql_affected_rows();

						if($count)
						{
						echo "We have successfully added the record to the \"block\" table!<br />";
						echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
						$query_d = "DELETE FROM irs_rawdata WHERE EIN ='$fin'";
						$result_d = mysql_query($query_d) OR die("Was unable to delete ".$fin." from the db table irs_rawdata! <br />" . mysql_error());
						$count = mysql_affected_rows();

							if($count)
							{
							echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
							}
						}
						else
						{
						echo "We was unable to successfully added the record to the \"block\" table!<br />";
						}
				}

			}
			else
			{
			echo "There was one record found in the \"block\" table!<br />";
			echo "We are now deleting record ". $record_id . " from \"irs_rawdata\" table!<br />";
			$query_d = "DELETE FROM irs_rawdata WHERE EIN = '$fin'";
			$result_d = mysql_query($query_d) OR die("Was unable delete ".$fin." from the db table irs_rawdata! <br />" . mysql_error());
			$count = mysql_affected_rows();

				if($count)
				{
				echo "We have successfully deleted the record from the \"irs_rawdata\" table!<br /><br />";
				}
			}
		}
	}
	echo "The computer has finished processing the records in irs_rawdata!";

include_once ('/home/dev/www/lib/db_close.php');
}
else
{
echo "You are not allowed on this page!";
}
?>

 

PLEASE I need advice on this,  can mysql do the same job with out needing to use php to do the if stuff it is slow and I am only processing right this sec of the post 607,000 and its slow I can not even get my other web pages to load because of this.

 

Sincerely,

Christopher

A few things:

 

Is this a continuous database? Meaning, once you check all 3M records for this month, do you check the entire database again next month with all new additions? If this is the case, I would suggest some sort of archiving or checking the database from the last record (or maybe roll back a few thousand records) the next time to cut down the total of records that need to be processed.

 

Are you using indexes properly? This could help speed things up if they are not being used properly.

 

Try the EXPLAIN syntax on your queries to see if you can tweak them at all. There are a few good tutorials if you search google related to EXPLAIN.

Sorry another thing I just thought of as well:

 

Can any of the functions that are performed on a monthly database be performed at the time of their initial insert?

 

You could break this up into several tables and perform some of these functions on different tables at other times. This will also cut down on overall time for such a large table if they are smaller.

ok, every month the irs updates the files, so yes it is every month that this would be done,

 

next once the files are upload to a table "irs_rawdata"

then we have to go through and make sure there is no NEW records that would match the rules of the WHERE clause,

instead of having it run that query on every record again, we have a table called block which hold all the record EIN that don't match the rules of the WHERE clause,  so it checks this to see if its in there first, if its not THEN it needs to be checked for the WHERE clause  if it matches then it goes into query if it does not match it gets added to the block so next time the process will be a little faster :-) (hopful)  if the record is in the query table already then replace it with the new one no matter what.  now the fun part comes in, now a human logs in and has to start to sort out and update all the missing information for the records that are good and the ones that the computer passed but really don't match need to be added to the block table, so then the next time we update the records from the irs its going to be much faster as its not going to be adding much at all to the query table, its mostly going to be adding / updating what is not in the block table which is a small % of the 3m

 

either way its still a very slow process to do through a web page script :-(

 

I hope I answer the needed questions

 

Thanks for the reply and help!

 

Sincerely,

Christopher

the files from the irs are either 50ish or 6 or 4 

 

this files are uploaded to the table irs_rawdata with LOAD DATA LOCAL INFILE ...

if there was a effective way of uploading the data while doing the processing of the files then that would cut the processing down by some,  like read row 1 of file, then if match add to query table, if not match add EIN (col 1) to Block table  from what I knew (which is not alot) the LOAD DATA is the fastest way to do this stuff but it does not allow the filtering of records to go to different places and have one place only allow the ein and the other all the fields

 

Thanks again for your respones I am grateful, this is a project for a non-profit (small one)

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.