
Searching Lots of Large Files


Clinger

Recommended Posts

I have a project that involves searching Optical Character Recognition (OCR) output. Each scanned document will have a TXT file containing all the words in that document, and I need a way to search through all of these files quickly. There are going to be a lot of files, and each file is fairly large (2-3 MB each).

I am not looking for "coding help", but a better way of doing this.

Right now I can think of two options.

1.) Scan through all of the files directly on every search (a brute-force loop like the sketch below), which will take a very long time once there are a lot of files.
2.) Set up a cron job that calls a PHP script to index the words in each file and store them in a database. When a user searches, the search words are looked up in the database and the relevant files are pulled from there.

Anyone have any better ideas? The files will be uploaded via FTP, so I can't just automatically scan them when they're uploaded. I also have to make sure that I index all of each file's content; I can't just do a quick pass over the first few lines. The search has to be able to find content on any page of a document.
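Option 1, for reference, is essentially a brute-force loop like this (untested sketch; the directory name and search term are just placeholders). It works, but it re-reads every 2-3 MB file on every query, which is why it won't scale:

[code]<?php
// Option 1 sketch: brute-force scan of every TXT file on every search.
// $dir and $term are placeholders.
$dir  = "ocr_files";
$term = "example phrase";

$matches = array();
foreach ((array) glob("{$dir}/*.txt") as $path)
{
    // Read the whole file and do a case-insensitive substring search.
    $text = file_get_contents($path);
    if (stripos($text, $term) !== false)
    {
        $matches[] = basename($path);
    }
}

print_r($matches);
?>[/code]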

Well, PHP is the wrong technology for that, but if you must...

I'd suggest scanning each file at upload and storing what you can in a database; user searches are then done against the DB, not the native file system. Portable, faster indexing, etc.

It'd be very difficult to write something as efficient as an RDBMS, even in C.

[quote author=phpPunk link=topic=117891.msg481283#msg481283 date=1165606196]
Well, PHP is the wrong technology for that, but if you must...

I'd suggest scanning each file at upload and storing what you can in a database; user searches are then done against the DB, not the native file system. Portable, faster indexing, etc.

It'd be very difficult to write something as efficient as an RDBMS, even in C.
[/quote]
Agreed. This is more of a Perl/Python type problem. Loading it into a database is really the only way it's going to be feasibly searchable. Depending on the layout you're working with, you could set up PHP to do a sort of "link graph" on each file and put keywords generated from that into the database, but that would require pulling out keywords, which wouldn't make the documents entirely searchable. I'm not sure you want to store 2-3 MB of text in a database either; sure, it can be done, but I wonder how much of a performance hit you'll take when you do things like:
[code]SELECT * FROM files WHERE text LIKE '%Your search here%';[/code]Databases are good at handling relations, not so good at analyzing text.

If you were to keyword them instead, you could shrink your database entries down to keywords only, with numeric links to the files. This, in my opinion, would be a much more PHP-friendly solution. You might have to strike a balance between total searchability and performance.
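A keyword-only index could boil down to a keywords table plus a numeric link table between keywords and files. Roughly like this (untested; all table and column names below are just placeholders, and it uses the same old mysql_* calls as the rest of the thread, assuming a connection is already open):

[code]<?php
// Hypothetical schema for a keyword-only index:
//   files:         file_id (INT auto increment), filename (VARCHAR)
//   keywords:      keyword_id (INT auto increment), keyword (VARCHAR, unique)
//   file_keywords: keyword_id (INT), file_id (INT)
//
// A search then joins the small keyword tables instead of LIKE-scanning megabytes of text.
$term = mysql_real_escape_string(strtolower(trim($_GET['q'])));

$sql = "SELECT f.filename
        FROM keywords k
        JOIN file_keywords fk ON fk.keyword_id = k.keyword_id
        JOIN files f          ON f.file_id     = fk.file_id
        WHERE k.keyword = '{$term}'";

$result = mysql_query($sql);
while ($row = mysql_fetch_assoc($result))
{
    echo $row['filename'] . "<br>";
}
?>[/code]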

I was actually thinking that the cron could run a script like this. It seems to be able to scan the files relatively fast. Basically, it identifies each distinct word and stores it in the database along with the file ID and line numbers. I figure this will create a large database eventually, but if I split it into separate tables alphabetically (see the search sketch after the code), I think it might be OK. What do you guys think? Would it destroy the database?

[code]<?php
// Directory holding the uploaded OCR text files; the path and DB credentials are placeholders.
$dir = "00";

$id_link = mysql_connect("*****", "******", "******");
mysql_select_db("******");

$handle = opendir($dir);

while (false !== ($file = readdir($handle)))
{
    if ($file == "." || $file == "..")
    {
        continue;
    }

    // Skip files that have already been indexed.
    $safe_file = mysql_real_escape_string($file);
    $sql = "SELECT * FROM `files` WHERE `filename` = '{$safe_file}'";
    if (mysql_num_rows(mysql_query($sql)) > 0)
    {
        continue;
    }

    echo $file . "<hr>";

    // Per-file state: word => comma-separated list of line numbers where it appears.
    $linex     = 0;
    $all_words = array();

    $conts = file_get_contents("{$dir}/{$file}");
    $lines = explode("\n", $conts);
    foreach ($lines as $line)
    {
        $linex++;
        $line  = trim($line);
        $words = explode(" ", $line);

        foreach ($words as $word)
        {
            // Strip OCR markers like <<...>> and common punctuation, then normalise case.
            $word = preg_replace("@<<[^>]+>>@i", "", $word);
            $word = preg_replace("@[._(),*$!?'\[\]\"]+@", "", $word);
            $word = trim($word);
            if ($word === "")
            {
                continue;
            }
            $word = strtolower($word);

            if (!isset($all_words[$word]))
            {
                $all_words[$word] = "{$linex}";
            }
            else
            {
                $all_words[$word] .= ",{$linex}";
            }
        }
    }

    // Register the file and grab its numeric ID.
    mysql_query("INSERT INTO `files` (`filename`) VALUES ('{$safe_file}')");
    $fileID = mysql_insert_id();

    // Build one INSERT row per new word and one UPDATE per word already in the index.
    $inserts = array();
    $updates = array();

    ksort($all_words);
    foreach ($all_words as $key => $line_list)
    {
        $sql = "SELECT * FROM `words` WHERE `WORD` = '{$key}'";
        if (mysql_num_rows(mysql_query($sql)) > 0)
        {
            $updates[] = "UPDATE `words` SET `MATCHES` = CONCAT(`MATCHES`, '{$fileID},{$line_list};') WHERE `WORD` = '{$key}'";
        }
        else
        {
            $inserts[] = "('{$key}', '{$fileID},{$line_list};')";
        }
    }

    if (count($inserts) > 0)
    {
        // One multi-row INSERT for all of this file's new words.
        $insert = "INSERT INTO `words` (`WORD`, `MATCHES`) VALUES " . implode(",", $inserts);
        if (!mysql_query($insert))
        {
            echo mysql_error() . "<hr>{$insert}";
        }
    }

    // Append this file's matches to words that already exist in the index.
    foreach ($updates as $update)
    {
        if (!mysql_query($update))
        {
            echo mysql_error() . "<hr>{$update}";
        }
    }
}

closedir($handle);
mysql_close($id_link);
?>[/code]
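The search side of that index, including the alphabetical table split, could look roughly like this (untested sketch; it assumes the same `files`/`words` layout as the script above, with the `words` table split into `words_a`, `words_b`, ... plus a `words_other` catch-all, and it guesses `id` as the primary-key column of `files` -- all of those names are placeholders):

[code]<?php
// Hypothetical search against the index built above; assumes a mysql_connect() connection is open.
// MATCHES is stored as "fileID,line,line,...;fileID,line,...;" per word.
$term = mysql_real_escape_string(strtolower(trim($_GET['q'])));

// Alphabetical split: pick the table by the word's first letter.
$first = substr($term, 0, 1);
$table = ($first >= 'a' && $first <= 'z') ? "words_{$first}" : "words_other";

$result = mysql_query("SELECT `MATCHES` FROM `{$table}` WHERE `WORD` = '{$term}'");
if ($result && mysql_num_rows($result) > 0)
{
    $row = mysql_fetch_assoc($result);

    // Each ';'-separated chunk is one file: "fileID,line,line,...".
    foreach (explode(";", rtrim($row['MATCHES'], ";")) as $chunk)
    {
        $parts  = explode(",", $chunk);
        $fileID = (int) array_shift($parts);

        $fres = mysql_query("SELECT `filename` FROM `files` WHERE `id` = {$fileID}");
        $file = mysql_fetch_assoc($fres);
        echo $file['filename'] . " (lines " . implode(", ", $parts) . ")<br>";
    }
}
else
{
    echo "No matches.";
}
?>[/code]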

Sure, looks like it would work fine. Your database will get huge fast, though, dealing with 2-3 MB text files (not really sure how many words that would be; I'm guessing about 100,000 to 500,000 per file, does that sound about right?). It seems a little overly verbose to me. I still wouldn't store every single word; it might also make phrase-based searching behave weirdly. I'd keep the option of a keyword-based database setup open. How big is this going to scale? Are we talking 50 2-3 MB files, or 5,000? I think that level of detail would "snowball" on you pretty quickly. But you know what, it looks like you have a working script: try it out! Throw 10 or so of your files (whatever would make a good sample) through it into a test database and see if you like the results. With programming there are always a half-dozen ways to solve a problem; you just have to pick the one that works for you!

That being said, if it were me, I'd take the easy way out and just work out a way to pull some keywords out of these files (on the order of 100 keywords or so, depending on the content) and store those. You sacrifice some completeness, but I think you get that back, and a little more, by having more manageable databases and maybe a cleaner API when you want to build on it later. The only real "problem" with your approach (depending on who you ask; this is just my opinion) is that you're going to get all the "fluff" along with the "filler". The whole point of indexing is to pull out the important elements so they can stand in for the rest of the content. If you pull out every single word, you'll have a database full of "the, it, to, at, a, he, she..." You've got a great foundation there with that script, and it sounds like you know what you're doing and what you want, so give it a shot! If it doesn't work, no biggie: refine it and try again!
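For what it's worth, filtering those common words out before indexing only takes a few lines. Something like this (the stop-word list is just a starting point; extend it to taste), dropped into the inner word loop of your script:

[code]<?php
// Minimal stop-word filter: skip very common and very short words before indexing.
$stop_words = array(
    "the", "it", "to", "at", "a", "he", "she", "and", "of",
    "in", "is", "was", "for", "on", "that", "with", "as", "by"
);
$stop_words = array_flip($stop_words); // flip for fast isset() lookups

function keep_word($word, $stop_words)
{
    // Drop stop words and one/two-letter fragments left over from OCR noise.
    return strlen($word) > 2 && !isset($stop_words[$word]);
}

// Example use inside the indexing loop, after $word has been cleaned and lowercased:
// if (!keep_word($word, $stop_words)) { continue; }
?>[/code]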

Good luck!
