
Searching through multiple file types


Icebergness


Hi,

 

I currently have a 'Research' page on my intranet site. It works by displaying a list of files based on the date selected.

 

The files are stored in a folder hierarchy on the same server, for example:

 

2012 <-Year

--01 <-Month

----01 <-Day

----02 <-Day
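A minimal sketch of how a selected date could be mapped onto this Year/Month/Day layout. The function names and the `$baseDir` root are hypothetical; the existing 'Research' page may well do this differently.

```php
<?php
// Hypothetical helper: turn a date string into the folder for that day.
// $baseDir is an assumed root that contains the year folders.
function research_dir_for_date($baseDir, $date)
{
    $ts = strtotime($date);
    if ($ts === false) {
        return null; // the date could not be parsed
    }
    // date('Y/m/d') gives zero-padded parts such as 2012/01/02
    return rtrim($baseDir, '/\\') . '/' . date('Y/m/d', $ts);
}

// Hypothetical helper: list the regular files dropped into one day's folder.
function list_research_files($baseDir, $date)
{
    $dir = research_dir_for_date($baseDir, $date);
    if ($dir === null || !is_dir($dir)) {
        return array();
    }
    $files = array();
    foreach (scandir($dir) as $name) {
        if (is_file($dir . '/' . $name)) {
            $files[] = $name;
        }
    }
    return $files;
}
```

Forward slashes are used throughout because PHP accepts them on Windows as well as on Unix-like systems.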

 

The folders contain a multitude of files dropped in by other members of staff; 99% of the time these are .msg, .doc and .pdf files. What I want to do is create a textbox that lets the user search through the files (the file contents, not just the file names). So far, the best thing I have found for this is the following code:

 

<?php

/**
*	powered by @cafewebmaster.com
*	free for private use
*	please support us with donations
*/



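define("SLASH", stristr($_SERVER['SERVER_SOFTWARE'], "win") ? "\\" : "/");

// Quote the array keys and fall back to safe defaults when the form has not been submitted
$path	= !empty($_POST['path']) ? $_POST['path'] : dirname(__FILE__) ;
$q		= isset($_POST['q']) ? $_POST['q'] : "";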



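function php_grep($q, $path){

	$ret = ""; // start with an empty result so .= has something to append to
	$fp = opendir($path);
	if($fp === false) return $ret; // directory could not be opened
	while(($f = readdir($fp)) !== false){
		if( preg_match("#^\.+$#", $f) ) continue; // skip the "." and ".." entries
		$file_full_path = $path.SLASH.$f;
		if(is_dir($file_full_path)) {
			$ret .= php_grep($q, $file_full_path);
		} else if( stripos(file_get_contents($file_full_path), $q) !== false ) {
			$ret .= "$file_full_path\n";
		}
	}
	closedir($fp);
	return $ret;
}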


$results = ""; // defined up front so the output below never sees an unset variable
if($q){
	$results = php_grep($q, $path);
}



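// Escape the values before echoing them back into the page
$path    = htmlspecialchars((string) $path);
$q       = htmlspecialchars((string) $q);
$results = htmlspecialchars(isset($results) ? $results : "");

echo <<<HRD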

<pre >
<form method=post>
	<input name=path size=100 value="$path" /> Path 
	<input name=q size=100 value="$q" /> Query
	<input type=submit>
</form>

	$results

</pre >

HRD;

?>

 

This does a grep-style search in PHP, which works well, albeit slowly. However, it doesn't find matches inside PDFs.

 

I have contemplated several solutions, including finding a way to convert all the files into text files or load them into a MySQL database, but I haven't found anything useful. I think the answer is going to be no, but does anybody know of, or has anybody successfully implemented, a similar system?
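For illustration, the "convert everything to text files" idea could look roughly like the sketch below: each file's text is extracted once and stored in a cache folder, so later searches only read plain text. `cached_text`, `$cacheDir` and the `$extract` callback are hypothetical names, and the extraction step itself is left to whatever tool understands the format.

```php
<?php
// Hypothetical cache: keeps one .txt copy of each source file's text in
// $cacheDir and only re-extracts when the source file is newer than the copy.
// $extract is any callable that takes a path and returns the file's text.
function cached_text($file, $cacheDir, $extract)
{
    $cacheFile = rtrim($cacheDir, '/\\') . '/' . md5($file) . '.txt';

    // Reuse the cached copy if it is at least as new as the source file
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($file)) {
        return file_get_contents($cacheFile);
    }

    $text = call_user_func($extract, $file);
    if (!is_string($text)) {
        return ''; // extraction failed, so nothing is cached
    }
    file_put_contents($cacheFile, $text);
    return $text;
}
```

The same approach would carry over to a MySQL table keyed by file path, at the cost of keeping the table in sync when staff add or replace files.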

 

Big big thanks in advance if you can help in any way!

Dave


The PDF file format is not plain-text markup, so opening the file raw and searching it isn't going to give you reliable results.  You'll need to interpret the file with something that understands the format.  I needed to do something similar and found that Sphider uses xpdf and catdoc for PDFs and DOCs respectively.  xpdf ships a couple of utility programs, pdfinfo and pdftotext, which you can use to extract the metadata and text, which you can then search.
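A rough sketch of that approach in PHP, assuming pdftotext (from xpdf) and catdoc are installed and on the PATH. The helper names `extractor_command` and `file_contains` are made up for illustration, and this is not how Sphider itself is written.

```php
<?php
// Build a shell command that prints a file's text to stdout, or return null
// when the file can simply be read as-is.
// Assumption: pdftotext and catdoc are available on the PATH.
function extractor_command($file)
{
    $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
    $arg = escapeshellarg($file); // guard against odd characters in file names
    switch ($ext) {
        case 'pdf':
            return "pdftotext $arg -"; // "-" sends the extracted text to stdout
        case 'doc':
            return "catdoc $arg";      // catdoc prints to stdout by default
        default:
            return null;               // no converter known, read the file raw
    }
}

// Case-insensitive search of a file's text, converting PDFs and DOCs first.
function file_contains($file, $needle)
{
    $cmd  = extractor_command($file);
    $text = ($cmd === null) ? file_get_contents($file) : shell_exec($cmd);
    // shell_exec() and file_get_contents() do not return a string on failure
    return is_string($text) && $needle !== '' && stripos($text, $needle) !== false;
}
```

In the posted php_grep() function, a call such as `file_contains($file_full_path, $q)` could stand in for the raw `file_get_contents()` check. The .msg files are not handled here and fall through to a raw read, which may miss matches because that format is binary.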

  • 2 weeks later...

 

Sorry, I forgot to check this post as I've been working on other projects.

 

I had previously looked at xpdf, but hadn't gotten the results I was looking for. I'll give it a try with Sphider and let you know how I get on. Thanks for the suggestion :)

 

Dave
