Si14 Posted March 27, 2013

Hi all, I need some code that downloads all the PDF files from a URL (e.g. www.myurl.com). I want to run this code on my localhost (WAMP). Thank you for your time.
exeTrix Posted March 27, 2013

There are a couple of ways to approach this. If you're dealing with multiple PDF files, and it always will be multiple, you can concatenate the documents into one. Otherwise, you can produce a zip file for the user to download; I'd say this is the most common solution utilised in the real world. Because HTTP is stateless and follows a request-response flow, it isn't possible to answer a single HTTP request with multiple responses, so you'll notice that both of the above solutions produce only one file for download. Further reading on the zip approach can be found in the ZipArchive docs here: http://www.php.net/manual/en/class.ziparchive.php

Any problems then let me know and I'll be happy to provide further assistance.
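For example, the zip route boils down to something like this. It's only a rough sketch: the paths in $files are placeholders, and it assumes the PDFs already exist on the server before they're bundled into the archive:

<?php

//a rough sketch of the zip approach; the paths in $files are placeholders
$files   = array( '/path/to/first.pdf', '/path/to/second.pdf' );
$zipPath = sys_get_temp_dir() . '/documents.zip';

$zip = new ZipArchive();
if( $zip->open( $zipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE ) === true ){
    foreach( $files as $file ){
        //add each PDF under its base name so the archive has a flat structure
        $zip->addFile( $file, basename( $file ) );
    }
    $zip->close();
}

//send the single archive as the one response to the HTTP request
header( 'Content-Type: application/zip' );
header( 'Content-Disposition: attachment; filename="documents.zip"' );
header( 'Content-Length: ' . filesize( $zipPath ) );
readfile( $zipPath );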
Si14 (Author) Posted March 27, 2013

Thank you for your reply, exeTrix. I think I did not express the question clearly. I want to download all the PDF files of a website, similar to what download managers do. You may ask why I am not using a download manager, and the response would be that I want to customise the code later. At the moment, the basic thing it needs to do is download all the PDF files from one (or multiple) URLs which I provide, and then store them in separate directories on my hard drive (one directory per URL). In order to run this code, I assume I should use a local server stack, e.g. WAMP? Please let me know if you have any suggestions.
exeTrix Posted March 27, 2013

Ah, sorry, I must have misunderstood. Yes, you will need WAMP set up on your machine, then this should do what you need it to:

<?php

//define our array of files we'd like to get, with the dir names as keys
$pdfs = array(
    'folder1' => 'http://www.bbc.co.uk/bbctrust/assets/files/pdf/about/how_we_govern/charter.pdf',
    'folder2' => 'http://www.bbc.co.uk/radio4/today/reports/pdf/camera_gifford.pdf'
);

try{
    //loop through the files stored in the pdfs array
    foreach( $pdfs as $key => $pdf ){
        //split the URL on /
        $urlParts = explode( '/', $pdf );
        //the last segment is our file name, e.g. charter.pdf
        $fileName = end( $urlParts );

        //get the contents of the remote file
        $fileContents = file_get_contents( $pdf );

        //build a path to our directory
        $directory = $_SERVER['DOCUMENT_ROOT'] . '/' . $key . '/';

        //create the directory if it doesn't exist
        if( !is_dir( $directory ) ){
            mkdir( $directory );
        }

        //create a file object for the contents to be written to
        $fileObject = new SplFileObject( $directory . $fileName, 'a+' );
        //write the contents to the file
        $fileObject->fwrite( $fileContents );

        //clean up by removing the contents
        unset( $fileContents );
    }
}catch( Exception $e ){
    echo $e->getMessage();
}

Any problems then give us a shout.
Si14 (Author) Posted March 28, 2013

Thanks for your reply and your help. Instead of the direct PDF links, is it possible to give it the link of a page and have it detect all the PDF files on that page automatically?
exeTrix Posted March 28, 2013

OK, there are a couple of possibilities here. You could download the contents of the page using file_get_contents and use a regular expression to match all of the URLs, then iterate over the matches; essentially, you'd be scraping the page for PDF links. Or you could load the page into DOMDocument and use that to find all the links, then iterate over them to pick out the PDFs (for example with a RegexIterator). If you were to use DOMDocument you'd need the page to be valid HTML, so I'd suggest using regex, it'll be easier. I'm sure there are loads of articles on the web relating to this or something similar, so I'm not going to reinvent the wheel and code it for you. Have a bash, and if you run into any problems post back here and somebody will certainly give you a helping hand.
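As a starting point, the regex route might look something like this. It's only a sketch: the page URL is a placeholder, it assumes the links end in .pdf, and relative links are resolved very crudely against the page URL:

<?php

//a rough sketch of scraping a page for PDF links; $pageUrl is a placeholder
$pageUrl = 'http://www.example.com/reports/';
$html    = file_get_contents( $pageUrl );

//match href attributes that point to .pdf files
preg_match_all( '/href=["\']([^"\']+\.pdf)["\']/i', $html, $matches );

foreach( array_unique( $matches[1] ) as $pdfUrl ){
    //crudely resolve relative links against the page URL (assumes the same host)
    if( strpos( $pdfUrl, 'http' ) !== 0 ){
        $pdfUrl = rtrim( $pageUrl, '/' ) . '/' . ltrim( $pdfUrl, '/' );
    }

    //each $pdfUrl can now be fed into the download loop from the earlier post
    echo $pdfUrl . PHP_EOL;
}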