
PHP and Common Crawl on AWS (Amazon Web Services)


wwallace


So I really enjoyed the tutorial on this web site: http://www.fightswithbytes.com/2013/04/05/sample-wordcount-streaming-job-using-php-on-commoncrawl-dataset/

 

But I felt the tutorial is missing a lot in terms of functionality. Specifically, with some modification to the code I was able to load URLs from the Common Crawl dataset, but I was unable to load actual web page contents or any other metadata.

 

I need a PHP script that can load the full contents of Common Crawl pages one at a time, so that I can run some sort of analysis on each individual page.

 

How would I go about doing this?


If you have a URL, you can use something as simple as file_get_contents() to get page content. There are some caveats: the PHP configuration must be set to allow this (see allow_url_fopen, I think).
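
Something along these lines (the URL is just a placeholder, and it assumes allow_url_fopen is enabled):

<?php
// Minimal sketch: fetch a page with file_get_contents().
// Requires allow_url_fopen = On; the URL is only a placeholder.
$url  = 'http://example.com/';
$html = file_get_contents($url);

if ($html === false) {
    fwrite(STDERR, "Failed to fetch $url\n");
} else {
    echo strlen($html) . " bytes fetched\n";
}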

 

The next option is cURL, which is often used for this sort of thing.
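
A rough sketch with the cURL extension (assumes php-curl is installed; the URL is a placeholder):

<?php
// Minimal sketch: fetch a page with the cURL extension.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$html = curl_exec($ch);

if ($html === false) {
    fwrite(STDERR, 'cURL error: ' . curl_error($ch) . "\n");
}
curl_close($ch);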

 

Depending on your environment, there may be external programs that could be leveraged for this (for example, many *nix environments have the lynx browser installed, which can be called with "-dump" to give you a plain-text dump of a page).
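
For example, something like this (assuming lynx is installed and shell_exec() isn't disabled):

<?php
// Minimal sketch: call lynx from PHP; the URL is a placeholder.
$url  = 'http://example.com/';
$text = shell_exec('lynx -dump ' . escapeshellarg($url));
echo $text;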

 

Otherwise you're probably reduced to writing something yourself using socket functions.
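
A bare-bones sketch with fsockopen(), just to show the idea (host and path are placeholders; it doesn't handle HTTPS, redirects, or chunked responses):

<?php
// Minimal sketch: raw HTTP GET over a socket with fsockopen().
$fp = fsockopen('example.com', 80, $errno, $errstr, 30);
if (!$fp) {
    die("Socket error: $errstr ($errno)\n");
}
fwrite($fp, "GET / HTTP/1.0\r\nHost: example.com\r\nConnection: close\r\n\r\n");
while (!feof($fp)) {
    echo fgets($fp, 1024); // headers and body, printed as they arrive
}
fclose($fp);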

 

HTH,


Did you actually manage to make this work?

 

I've tried numerous methods, including some similar to what you're suggesting, but the Common Crawl data does not seem to work like a normal s3:// bucket. A few things that I suspect may complicate matters:

 

 - The Common Crawl dataset is over one hundred terabytes in size, which is why I want to read it line by line as a stream (see the sketch after this list)

 - Common Crawl is stored on an HDFS filesystem

 - Working in Hadoop is very tricky, because of its mapper/reducer, key-value-pair architecture
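
For reference, here's a rough sketch of the kind of line-by-line streaming read I've been attempting (the URL below is a made-up placeholder, it requires allow_url_fopen, and the real crawl files are gzipped, so they'd need decompressing on top of this):

<?php
// Sketch: open a remote file as a stream and read it line by line.
$url = 'http://example-bucket.s3.amazonaws.com/path/to/some-file.txt'; // placeholder
$fh = fopen($url, 'r');
if ($fh === false) {
    die("Could not open stream\n");
}
$count = 0;
while (($line = fgets($fh)) !== false) {
    // ... analyse $line here ...
    if (++$count >= 100) {
        break; // stop early while experimenting
    }
}
fclose($fh);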

 

My preference would be to work on the Common Crawl data without using Hadoop if possible.

 

Unfortunately, there is almost no documentation for working with Common Crawl using PHP.

 

EDIT: I would pay money for someone to solve this problem for me.


A quick update on my situation for reference to anyone else with the same issue:

 

Amazon EMR with PHP is simply Hadoop streaming. Although documentation is sparse for EMR with PHP, there is a lot of documentation for streaming Hadoop with PHP. That said, the key-value-pair, mapper-reducer nature of Hadoop still complicates things significantly for someone who is new to big data.

 

Additionally, the structure of Common Crawl is outlined here.

 

I'm not even close to having my problem solved yet, but I'll keep working at it and post updates for anyone who might be interested. In the meantime, please feel free to post your own solutions if you think you might be able to help out.


So I've completed my task, and I wanted to add some tips and documentation in case anyone else has this problem using PHP and Hadoop together.

 

Here is my mapper.php.
(There's no need for a reducer.php. I don't specify the file I'm opening because it's fed to STDIN as part of the input stream when I create a task in my EMR cluster.)

#!/usr/bin/php
<?php

// Mapper-only streaming script: echo back roughly the first 100 lines read from STDIN.
$word2count = array(); // unused in this mapper-only version
$counter = 0;
$closeit = false;

while (($closeit == false) && (($line = fgets(STDIN)) !== false)) {
    $counter++;
    $line = strtolower(trim($line));
    echo "$line\n";
    if ($counter > 100) {
        $closeit = true; // stop early so the job doesn't stream the whole (huge) input
    }
}

?>

Amazon EMR with PHP scripts is just Hadoop streaming. There is very little documentation for PHP on EMR, but there is tons of documentation for Hadoop streaming.

 

You can run PHP scripts using Hadoop streaming, reading input from STDIN. In a typical Hadoop job, everything your mapper writes to STDOUT is treated as a key-value pair. This makes debugging more complex and can cause problems for beginners.
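
For illustration, here's a rough sketch of a conventional streaming mapper that emits tab-separated key-value pairs (it's the classic word count, not my actual job):

#!/usr/bin/php
<?php
// Rough sketch of a conventional streaming mapper (classic word count).
// Hadoop streaming treats each line written to STDOUT as key<TAB>value.
while (($line = fgets(STDIN)) !== false) {
    $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo $word . "\t1\n"; // key = the word, value = 1
    }
}
?>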

 

Because of this, I'd suggest running a "mapper-only" job if you're just beginning to learn Hadoop. You can do this in Hadoop streaming/EMR by specifying "NONE" as the location of your reducer script. When you do, everything you write to STDOUT is written directly to the output files instead of going through the typical mapper/reducer process, and you can produce just about any output you want without worrying about it causing mapper-reducer conflicts or anything like that.

 

In other words, a mapper-only Hadoop job becomes a "cluster computing" job, where many computers split up and share the work. 

 

Mapper-only jobs are a great way to get started with Hadoop, without having to be a data scientist. 

 

But one word of caution: mapper-only jobs can generate A LOT of output data very quickly. So make sure to put some controls in your PHP script to prevent accidentally racking up hundreds of dollars in storage, bandwidth, and processing costs on Amazon.
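
As a rough illustration, the kind of control I mean is a simple cap like this (the limits are arbitrary):

#!/usr/bin/php
<?php
// Sketch of a safety cap inside a mapper: stop after an arbitrary number
// of input lines or output bytes so a test run can't balloon in size.
$maxLines = 10000;     // arbitrary
$maxBytes = 10000000;  // roughly 10 MB of output, also arbitrary
$lines = 0;
$bytes = 0;

while (($line = fgets(STDIN)) !== false) {
    $out = trim($line) . "\n";
    echo $out;
    $lines++;
    $bytes += strlen($out);
    if ($lines >= $maxLines || $bytes >= $maxBytes) {
        break; // bail out before the job writes too much
    }
}
?>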

 

Hopefully, this information will be helpful to any beginners wanting to play with PHP on Hadoop or Amazon Elastic MapReduce. Please contact me if you have questions or comments.
