wwallace Posted December 31, 2013

So I really enjoyed the tutorial on this web site: http://www.fightswithbytes.com/2013/04/05/sample-wordcount-streaming-job-using-php-on-commoncrawl-dataset/ but I felt that it is missing a lot in terms of functionality. Specifically, with some modification to the code I was able to load URLs from the Common Crawl dataset, but I was unable to load the actual web page contents or any other metadata. I need a PHP script that can load the full contents of Common Crawl pages one at a time, so that I can run some sort of analysis on each individual page. How would I go about doing this?
dalecosp Posted December 31, 2013

If you have a URL, you can use something as simple as file_get_contents() to get page content. There are a few caveats: the PHP configuration must be set to allow this (see allow_url_fopen). The next option is cURL, which is often used for this sort of thing. Depending on your environment, there may be external programs that could be leveraged for this; for example, many 'Nix environments have the lynx browser installed, which can be called with "-dump" to give you a text dump of a page. Otherwise you're probably reduced to writing something yourself using the socket functions. HTH,
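For instance, something along these lines for the first two options (an untested sketch; it assumes allow_url_fopen is enabled for the first attempt and that the cURL extension is installed for the fallback, and the URL is just a placeholder):

<?php
// Try the simple approach first (requires allow_url_fopen = On in php.ini).
$url  = 'http://example.com/';                      // placeholder URL
$html = @file_get_contents($url);

if ($html === false && function_exists('curl_init')) {
    // Fall back to cURL if the URL fopen wrappers are disabled or the request failed.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // don't hang forever on a slow host
    $html = curl_exec($ch);
    curl_close($ch);
}

if ($html !== false) {
    echo strlen($html) . " bytes fetched from $url\n";
} else {
    echo "Could not fetch $url\n";
}
?>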
wwallace Posted December 31, 2013 (edited)

Did you actually manage to make this work? I've tried numerous methods, including some similar to what you're suggesting, but the Common Crawl data does not seem to work like a normal s3:// bucket. A few things which I suspect may complicate matters:

- The Common Crawl dataset is over one hundred terabytes in size, which is why I want to read it line by line as a stream
- Common Crawl is stored as an HDFS filesystem
- Working in Hadoop is very tricky, because of its mapper/reducer, key-value-pair architecture

My preference would be to work on the Common Crawl data without using Hadoop if possible. Unfortunately, there is almost no documentation for working with Common Crawl using PHP.

EDIT: I would pay money for someone to solve this problem for me.

Edited December 31, 2013 by wwallace
wwallace Posted December 31, 2013

A quick update on my situation, for reference to anyone else with the same issue: Amazon EMR with PHP is simply Hadoop streaming. Although documentation is sparse for EMR with PHP, there is a LOT of documentation for streaming Hadoop with PHP, though the key-value-pair and mapper-reducer nature of Hadoop still complicates things significantly for someone who is new to big data. Additionally, the structure of Common Crawl is outlined here. I'm not even close to having my problem solved yet, but I'll keep working at it and update for anyone who might be interested. In the meantime, please feel free to post your own solutions if you think you might be able to help out.
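For reference, the streaming invocation itself looks roughly like this (an untested sketch; the jar path and the s3:// locations are placeholders for your own cluster and buckets):

hadoop jar /path/to/hadoop-streaming.jar \
    -input s3://your-bucket/commoncrawl-input/ \
    -output s3://your-bucket/job-output/ \
    -mapper mapper.php \
    -reducer reducer.php \
    -file mapper.php \
    -file reducer.php

The mapper and reducer scripts just read lines from STDIN and write lines to STDOUT, which is why a PHP script with a #!/usr/bin/php shebang can be dropped in where most Hadoop streaming examples use Python or shell.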
wwallace Posted January 2, 2014 (edited)

So I've completed my task, and I wanted to add some tips and documentation in case anyone else has this problem using PHP and Hadoop together.

Here is my mapper.php. (There is no need for a reducer.php. I don't specify the file I'm opening because it's fed into STDIN as the input stream when I create a task in my EMR cluster.)

#!/usr/bin/php
<?php
// Hadoop streaming mapper: the input split arrives on STDIN, one line at a time.
$word2count = array();   // declared but not used in this mapper
$counter = 0;
$closeit = false;

while (($closeit == false) && (($line = fgets(STDIN)) !== false)) {
    $counter++;
    $line = strtolower(trim($line));
    echo "$line\n";      // anything written to STDOUT becomes the job's output

    // Safety valve: stop after 100 lines to keep test runs (and costs) small.
    if ($counter > 100) {
        $closeit = true;
    }
}
?>

Amazon EMR with PHP scripts is just Hadoop streaming. There is very little documentation for PHP on EMR, but there is TONS of documentation for Hadoop streaming. You can run PHP scripts using Hadoop streaming, reading the input from STDIN.

In a typical Hadoop job, any output your mapper script emits must be a key-value pair. This can be complex to debug and might cause problems for beginners. Because of this, I'd suggest running a "mapper-only" job if you're just beginning to learn Hadoop. This can be done in Hadoop streaming/EMR by specifying "NONE" as the location of your reducer script, as in the example after this post. When you do this, everything you write to STDOUT is written directly to the output files instead of going through the typical mapper/reducer process, and you can have just about any output you want, without worrying about your output causing mapper-reducer conflicts or anything like that. In other words, a mapper-only Hadoop job becomes a "cluster computing" job, where many computers split up and share the work.

Mapper-only jobs are a great way to get started with Hadoop without having to be a data scientist. But one word of caution: mapper-only jobs can create A LOT of output data very quickly, so make sure to put some controls in your PHP script to prevent accidentally racking up hundreds of dollars in storage, bandwidth and processing costs on Amazon.

Hopefully this information will be helpful to any beginners wanting to play with PHP on Hadoop or Amazon Elastic MapReduce. Please contact me if you have questions or comments.

Edited January 2, 2014 by wwallace
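To make the mapper-only setup concrete, the streaming step is the same invocation as before but with the reducer disabled (again an untested sketch; the jar path and s3:// locations are placeholders):

hadoop jar /path/to/hadoop-streaming.jar \
    -input s3://your-bucket/commoncrawl-input/ \
    -output s3://your-bucket/mapper-only-output/ \
    -mapper mapper.php \
    -reducer NONE \
    -file mapper.php

With -reducer NONE, whatever each mapper writes to STDOUT is copied straight into the part files at the output location, which is what makes the "cluster computing" style of job described above possible.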