So I've completed my task, and I wanted to add some tips and documentation in case anyone else runs into problems using PHP and Hadoop together.
Here is my mapper.php. (There's no need for a reducer.php. I don't specify the file I'm opening, because the input is fed into STDIN automatically when I create a task in my EMR cluster.)
#!/usr/bin/php
<?php
// Read lines from STDIN, normalize them, and echo them to STDOUT.
// The counter is a safety cap so a test run can't produce unbounded output.
$counter = 0;
while (($line = fgets(STDIN)) !== false) {
    $counter++;
    $line = strtolower(trim($line));
    echo "$line\n";
    if ($counter > 100) {
        break; // stop after ~100 lines to keep costs down
    }
}
?>
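Before running this on a cluster, you can test a streaming mapper locally by piping a sample file into it, since that's essentially what Hadoop streaming does (sample_input.txt is just a placeholder name here):

cat sample_input.txt | php mapper.php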
Amazon EMR with PHP scripts is just Hadoop streaming. There is very little documentation for PHP on EMR specifically, but there is TONS of documentation for Hadoop streaming, and nearly all of it applies.
With Hadoop streaming, your PHP script reads its input from STDIN. In a typical Hadoop job, everything your mapper writes to STDOUT must be a tab-separated key-value pair, so that Hadoop can sort the keys and route them to reducers. This can be complex to debug, and might cause problems for beginners.
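For example, a classic word-count mapper in PHP would emit one tab-separated key-value pair per word. This is just a minimal sketch of the convention, not the script from my actual job:

#!/usr/bin/php
<?php
// Word-count mapper sketch: emits "word<TAB>1" for each word.
// Hadoop streaming treats everything before the first tab as the key.
while (($line = fgets(STDIN)) !== false) {
    $words = preg_split('/\s+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo "$word\t1\n";
    }
}
?>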
Because of this, I'd suggest running a "mapper-only" job if you're just beginning to learn Hadoop. This can be done in Hadoop streaming/EMR by specifying "NONE" as the location of your reducer script (see the command sketch below). When you do this, everything you write to STDOUT is written directly to the output files instead of going through the usual sort-and-reduce phase, so you can emit just about any output you want without worrying about mapper-reducer conflicts.
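As a rough sketch of what that looks like on the command line (the jar path and S3 locations are placeholders, not values from my setup):

hadoop jar /path/to/hadoop-streaming.jar \
    -input s3://your-bucket/input/ \
    -output s3://your-bucket/output/ \
    -mapper mapper.php \
    -reducer NONE

On EMR you'd typically configure the same thing through the streaming step settings, with NONE entered in the reducer field.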
In other words, a mapper-only Hadoop job becomes a "cluster computing" job, where many computers split up and share the work.
Mapper-only jobs are a great way to get started with Hadoop, without having to be a data scientist.
But one word of caution: mapper-only jobs can create A LOT of output data very quickly. So make sure to put some controls in your PHP script (see the sketch below) to prevent accidentally racking up hundreds of dollars in storage, bandwidth, and processing costs on Amazon.
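For example, in addition to the line counter in my mapper above, you could cap the total bytes the script writes. This is just one possible guard, and the 10 MB limit is an arbitrary number I picked for the sketch:

#!/usr/bin/php
<?php
// Sketch of a byte cap: stop echoing once this mapper has written ~10 MB.
$bytesWritten = 0;
$maxBytes = 10 * 1024 * 1024; // arbitrary 10 MB cap per mapper
while (($line = fgets(STDIN)) !== false) {
    $out = strtolower(trim($line)) . "\n";
    $bytesWritten += strlen($out);
    if ($bytesWritten > $maxBytes) {
        break; // quit before writing more than the cap
    }
    echo $out;
}
?>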
Hopefully, this information will be helpful to any beginners wanting to play with PHP on Hadoop or Amazon Elastic MapReduce. Please contact me if you have questions or comments.