talfstad Posted March 2, 2009

Hey-

I am trying to create a PHP script which will read in a file (.html) and then echo only the email addresses onto the screen. Here's an example of what the .html file looks like:

**********************************************************
<tr valign="top">
  <td>African Student Drama Association </td>
  <td>Through the common interest of art, foster unity among students and scholars at SDSU. </td>
  <td>Adeyinka Glover </td>
  <td>[email protected]</td>
</tr>
<tr valign="top">
  <td><span style="font-family:times new roman;font-size:16px;">Air Force ROTC, Detachment 075 Honor Guard "The Nighthawks"</span> </td>
*********************************************

Ideally, I want to echo only the "[email protected]" back onto the screen. Here is the code I've created:

***********************************************************
<?php
$file = "./test2.txt";
$handle = @fopen($file, "r");
$reg = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/';
if ($handle) {
    while (!feof($handle)) {
        $buffer = fgetss($handle,4096);
    }
    if(preg_match_all($reg, $buffer, $matches)) {
        foreach( $matches as $val => $i) {
            echo $val[$i];
        }
    } else {
        echo "no emails in file";
    }
    fclose($handle);
}
?>
***************************************

This code returns "no emails in file". I am new to PHP and feel a little lost. Can anyone please help? I truly appreciate it; I've been working on this for a few hours too many.

Thank you

Link to comment: https://forums.phpfreaks.com/topic/147630-help-extracting-email-address-from-html-file/
.josh Posted March 3, 2009

$file = file_get_contents("./test2.txt");
// The i modifier lets the A-Z ranges match lowercase addresses as well.
preg_match_all('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/is', $file, $matches);
echo "<pre>";
print_r($matches); // debug output: shows everything that was captured
// $matches[0] holds the full matches; it is empty when nothing was found.
if (empty($matches[0])) {
    echo "no emails in file";
} else {
    foreach ($matches[0] as $email) {
        echo "$email<br/>";
    }
}
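For anyone wondering why the original script found nothing while this version works, here is a minimal sketch of the most likely cause. The pattern lists only `A-Z`, so without the `i` modifier it cannot match a lowercase address. The address `jane.doe@example.edu` is a made-up placeholder standing in for the redacted one in the first post, and the variable names are purely illustrative.

```php
<?php
// Hypothetical sample row; jane.doe@example.edu is a placeholder, not from the thread.
$html = '<tr valign="top"> <td>Adeyinka Glover </td> <td>jane.doe@example.edu</td> </tr>';

// Same pattern twice: once as in the first post, once with the i modifier added.
$caseSensitive   = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/';
$caseInsensitive = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i';

// preg_match_all() returns the number of full matches it found.
$strictCount = preg_match_all($caseSensitive, $html, $strictMatches);
$looseCount  = preg_match_all($caseInsensitive, $html, $looseMatches);

echo "without i: $strictCount match(es)\n";
echo "with i: $looseCount match(es)\n";

// Index 0 of the result holds the full matches in the default ordering.
foreach ($looseMatches[0] as $email) {
    echo $email, "\n";
}
```

The original loop also assigned `$buffer = fgetss(...)` on every pass, so only the last line read from the file was ever searched. Reading the whole file with `file_get_contents()` sidesteps that as well.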
talfstad Posted March 3, 2009

Solved this issue, guys. I thought I would post my solution for people out there to use.

*********************************************************************

So, here it goes. This piece of code goes in process.php:

<?php
$file = $_POST["filetogetemails"];
$handle = @fopen($file, "r");
$reg = '/[\s]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';
if ($handle) {
    // Start with an empty string so the .= below does not append to an undefined variable.
    $buffer = "";
    while (!feof($handle)) {
        // fgetss() reads one line and strips HTML tags (it was removed in PHP 8.0).
        $buffer .= fgetss($handle);
    }
    // PREG_SET_ORDER groups the results per match rather than per capture group.
    if(preg_match_all($reg, $buffer, $matches, PREG_SET_ORDER)) {
        foreach( $matches as $val) {
            foreach( $val as $i) {
                echo "$i<br />";
            }
        }
    } else {
        echo "no emails in file";
    }
    fclose($handle);
}
?>

And this piece of code goes in index.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Email Grabbing Genie!!!!</title>
</head>
<body>
  <form action="./process.php" method="post">
    Insert the URL and I will hook you up with all of the email addresses! <br />
    <input maxlength="150" name="filetogetemails" size="80" type="text" />
    <input type="submit" value="Get Those Emails!" />
  </form>
</body>
</html>

Now put both of these in the same directory and you have a somewhat decent email grabber! See ya guys!
nrg_alpha Posted March 3, 2009

Detecting valid emails is a hard thing to do, as the RFC specifications allow a great many valid forms. As a result, a very thorough validation function is pretty hefty in size and complexity. But thankfully, some people have already done the hard work for us.

I am seeing this link - iamcal - shown quite a lot on other sites for this very purpose. It is written by someone who has taken the time to really dissect what is valid using the RFC specs. The page I just provided gives a basic rundown (and even provides a 'simplified' function at the bottom). Also included at the bottom is a download link which points to a full-blown RFC 3696 parser function that does all the detailed, nitty-gritty validating (be warned, the function in that link is massive, due in part to all the comments flying around). But it seems extremely thorough.

As a result, look here to see what kind of email addresses were tested using RFC 3696 (as well as older parser versions). It seems there are far more valid email formats than people perhaps realize. Many things are surprisingly 'permitted' by the RFC but are not commonly used, which may throw some people off.

So if you are looking for ultra-strict RFC functionality, this would be something to consider. In any case, there is plenty of info to soak up within those links.
nrg_alpha Posted March 3, 2009

P.S. I realize that the stuff in the above links wouldn't be used to scan complete site pages, files or forms for valid emails. Rather, it could be used to further analyze what was initially 'fetched' from those, if someone really wanted extra validation measures in place.
talfstad Posted March 3, 2009

Thanks for the input nrg_alpha, I really appreciate anyone who takes the time to help. I also noticed that the regex I posted worked, but some emails were not being grabbed. So I changed the regex to:

'/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/'

and now it picks up everything that I need. You guys can check out what I've got over at http://stingur.com/grabemails/index.php

Also, I understand now that file_get_contents() is a much better way to get a file as a string than the way I used...
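A minimal sketch of why the change from `[\s]+` to `[\s]*` matters, assuming made-up addresses and an inline string in place of a real file. Once the tags are stripped, an address can sit at the very start of the text with no whitespace in front of it, and the `+` version requires at least one whitespace character there. Using `strip_tags()` on the whole string is an assumption here, swapped in for the `fgetss()` loop; in a real script `$html` would come from `file_get_contents()`.

```php
<?php
// Hypothetical input; in practice this would be file_get_contents($path).
$html = "<td>first@example.com</td>\n<td>Name</td> <td>second@example.org</td>";

// strip_tags() removes the markup in one pass (stand-in for the fgetss() loop).
$text = strip_tags($html);

$withPlus = '/[\s]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';
$withStar = '/[\s]*[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/';

preg_match_all($withPlus, $text, $plusMatches);
preg_match_all($withStar, $text, $starMatches);

// Both patterns capture the leading whitespace too, so trim each hit.
$plusEmails = array_map('trim', $plusMatches[0]);
$starEmails = array_map('trim', $starMatches[0]);

echo "with +: ", implode(", ", $plusEmails), "\n";
echo "with *: ", implode(", ", $starEmails), "\n";
```

The `+` version skips `first@example.com` because nothing precedes it, while the `*` version picks up both addresses.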
nrg_alpha Posted March 3, 2009

The thing about your pattern is that it will match stuff like:

...%[email protected]

What do you do with something like that?

Note that whenever you find yourself using character classes that contain [a-zA-Z0-9_], you can simplify things by using the word shorthand character class \w. You also don't need to encase your \s inside a character class, as \s is already a shorthand character class (in this case, for whitespace characters). You can also add an i modifier after the closing delimiter to make alpha characters case insensitive.

So your pattern (keeping the current format checking in place) could become:

/\s*[\w.%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i

But ultimately, understand that your pattern can match really oddball entries. So while I'm not sure what you do with those matches (aside from echoing them on screen), the in-depth email validation from the link I posted could go a long way toward checking whether what your pattern found is in fact a valid entry or not.
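A rough sketch of that two-stage idea: grab candidates with the simplified pattern, then run each one through a validator. PHP's built-in `filter_var()` with `FILTER_VALIDATE_EMAIL` is used here as a stand-in; it is not the RFC parser from the links above and does not cover everything the RFCs permit. The sample text and addresses are made up for illustration.

```php
<?php
// Hypothetical text with one oddball candidate and one ordinary address.
$text = "Write to ...%junk@example.com or alice@example.com today";

// The simplified pattern: \w shorthand, bare \s, and the i modifier.
$pattern = '/\s*[\w.%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i';

preg_match_all($pattern, $text, $matches);

// The pattern captures leading whitespace as well, so trim each candidate.
$candidates = array_map('trim', $matches[0]);

// Second stage: keep only candidates that the built-in validator accepts.
$valid = array_values(array_filter($candidates, function ($candidate) {
    return filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}));

echo "candidates: ", implode(", ", $candidates), "\n";
echo "valid: ", implode(", ", $valid), "\n";
```

Here the pattern fetches both entries, and the second pass drops the one whose local part starts with dots. Swapping the closure body for a call to a stricter RFC-based function would keep the same structure.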