Matazn Posted October 14, 2011

Hello World! I am having difficulties with a project that I am doing. The project instructions are below.

Parsing an Email

What do I mean by parse an email? Go to Gmail, open an incoming email, and do "View Original" from the drop down menu in the upper right corner of the email. You should see a text file of a raw email. The goal is to parse that and extract relevant fields.

Your Goals:
1.) Take in an email and parse it into relevant fields. Figure out what those relevant fields are.
2.) Create a good foundation. Try to write extensible/maintainable code.
3.) When you're done, list out next steps you'd take in your implementation. E.g., I'd focus on handling different MIME types in the body, etc.
4.) Don't use a library that does the parsing for you (we know we're making you reinvent the wheel here), but feel free to use any non-email-specific libraries you want.

I've gotten the coding for the interface done, I just need to do the parse part. Right now this is what I have.

list($_1stpt, $_2ndpt, $_3rdpt) = split('["html; charset=ISO-8859-1/"\--]', $email);
echo "1st string: $_1stpt; 2nd string: $_2ndpt; 3rd string: $_3rdpt;<br />\n";

$email stores the raw text inputted in a plain text email. I'm using the following example.

$email =
MIME-Version: 1.0
Received: by 11.111.111.60 with HTTP; Mon, 10 Oct 2011 16:25:25 -0700 (PDT)
Date: Mon, 10 Oct 2011 19:25:25 -0400
Delivered-To: JohnDoe@gmail.com
Message-ID: <CALvS2DburDwshW39qKbX09Q4aDX=tnhmvLMjD52YM2Kkydvrww@mail.gmail.com>
Subject: Program_Template
From: John Doe <JohnDoe@gmail.com>
To: JohnDoe@gmail.com
Content-Type: multipart/alternative; boundary=000325557fde0fcc6804aefa1bf0

--000325557fde0fcc6804aefa1bf0
Content-Type: text/plain; charset=ISO-8859-1

Hello World!

Regards,

Self

--000325557fde0fcc6804aefa1bf0
Content-Type: text/html; charset=ISO-8859-1

Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>

--000325557fde0fcc6804aefa1bf0--

Pretty much I'm just trying to extract the Date:, Subject:, From:, To: and "Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>" parts of the message and set everything else to a Null string "".

Thanks in advance for your help!
Matazn Posted October 14, 2011

Also, in my following lines of code, I'm trying to search for both html; charset=ISO-8859-1 AND -- (double hyphen).

list($_1stpt, $_2ndpt, $_3rdpt) = split('["html; charset=ISO-8859-1/"\--]', $email);
echo "1st string: $_1stpt; 2nd string: $_2ndpt; 3rd string: $_3rdpt;<br />\n";

Which (in my theory) should separate the three parts into the following.

$_1stpt =
MIME-Version: 1.0
Received: by 10.220.189.68 with HTTP; Mon, 10 Oct 2011 16:25:25 -0700 (PDT)
Date: Mon, 10 Oct 2011 19:25:25 -0400
Delivered-To: mathewasantiago@gmail.com
Message-ID: <CALvS2DburDwshW39qKbX09Q4aDX=tnhmvLMjD52YM2Kkydvrww@mail.gmail.com>
Subject: Program_Template
From: Mathew Santiago <mathewasantiago@gmail.com>
To: mathewasantiago@gmail.com
Content-Type: multipart/alternative; boundary=000325557fde0fcc6804aefa1bf0

--000325557fde0fcc6804aefa1bf0
Content-Type: text/plain; charset=ISO-8859-1

Hello World!

Regards,

Self

--000325557fde0fcc6804aefa1bf0
Content-Type: text/html; charset=ISO-8859-1

$_2ndpt =
Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>

$_3rdpt =
--000325557fde0fcc6804aefa1bf0--

I'm just having serious logic/syntax difficulties. I just wanted to clear that up just in case anyone asked.
requinix Posted October 14, 2011

It's better to not use regular expressions for this. Well, at least not yet.

All emails are constructed in a specific fashion:

Header: Value
Header: Value
Header: Value

Content

That is,
- Multiple header lines in the form of the header name (no spaces), maybe spaces, a colon, maybe spaces, and then the value up until the end of the line
- An empty line
- The email content. If the email is multipart then there's a specific form for this too but apparently you don't need to handle this (though you are supposed to explain how you would)

Start by reading header lines. When you hit an empty line, stop reading headers and pull the rest of the email out as the content. The headers can be split fairly easily without a regex but it's excusable to use one anyways.
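The "headers, empty line, content" layout described above can be sketched in PHP roughly as follows. This is only an illustration, not code from the thread: the function name split_message is made up, and it assumes the whole raw message is already in one string and that it has no folded (continued) header lines that need special care.

```php
<?php
// Hypothetical helper (not from the thread): split a raw message into
// its header block and its content at the first empty line.
function split_message(string $raw): array
{
    // Normalise CRLF / CR line endings to "\n" so one rule covers all input
    // (text posted from a <textarea> usually arrives with "\r\n").
    $raw = str_replace(["\r\n", "\r"], "\n", $raw);

    // Limit of 2: only the FIRST empty line separates headers from content;
    // later empty lines belong to the content itself.
    $parts = explode("\n\n", $raw, 2);

    return [
        'headers' => $parts[0],
        'content' => $parts[1] ?? '',
    ];
}

// Shortened sample in the shape of the message from the first post.
$raw = "Subject: Program_Template\nTo: JohnDoe@gmail.com\n\nHello World!\n\nRegards,";
$message = split_message($raw);
echo $message['headers'], "\n---\n", $message['content'], "\n";
```

Using explode() with a limit keeps this step free of regular expressions, in line with the advice above.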
Matazn Posted October 15, 2011

I understand what you mean, however, I just learned PHP 3 days ago and am unsure of what I could use to do that. I apologize for being a noob!
requinix Posted October 17, 2011

Well, if you have the entire email as a string, you can break it apart on newlines (\n). Then loop over all the lines: at the start of the loop you're parsing headers and can break lines apart on colons to get header/value pairs - don't forget to remove leading and trailing whitespace from them. When you find an empty line you exit the loop.

When you have a header/value pair, look at the header and decide whether it's one you care about. If so, store the value somewhere. After the loop you can print out those values.
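A minimal sketch of that loop, under some assumptions: parse_headers and its $wanted parameter are names invented for this example, header names are compared case-insensitively, and folded header lines (continuations that start with whitespace) are simply ignored rather than joined to the previous header.

```php
<?php
// Hypothetical helper (not from the thread): collect only the headers
// whose lower-cased names appear in $wanted.
function parse_headers(string $raw, array $wanted): array
{
    $found = [];
    $lines = explode("\n", str_replace(["\r\n", "\r"], "\n", $raw));

    foreach ($lines as $line) {
        if (trim($line) === '') {
            break; // empty line: the headers are over
        }
        // Limit of 2 keeps colons inside the value (e.g. "19:25:25") intact.
        $pair = explode(':', $line, 2);
        if (count($pair) < 2) {
            continue; // not a "Header: Value" line; skipped in this sketch
        }
        $name = strtolower(trim($pair[0]));
        if (in_array($name, $wanted, true)) {
            $found[$name] = trim($pair[1]);
        }
    }

    return $found;
}

$raw = "Date: Mon, 10 Oct 2011 19:25:25 -0400\n"
     . "Subject: Program_Template\n"
     . "From: John Doe <JohnDoe@gmail.com>\n"
     . "To: JohnDoe@gmail.com\n"
     . "\n"
     . "Hello World!";

foreach (parse_headers($raw, ['date', 'subject', 'from', 'to']) as $name => $value) {
    echo $name, ': ', $value, "\n";
}
```

Passing the list of interesting headers in as an argument means a new field can be added later without touching the loop.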
Matazn Posted October 18, 2011

Ahh I see... Maybe I should try it your way! This is the code I ended up doing (before I read the post).

Page.php

<?php
//This program was created by Mathew Santiago and his friend google.

//This is to verify whether you've entered text or not
$email = "";
if (isset($_POST['email']))
    $email = fix_string($_POST['email']);

$fail = validate_email($email);

function validate_email($email) {
    if ($email == "") return "No Email was entered<br />";
    else if (!((strpos($email, ".") > 0) && (strpos($email, "@") > 0)) || preg_match("/[^a-zA-Z0-9.@_-]/", $email))
        return "The Email is invalid<br />";
    return "";
    echo $email;
}

echo "<html><head><title>An Example Form</title>";

if ($fail == "") {
    echo "</head><body>Form data successfully validated: $email.</body></html>";
    // This is where you would enter the posted fields into a database
    exit;
}

// Now output the HTML and JavaScript code
?>
<style>.signup { border: 1px solid #999999; font: normal 14px helvetica; color:#492842; } </style>
<script type="text/javascript">
<style>.signup { border: 1px solid #999999; font: normal 14px helvetica; color:#444444; }</style>
<script type="text/javascript">
function validate(form) {
    fail = validateEmail(form.email.value)
    if (fail == "") return true
    else { alert(fail); return false }
}
//Sets up borders for Parse Monster and text area to input raw email text
</script></head><body>
<table class="signup" border="0" cellpadding="2" cellspacing="5" bgcolor="#eeeeee">
<th colspan="2" align="center">Parse Monster!</th>
<tr><td colspan="2">Please begin the parsing process by inputing<br /> desired text below to be parsed! =D
<form method="post" action="parse.php" onSubmit="return validate(this)">
<br /><br /><br />
<textarea id="email" name="email" style="margin-left: 2px; margin-right: 2px; width: 300px; margin-top: 2px; margin-bottom: 2px; height: 32px; "></textarea>
</tr><tr><td colspan="2" align="center">
<input type="submit" value="Parse" /></td>
</tr></form></table>

parse.php

<?php
validate_email($email);
$email = $_POST['email'];

echo "<br><br>--------------------Inputed Email--------------------<br><br>" . $email;
echo '<br><br><br><br>--------------------Parsed Email--------------------';

/*Separates email into manageable parts to use.
Repeat process of pulling text from strings until all required fields are found.*/
$partone = substr($email, strripos($email,"html; charset=ISO-8859-1")+strlen("html; charset=ISO-8859-1"));
$get_from = substr($email, strripos($email,"From:")+strlen("From:"));
$get_date = substr($email, strripos($email,"Date:")+strlen("Date:"));
$get_sub = substr($email, strripos($email,"Subject:")+strlen("Subject:"));

//Separates further
$parttwo = explode("--",$partone);
$from = explode("To:", $get_from);
$to = explode("Content", $from[1]);
$date = explode("Delivered", $get_date);
$sub = explode ("From:", $get_sub);

//Output all the parsed data
echo "<br /> Date: " . $date[0];
echo "<br /> From: " . $from[0];
echo "<br /> To: " . $to[0];
echo "<br /> Subject: " . $sub[0];
echo "<br /><br />" . $parttwo[0] . "<br />";
?>
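Since the sample message is multipart/alternative, the text/html part could also be pulled out by splitting the content on its boundary instead of searching for fixed strings such as "html; charset=ISO-8859-1". The sketch below is an assumption-laden illustration, not code from the thread: split_multipart is a made-up name, the boundary is passed in by hand rather than read from the Content-Type header, and every part is assumed to have its own headers followed by an empty line.

```php
<?php
// Hypothetical helper (not from the thread): split multipart content into
// its parts, each with its own header block and content.
function split_multipart(string $body, string $boundary): array
{
    $body   = str_replace(["\r\n", "\r"], "\n", $body);
    $chunks = explode('--' . $boundary, $body);
    array_shift($chunks); // drop whatever precedes the first delimiter

    $parts = [];
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // "--boundary--" closes the multipart content
        }
        // Assumes part headers, then an empty line, then the part content.
        $pieces  = explode("\n\n", ltrim($chunk), 2);
        $parts[] = [
            'headers' => $pieces[0],
            'content' => trim($pieces[1] ?? ''),
        ];
    }

    return $parts;
}

$boundary = '000325557fde0fcc6804aefa1bf0';
$body = "--$boundary\n"
      . "Content-Type: text/plain; charset=ISO-8859-1\n\n"
      . "Hello World!\n\nRegards,\n\nSelf\n"
      . "--$boundary\n"
      . "Content-Type: text/html; charset=ISO-8859-1\n\n"
      . "Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>\n"
      . "--$boundary--\n";

// Print only the HTML alternative.
foreach (split_multipart($body, $boundary) as $part) {
    if (stripos($part['headers'], 'text/html') !== false) {
        echo $part['content'], "\n";
    }
}
```

Because nothing here depends on the charset or the exact header order, the same function should keep working when the message uses a different encoding or carries more than two parts.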