
Parsing email raw text


Matazn


Hello World! I am having difficulties with a project that I am doing. The project instructions are below.

 

 

Parsing an Email

 

What do I mean by parse an email?  Go to Gmail, open an incoming email, and choose "View Original" from the drop-down menu in the upper right corner of the email.

 

You should see a text file of a raw email.  The goal is to parse that and extract relevant fields.

 

Your Goals:

1.) Take in an email and parse it into relevant fields.  Figure out what those relevant fields are.

2.) Create a good foundation.  Try to write extensible/maintainable code.

3.) When you're done, list out the next steps you'd take in your implementation.  E.g., I'd focus on handling different MIME types in the body, etc.

4.) Don't use a library that does the parsing for you (we know we're making you reinvent the wheel here), but feel free to use any non-email-specific libraries you want.

 

I've finished coding the interface; I just need to do the parsing part.

 

Right now this is what I have.

 

list($_1stpt, $_2ndpt, $_3rdpt) = split('["html; charset=ISO-8859-1/"\--]', $email);

echo "1st string: $_1stpt; 2nd string: $_2ndpt; 3rd string: $_3rdpt;<br />\n";

 

$email stores the raw text of a plain-text email as it was entered.

 

I'm using the following example.

 

$email = MIME-Version: 1.0

Received: by 11.111.111.60 with HTTP; Mon, 10 Oct 2011 16:25:25 -0700 (PDT)

Date: Mon, 10 Oct 2011 19:25:25 -0400

Delivered-To: JohnDoe@gmail.com

Message-ID: <CALvS2DburDwshW39qKbX09Q4aDX=tnhmvLMjD52YM2Kkydvrww@mail.gmail.com>

Subject: Program_Template

From: John Doe <JohnDoe@gmail.com>

To: JohnDoe@gmail.com

Content-Type: multipart/alternative; boundary=000325557fde0fcc6804aefa1bf0

 

--000325557fde0fcc6804aefa1bf0

Content-Type: text/plain; charset=ISO-8859-1

 

Hello World!

 

Regards,

 

Self

 

--000325557fde0fcc6804aefa1bf0

Content-Type: text/html; charset=ISO-8859-1

 

Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>

 

--000325557fde0fcc6804aefa1bf0--

 

 

 

 

Pretty much, I'm just trying to extract the Date:, Subject:, From:, and To: headers and the "Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>" part of the message, and set everything else to an empty string "".

 

 

Thanks in advance for your help! :D


Also, in the following lines of code, I'm trying to search for both html; charset=ISO-8859-1 AND -- (double hyphen).

 

list($_1stpt, $_2ndpt, $_3rdpt) = split('["html; charset=ISO-8859-1/"\--]', $email);

echo "1st string: $_1stpt; 2nd string: $_2ndpt; 3rd string: $_3rdpt;<br />\n";

 

Which (in my theory) should separate the three parts into the following.

 

 

$_1stpt =  MIME-Version: 1.0

Received: by 10.220.189.68 with HTTP; Mon, 10 Oct 2011 16:25:25 -0700 (PDT)

Date: Mon, 10 Oct 2011 19:25:25 -0400

Delivered-To: mathewasantiago@gmail.com

Message-ID: <CALvS2DburDwshW39qKbX09Q4aDX=tnhmvLMjD52YM2Kkydvrww@mail.gmail.com>

Subject: Program_Template

From: Mathew Santiago <mathewasantiago@gmail.com>

To: mathewasantiago@gmail.com

Content-Type: multipart/alternative; boundary=000325557fde0fcc6804aefa1bf0

 

--000325557fde0fcc6804aefa1bf0

Content-Type: text/plain; charset=ISO-8859-1

 

Hello World!

 

Regards,

 

Self

 

--000325557fde0fcc6804aefa1bf0

Content-Type: text/html; charset=ISO-8859-1

 

 

$_2ndpt =  Hello World!<div><br></div><div>Regards,</div><div><br></div><div>Self</div>

 

$_3rdpt = --000325557fde0fcc6804aefa1bf0--

 

I'm just having serious logic/syntax difficulties. I just wanted to clear that up just in case anyone asked.
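
One likely reason the split() call above does not give those three parts: inside a regular expression, square brackets define a character class (a set of single characters, with hyphens read as ranges), not a list of whole strings to look for. split() was also deprecated in PHP 5.3 and removed in PHP 7. Below is a minimal sketch of the intended three-way cut using plain string functions instead. It assumes the HTML part is the last part of the message, and split_at_html_part is a hypothetical name, not a PHP built-in.

```php
<?php
// Hypothetical helper: cut a raw email into the three parts described above,
// using plain string searches instead of a regular expression.
function split_at_html_part($email)
{
    $marker = 'html; charset=ISO-8859-1';

    // Position of the last occurrence of the marker, or false if it is absent.
    $pos = strrpos($email, $marker);
    if ($pos === false) {
        return null;
    }
    $cut = $pos + strlen($marker);

    $first = substr($email, 0, $cut);   // everything up to and including the marker
    $rest  = substr($email, $cut);      // HTML body plus the closing boundary

    // The closing boundary starts with "--" at the beginning of a line.
    $end = strpos($rest, "\n--");
    if ($end === false) {
        return array($first, trim($rest), '');
    }
    return array($first, trim(substr($rest, 0, $end)), trim(substr($rest, $end)));
}
```

With the sample email, list($_1stpt, $_2ndpt, $_3rdpt) = split_at_html_part($email); should leave the HTML body in $_2ndpt and the closing boundary line in $_3rdpt.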

 


It's better to not use regular expressions for this. Well, at least not yet.

 

All emails are constructed in a specific fashion:

Header: Value
Header: Value
Header: Value

Content

That is,

- Multiple header lines in the form of the header name (no spaces), maybe spaces, a colon, maybe spaces, and then the value up until the end of the line

- An empty line

- The email content. If the email is multipart then there's a specific form for this too but apparently you don't need to handle this (though you are supposed to explain how you would)
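
To illustrate the multipart form mentioned in the last point: the top-level Content-Type header names a boundary string, each part begins on a line made of two hyphens plus that boundary, and the last part is followed by the same line with two more hyphens appended. Each part is itself headers, an empty line, and content. A rough sketch, assuming the boundary has already been read from the header and that line endings are LF or CRLF; split_multipart is a hypothetical name:

```php
<?php
// Hypothetical helper: split a multipart body into its raw parts.
// $boundary is the value of the boundary= parameter of the Content-Type header.
function split_multipart($body, $boundary)
{
    $delimiter = '--' . $boundary;
    $parts = array();
    $current = null;              // stays null until the first delimiter line

    foreach (preg_split('/\r\n|\n/', $body) as $line) {
        if (rtrim($line) === $delimiter . '--') {
            break;                // closing delimiter: nothing after it is a part
        }
        if (rtrim($line) === $delimiter) {
            if ($current !== null) {
                $parts[] = implode("\n", $current);
            }
            $current = array();   // start collecting the next part
            continue;
        }
        if ($current !== null) {
            $current[] = $line;
        }
    }
    // Keep the last part even if the closing delimiter is missing.
    if ($current !== null) {
        $parts[] = implode("\n", $current);
    }
    return $parts;
}
```

For the sample email in this thread, this would return two parts: the text/plain one and the text/html one, each still carrying its own Content-Type header.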

 

Start by reading header lines. When you hit an empty line, stop reading headers and pull the rest of the email out as the content. The headers can be split fairly easily without a regex but it's excusable to use one anyways.
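
The step described above can be sketched roughly like this. It assumes the whole raw message is already in a string; split_email is a hypothetical name, not part of PHP.

```php
<?php
// Hypothetical helper: separate the header block from the content at the
// first empty line.
function split_email($raw)
{
    // Normalise CRLF (and bare CR) line endings to LF.
    $raw = str_replace(array("\r\n", "\r"), "\n", $raw);

    // Limit of 2: cut only at the first empty line, because later empty
    // lines belong to the content.
    $pieces = explode("\n\n", $raw, 2);

    return array(
        'headers' => $pieces[0],
        'content' => isset($pieces[1]) ? $pieces[1] : '',
    );
}
```

The 'headers' string can then be processed line by line, and 'content' is whatever is left.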


Well, if you have the entire email as a string, you can break it apart on newlines (\n). Then loop over all the lines: at the start of the loop you're parsing headers, and can break each line apart at its first colon to get a header/value pair (only the first, because values such as dates contain colons too). Don't forget to remove leading and trailing whitespace from both. When you find an empty line, you exit the loop.

 

When you have a header/value pair, look at the header and decide whether it's one you care about. If so, store the value somewhere. After the loop you can print out those values.
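
As a rough illustration of that loop, the sketch below assumes the four headers asked for in the first post (Date, Subject, From, To) and splits each line at its first colon only. Lines that begin with a space or tab are treated as continuations of the previous header, which is how long header values are folded across lines. parse_headers is a hypothetical name.

```php
<?php
// Hypothetical helper: collect the wanted headers from a raw email string.
function parse_headers($raw, $wanted = array('Date', 'Subject', 'From', 'To'))
{
    $found = array();
    $last = null;   // name of the most recently stored header, for folded lines

    foreach (explode("\n", $raw) as $line) {
        $line = rtrim($line, "\r");
        if (trim($line) === '') {
            break;                       // empty line: the headers are over
        }
        // A line starting with a space or tab continues the previous header.
        if ($line[0] === ' ' || $line[0] === "\t") {
            if ($last !== null) {
                $found[$last] .= ' ' . trim($line);
            }
            continue;
        }
        // Split at the first colon only; the value may contain more colons.
        $pair = explode(':', $line, 2);
        $last = null;
        if (count($pair) < 2) {
            continue;                    // not a "Header: value" line
        }
        $name = trim($pair[0]);
        foreach ($wanted as $w) {
            // Header names are compared without regard to case.
            if (strcasecmp($name, $w) === 0) {
                $found[$w] = trim($pair[1]);
                $last = $w;
            }
        }
    }
    return $found;
}
```

Calling print_r(parse_headers($email)); on the sample email should then list just those four values.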


Ahh, I see... Maybe I should try it your way! This is the code I ended up writing (before I read the post).

 

Page.php

 

 

<?php
// This program was created by Mathew Santiago and his friend Google.

// This is to verify whether you've entered text or not
$email = "";

if (isset($_POST['email'])) {
    // fix_string() is not defined in this listing; trim() is used here
    // as a stand-in (assumption).
    $email = trim($_POST['email']);
}

$fail = validate_email($email);

// Returns an error message, or "" when the input passes the checks.
function validate_email($email)
{
    if ($email == "") {
        return "No Email was entered<br />";
    }
    // A raw email contains spaces, colons and newlines, so no character
    // whitelist is applied; only require a "." and an "@" somewhere
    // after the first character.
    if (!(strpos($email, ".") > 0 && strpos($email, "@") > 0)) {
        return "The Email is invalid<br />";
    }
    return "";
}

echo "<html><head><title>An Example Form</title>";

if ($fail == "") {
    // Escape the text so that <...> sequences in it are shown, not parsed as HTML.
    echo "</head><body>Form data successfully validated: "
        . htmlspecialchars($email) . ".</body></html>";

    // This is where you would enter the posted fields into a database
    exit;
}

// Now output the HTML and JavaScript code
?>

 

<style>.signup { border: 1px solid #999999;
font: normal 14px helvetica; color:#492842; }
</style>
<script type="text/javascript">
// Client-side check before the form is submitted
function validate(form)
{
    // validateEmail() is not defined in this listing; a plain empty-field
    // check stands in for it here (assumption).
    var fail = (form.email.value == "") ? "No Email was entered" : ""
    if (fail == "") return true
    else { alert(fail); return false }
}

//Sets up borders for Parse Monster and text area to input raw email text
</script></head><body>

<form method="post" action="parse.php"
onSubmit="return validate(this)">
<table class="signup" border="0" cellpadding="2"
cellspacing="5" bgcolor="#eeeeee">
<tr><th colspan="2" align="center">Parse Monster!</th></tr>
<tr><td colspan="2">Please begin the parsing process by inputting<br />
desired text below to be parsed! =D
<br /><br /><br />
<textarea id="email" name="email" style="margin-left: 2px;
margin-right: 2px;
width: 300px; margin-top: 2px; margin-bottom: 2px; height: 32px;
"></textarea></td>
</tr><tr><td colspan="2" align="center">
<input type="submit" value="Parse" /></td>
</tr></table></form>
</body></html>

 

parse.php

 

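<?php

// Read the raw email text posted by the form in Page.php.
$email = isset($_POST['email']) ? $_POST['email'] : '';

// Escape the raw text so that addresses such as <JohnDoe@gmail.com> are
// displayed rather than being treated as HTML tags.
echo "<br><br>--------------------Inputted Email--------------------<br><br>" . htmlspecialchars($email);

echo '<br><br><br><br>--------------------Parsed Email--------------------';

/*Separates email into manageable parts to use.
Repeat process of pulling text from strings until all required fields are found.
Each substr() keeps everything after the last occurrence of its marker.*/
$partone = substr($email, strripos($email, "html; charset=ISO-8859-1") + strlen("html; charset=ISO-8859-1"));
$get_from = substr($email, strripos($email, "From:") + strlen("From:"));
$get_date = substr($email, strripos($email, "Date:") + strlen("Date:"));
$get_sub = substr($email, strripos($email, "Subject:") + strlen("Subject:"));

//Separates further: each piece is cut off at the header that follows it in
//the sample email, so this depends on the headers arriving in that order.
$parttwo = explode("--", $partone);
$from = explode("To:", $get_from);
$to = explode("Content", isset($from[1]) ? $from[1] : "");
$date = explode("Delivered", $get_date);
$sub = explode("From:", $get_sub);

//Output all the parsed data (header values are escaped for display)
echo "<br /> Date: " . htmlspecialchars($date[0]);
echo "<br /> From: " . htmlspecialchars($from[0]);
echo "<br /> To: " . htmlspecialchars($to[0]);
echo "<br /> Subject: " . htmlspecialchars($sub[0]);

// The HTML part is printed unescaped so that it renders as HTML.
echo "<br /><br />" . $parttwo[0] . "<br />";
?>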

 

 
