Need help splitting string up!


rdrews

Ok, here is a sample of the string I need broken up...

 

******Begin Sample*********

 

<div class="PubSectionHeader"><font size="+0">Bill Smith 48415126</font></div>

<br>

<a name="NDR4pDQoQv5rq1MQk"></a>

<div class="PubNote">

<div class="PubNoteContentArea">Called Customer and blah blah blah blah. abc. 10/12/09<blockquote class="gn_c"> She doesn&#39;t want to be contacted on this number but said okay. 10/12/09</blockquote></div>

</div> <a name="NDRLWDAoQ2qq687wk"></a>

<div class="PubNote">

<div class="PubNoteContentArea">Spoke to customer on alternate number. She said she blah blah blah blah blah. I told her as long as we receive it within a week, no problems. abc 9/18/09</div>

</div> <a name="NDRykDAoQv__VvaUk"></a>

<div class="PubNote">

 

<div class="PubNoteContentArea">Left message on premise about issues. abc 7/7/09</div>

</div> <a name="NDQrlDAoQxbL1_54k"></a>

<div class="PubNote">

<div class="PubNoteContentArea">this is another part of the comment that I need.  I think I am going about things the wrong way. abc 6/17/09<blockquote class="gn_c"> She said she&#39;ll call tech support. abc 6/17/09</blockquote></div>

</div> <a name="NDQopDQoQ18mwt5wk"></a>

<div class="PubNote">

<div class="PubNoteContentArea">Called customer about issue. Left message on alternate number. abc 6/9/09 </div>

</div> <a name="NDQduDQoQoNiBhZMk"></a>

 

<div class="PubNote">

<div class="PubNoteContentArea">call was returned &#39;. abc 5/11/09<blockquote class="gn_c"> Customer said she doesn&#39;t want to be transferred to tech support (even though she said her system doesn&#39;t work). She asked why we&#39;re so hard to get a hold of. I let her know she can call the number to contact us. abc 5/11/09<br>Removed from list. abc 5/11/09</blockquote></div>

</div> <a name="NDQ7QIgoQ9aWRsJgj"></a>

<div class="PubNote">

<div class="PubNoteContentArea">Big problems. abc </div>

 

</div> <a name="SDUThIgoQrLX5r5gj"></a>

<div class="PubSectionHeader"><font size="+0">Mark and Larry 700002</font></div>

<br>

<a name="NDSOkDAoQ5-TduIgk"></a>

<div class="PubNote">

<div class="PubNoteContentArea"><span>update per user,  told the customer if he gets two remotes we can wave fee  . bill 04/08/09 2:00 pm </span><div> </div></div>

</div> <a name="NDQmeDQoQjdS-uIgk"></a>

<div class="PubNote">

<div class="PubNoteContentArea">yada yada yada yada .  <div>bill 04/08/09</div></div>

</div> <a name="NDSGpIgoQ56b7r5gj"></a>

 

<div class="PubNote">

<div class="PubNoteContentArea">another note here. abc </div>

</div> <a name="SDQqRIwoQyIDmr5gj"></a>

 

*********End Sample**************

 

 

Ok...all of this is HTML source from Google Docs that I saved into a large .txt file.  Basically, I need all of this broken into two parts.  Account numbers and comments. 

 

All the account numbers are found right after the name which is AFTER the <div class="PubSectionHeader"><font size="+0"> and BEFORE the </font>

 

And all the comments look to be between <div class="PubNoteContentArea"> and </div> but there are usually several separate notes for each account number.  So a single account number may have several <div class="PubNoteContentArea"> note areas. 

 

Ideally I would like to run through this whole text file and end up with two arrays.  One array would be all account numbers (accountNum[0] = "123", accountNum[1] = "456", etc...) and the other array would be all the notes (notes[0] = "notes for account 123", notes[1] = "notes for account 456", etc...) but if it's easier/makes more sense to do one array where the first element would be the account number, the second the notes for the account in the first element, the third, another account number, etc.... then I can work with that too. 

 

I realize there is some additional formatting inside some of the <div class="PubNoteContentArea"> note areas, like <blockquote class="gn_c"> and maybe some other stuff, but for now I'm not really worried about all that.  I can maybe put the whole file into Word or Excel and do a few find/replaces to get rid of some of it. 

 

Up to this point I have loaded the whole file contents into a string and then split the string into a character array where every character (including whitespaces) is an element in the array.  I then tried to start messing with the regex part of it and decided I wasn't getting anywhere after a while of playing around with it.  Any help is greatly appreciated.  If I didn't explain things very well feel free to ask me to clarify. 

 

Thanks!


You may well be better off using some kind of DOMDocument but since you posted under PHP Regex...

 

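// $input is assumed to hold the whole file as one string.
// Group 1 captures the name, group 2 the account number.
$pattern = '#<div class="PubSectionHeader"><font size="\+0">([a-z ]+?) ([0-9]+)</font></div>#is';
preg_match_all($pattern, $input, $matches);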

echo '<pre>';
print_r($matches);
echo '</pre>';

 

Only thrown together quickly, but it should more or less work. There are probably other characters you will need to allow in the first character class, for example a dash (-) for double-barrelled names and an apostrophe (') for O'Reilly etc.
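
As a hedged sketch of one way around the character-class question: instead of listing every character a name may contain, the name group can accept anything that is not a `<`. The function name `extractAccounts` is hypothetical, and the entity decoding assumes names are encoded the same way as the notes in the sample (for example `&#39;` for an apostrophe), which the sample does not actually show.

```php
<?php
// Hypothetical helper, not from the thread: returns [accountNumber => name].
function extractAccounts(string $html): array
{
    // ([^<]+?) accepts any run of characters that stays inside the <font>
    // element, so hyphens, apostrophes and entities need no special listing.
    $pattern = '#<div class="PubSectionHeader"><font size="\+0">([^<]+?)\s+([0-9]+)</font></div>#i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $accounts = [];
    foreach ($matches as $match) {
        // Assumption: names may contain entities such as &#39; like the notes do.
        $accounts[$match[2]] = html_entity_decode(trim($match[1]), ENT_QUOTES);
    }
    return $accounts;
}
```

Note that PHP turns purely numeric string keys into integers, so the account numbers come back as integer keys unless they have leading zeros.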



Awesome, thanks!  That looks very close to what I am looking for.  I will play with it for a bit and come back if I have any more issues. 



Ok...so with your help I get $matches[2] which holds all the account numbers so that's step one.  I've been working on step two and can't quite get there...apparently I need to be spoon fed.  I'm having trouble figuring out how to get ALL the notes under a particular account number into one element of an array.  If I use something like

 

$pattern = '#<div class="PubNoteContentArea">.</div>#is';

preg_match_all($pattern, $contents, $matches);

 

(which I haven't gotten working quite yet) won't that just put each note between the div tags into a separate element?  How do I tell it to combine all the notes after one account number and before the next account number into one element?  I know that cags mentioned possibly using DOMDocument().  Should I post this somewhere else or can this be done using regex? 

 

Thanks again for the help!
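
For what it's worth, here is a minimal regex-only sketch of the grouping step, assuming the file always looks like the sample: each note's content area is closed by `</div>` and then, after optional whitespace, by the `</div>` of its PubNote wrapper. The function name `groupNotesByAccount` is hypothetical. The idea is to split the whole string on the section headers with `preg_split()`, keeping the captured account numbers, and then collect the notes inside each chunk.

```php
<?php
// Hypothetical helper, not from the thread: returns [accountNumber => notes],
// where all notes of one account are joined into a single string.
function groupNotesByAccount(string $html): array
{
    // Split on the section headers; PREG_SPLIT_DELIM_CAPTURE keeps the
    // captured account number as its own element in the result.
    $headerPattern = '#<div class="PubSectionHeader"><font size="\+0">[^<]*?\s(\d+)</font></div>#i';
    $parts = preg_split($headerPattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $notesByAccount = [];
    // $parts[0] is whatever precedes the first header; after that the array
    // alternates: account number, section body, account number, ...
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        // Assumption: a content area ends at </div> followed by another </div>.
        preg_match_all(
            '#<div class="PubNoteContentArea">(.*?)</div>\s*</div>#is',
            $parts[$i + 1],
            $m
        );
        $notes = [];
        foreach ($m[1] as $raw) {
            $text = preg_replace('#<[^>]+>#', ' ', $raw);   // tags become spaces
            $text = html_entity_decode($text, ENT_QUOTES);  // &#39; becomes '
            $notes[] = trim(preg_replace('/\s+/', ' ', $text));
        }
        // A repeated account number would overwrite its earlier entry here.
        $notesByAccount[$parts[$i]] = implode("\n", $notes);
    }
    return $notesByAccount;
}
```

Calling it on the string you already loaded from the file should give one element per account number, with that account's notes separated by newlines. If the real markup nests more deeply than the sample, a DOMDocument-based approach is likely to be more robust than this.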

