Extracting certain data from large strings?


theprovider


OK, I have some experience with PHP and regex, but this is just over my head.  :(

 

Here is an example for the original string input:

 

<td><font face="verdana,sans-serif" size=1> 153277</td>

<td> <a href="/url/"><font face="verdana,sans-serif" size=1 color=#000000>DATA< /a ></td>

 

I would like to scrap everything except '153277', '/url/', and 'DATA' -- and I would prefer these to be in separate strings.

 

For example, $number = "153277", $url = "/url/", and $data = "DATA"

 

What would be the regex and PHP code to do this? I'm just completely lost.. :(

 

Forgive my newbieness.


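// Single quotes, so the double quotes inside the HTML need no escaping.
$string = '<td><font face="verdana,sans-serif" size=1> 153277</td>
<td> <a href="/url/"><font face="verdana,sans-serif" size=1 color=#000000>DATA< /a ></td>';
$regs = array();
// preg_match() instead of ereg(): ereg() never understood \d and was removed in PHP 7.
preg_match('#(\d+)</td>\s*<td>.*?href="([^"]*)".*?>([^<>]*)<\s*/\s*a#s', $string, $regs);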

$number = $regs[1];
$url = $regs[2];
$data = $regs[3];

 

 

Thanks! That's exactly what I needed, however, in my idiocy, I asked the question incorrectly.  :(

 

Here is the problem:

 

I have a full HTML document in my string, and I forgot to ask how to remove everything in the document except what is between the <td></td> tags.

To illustrate:

 

What I want to remove:

<html><head><title>Blah blah</title></head><body>Bunch of crap <table border=0 cellpadding=1 cellspacing=0 bgcolor=#FFFFFF width=270><tr><th>More crap</th></tr><tr><th>And some more</th></tr>

 

What I want to iterate through and store in variables:

<tr bgcolor=#333333>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> NUMBER</td>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> DATA</a></td>
</tr>
<tr bgcolor="#F4F4F4">
<td><font face="verdana,sans-serif" size=1> NUMBER2</td>
<td> <a href="/URL/"><font face="verdana,sans-serif" size=1 color=#000000>DATA2</a></td>
</tr>
<tr>
<td><font face="verdana,sans-serif" size=1> NUMBER3</td>
<td> <a href="/URL2/"><font face="verdana,sans-serif" size=1 color=#000000>DATA3</a></td>
</tr>

 

To be honest, I don't even need the URLs. (Notice the first result has no URL associated with it)

 

I want to scrap everything except the table rows (the meat and potatoes), then iterate through each row and store each NUMBER and DATA.

The number of rows will vary each time, and I need to associate the NUMBER with the DATA.

I don't know if an array could do what I need, but I can readily use SQL if necessary. In fact, that might be preferable.

 

I'm sorry if I'm not being clear; I've been up for quite a while and I can't seem to formulate an intelligent question.  :-\

If there is any more information I can provide to help you help me, don't hesitate to ask.  :P


Do you really need to scrap (replace) the unwanted data, or do you want to extract (match) the desired data? The latter is easier and--I think--what you want to do.
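
To make the difference concrete, here is a minimal sketch of both routes. The `$html` sample is made up and much smaller than a real page, and the variable names are only for illustration. Matching describes just the cells you want; replacing has to describe everything you don't want and still leaves markup behind.

```php
<?php
// Hypothetical sample input, not the real page.
$html = '<html><body>junk <table><tr><td>NUMBER</td><td>DATA</td></tr></table></body></html>';

// Extract (match): describe only what you want to keep.
preg_match_all('#<td[^>]*>(.*?)</td>#s', $html, $found);
$cells = $found[1]; // captured cell contents: NUMBER, DATA

// Scrap (replace): describe everything you want gone, piece by piece.
$stripped = preg_replace('#^.*?<table[^>]*>|</table>.*$#s', '', $html);
// $stripped still holds the <tr>/<td> markup and needs further parsing.

print_r($cells);
echo $stripped, "\n";
```

Either way you end up writing a pattern for the cells, which is why matching is the shorter path here.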

 

I suppose either would work, right? I'll definitely take the easier route (who wouldn't?).

The only important thing is that I can extract the NUMBER and DATA, and store them in a database.

I can do the SQL easily, but how would you recommend I extract the variables?

 

I am at your mercy.  :P


<pre>
<?php

$data = <<<DATA
<tr bgcolor=#333333>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> NUMBER</td>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> DATA</a></td>
</tr>
<tr bgcolor="#F4F4F4">
<td><font face="verdana,sans-serif" size=1> NUMBER2</td>
<td> <a href="/URL/"><font face="verdana,sans-serif" size=1 color=#000000>DATA2</a></td>
</tr>
<tr>
<td><font face="verdana,sans-serif" size=1> NUMBER3</td>
<td> <a href="/URL2/"><font face="verdana,sans-serif" size=1 color=#000000>DATA3</a></td>
</tr>	
DATA;

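// Capture the contents of every <td>...</td> cell; the s flag lets . span newlines.
preg_match_all('#<td[^>]*>(.*?)</td>#s', $data, $matches);
// Drop the full-match list so $matches[0] holds only the captured cell contents.
array_shift($matches);
foreach ($matches[0] as &$match) {
	// Strip the leftover <font>/<a> tags, then trim the surrounding whitespace.
	$match = trim(strip_tags($match));
}
unset($match); // break the reference left behind by the foreach
print_r($matches);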
?>
</pre>
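
Since each NUMBER has to stay attached to its DATA, one possible follow-up is to match a whole row at a time, so both cells of a row arrive together. This is only a sketch, assuming every row has exactly two `<td>` cells as in the sample; `$pattern`, `$rows` and `$pairs` are names made up for the example.

```php
<?php
// Sample rows copied from the question (the first row has no link).
$data = <<<DATA
<tr bgcolor=#333333>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> NUMBER</td>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8> DATA</a></td>
</tr>
<tr bgcolor="#F4F4F4">
<td><font face="verdana,sans-serif" size=1> NUMBER2</td>
<td> <a href="/URL/"><font face="verdana,sans-serif" size=1 color=#000000>DATA2</a></td>
</tr>
DATA;

// Two <td> cells inside one <tr>; PREG_SET_ORDER groups the captures per row.
$pattern = '#<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>#s';
preg_match_all($pattern, $data, $rows, PREG_SET_ORDER);

$pairs = array();
foreach ($rows as $row) {
	$number = trim(strip_tags($row[1])); // first cell of the row
	$text   = trim(strip_tags($row[2])); // second cell of the row
	$pairs[$number] = $text;             // NUMBER => DATA
}
print_r($pairs);
```

Each `$pairs` entry can then go into one INSERT. If the same NUMBER can show up more than once, collect `array($number, $text)` pairs instead of keying by number, otherwise later rows overwrite earlier ones.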


Thanks! That did it (for the most part) -- There are still a few quirks, but they have nothing to do with your code.

I'm now one step closer, and I'll keep fiddling with it before asking more questions.  :P

 

Thanks again, you're a lifesaver!
