Matching Certain tags

cooldude832 · November 13, 2007

I want to use preg_split to split a page on its <div>, <table>, <tr> and <td> tags, so I need a pattern that will match <div> or <table> or <tr> or <td>. Any ideas? It also has to handle the fact that a tag could have styling on it, or a class, etc. I think I need something like <div*> but I don't know the rest of it.

Lumio · November 13, 2007

<?php
   $code = "Hello\n<div align=\"center\">foo=bar</div>";
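   // The tr and td alternatives use the lazy .*? like the others, so a match stops at the first ">"
   $matches = preg_split('/(\<.*?div.*?\>|\<.*?table.*?\>|\<.*?tr.*?\>|\<.*?td.*?\>)/', $code, -1, PREG_SPLIT_DELIM_CAPTURE);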
   print_r($matches);
?>

A thanks would be nice

cooldude832 · November 13, 2007

I'll try it and get back to you.

effigy · November 13, 2007

/(<(?:div|t(?:able|[rd]))[^>]*>)/
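
For reference, a minimal sketch of dropping that pattern into preg_split. The sample string is borrowed from the earlier reply, and the variable names $pattern and $parts are only for illustration.

```php
<?php
// Sample input reused from the earlier post; $pattern is the expression above.
$code    = "Hello\n<div align=\"center\">foo=bar</div>";
$pattern = '/(<(?:div|t(?:able|[rd]))[^>]*>)/';

// PREG_SPLIT_DELIM_CAPTURE keeps the matched opening tags in the result.
$parts = preg_split($pattern, $code, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
```

As written the pattern is case-sensitive and only matches opening tags. Adding the i modifier, and a \b after the tag-name group so longer names that merely start with "tr" or "td" are skipped, might be worth considering depending on the pages involved.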

cooldude832 · November 13, 2007

That is working great, but now my issue is that I somehow have to clear out all the CSS and all the <script> tags, as these are carrying over with junk data. Any ideas?

effigy · November 13, 2007

Loop through the array and:

1. Remove <script...>...</script>.

2. Remove style="...".
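
The two steps above could be sketched with preg_replace roughly as follows. The sample string and the exact patterns are assumptions for illustration, not a tested solution for arbitrary pages.

```php
<?php
// Hypothetical sample piece; in practice this would be one entry of the split array.
$piece = '<div style="color:red" class="box"><script type="text/javascript">alert(1);</script>Text</div>';

// 1. Remove <script ...>...</script> blocks ("i" = case-insensitive, "s" = dot matches newlines).
$piece = preg_replace('%<script\b[^>]*>.*?</script\s*>%is', '', $piece);

// 2. Remove style="..." or style='...' attributes together with the whitespace before them.
$piece = preg_replace('/\s+style\s*=\s*(?:"[^"]*"|\'[^\']*\')/i', '', $piece);

echo $piece, "\n";
```

Running the script removal on the whole page before splitting may be safer than doing it per array entry, since a script block that happens to contain one of the split tags would otherwise be cut in two.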

cooldude832 · November 13, 2007

Well, my first issue was that I wanted to remove everything before the body tag. I did it using explode, but not all body tags are lower case, and some had other issues, so again it is a regex problem. I tried "\<body*>\"; no good. If you've got an idea for doing a preg_split at that point, I'd love to see it.

That clears up some of it, but <script> tags inside the body also need to be removed. My goal is to strip a page of everything but container elements (<div>, <table>, <tr>, <td>), which that pattern is doing for me, but then also kill all the script/CSS tags, as those are special cases, so I guess I need to find a replacement for

effigy · November 13, 2007

What about something like this?

<pre>
<?php
   $data = <<<DATA
<html>
	<head>
		<title>Title</title>
	</head>
	<body>
		<font>Font Tag</font>
		<div>Div Content</div>
		<div id="1">More Div Content</div>
		<b>Bold</b>
		<table>
			<tr>
				<td>A Cell</td>
			</tr>
		</table>
		<hr>
	</body>
</html>
DATA;
### Split on the begin/end tags of what is desired
### and pull some content along.
$matches = preg_split(
	'%(</?(?:div|t(?:able|[rd]))[^>]*>[^<]*)%',
	$data,
	-1,
	PREG_SPLIT_DELIM_CAPTURE
);
### For each match...
$num_matches = count($matches);
for ($i = 0; $i < $num_matches; $i++) {
	### Strip unwanted tags.
	$matches[$i] = strip_tags($matches[$i], '<div><table><tr><td>');
	### If the entry doesn't start with a "<" (tag) it wasn't
	### included in our split; thus, not desired.
	if (strpos($matches[$i], '<') !== 0) {
		unset($matches[$i]);
	}
	### Otherwise, escape it for viewing purposes.
	else {
		$matches[$i] = htmlspecialchars($matches[$i]);
	}
}
### Display.
print_r($matches);
?>
</pre>
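
On the earlier question of cutting everything before the body tag regardless of case, one possible approach is a case-insensitive preg_split with a limit of 2. In the attempted <body*> the * applies only to the preceding "y", so it cannot match a tag with attributes, and nothing makes it case-insensitive. The page fragment and variable names below are made up for illustration.

```php
<?php
// Hypothetical page fragment with an upper-case BODY tag carrying attributes.
$page = "<html><head><title>T</title></head>\n"
      . "<BODY class=\"main\" onload=\"init()\">\n"
      . "<div>Content</div>\n</BODY></html>";

// "i" makes the match case-insensitive; \b keeps it from matching other tag
// names that merely begin with "body"; a limit of 2 splits at the first match only.
$halves = preg_split('/<body\b[^>]*>/i', $page, 2);

// If no body tag was found, preg_split returns the whole page as one element.
$body = isset($halves[1]) ? $halves[1] : $halves[0];
echo $body;
```

The remainder in $body can then be fed to the split and strip_tags loop above.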

cooldude832 · November 13, 2007

That is working. Now I just want to build some sort of multi-dimensional array of the data based on tag depth, which I can figure out. Note I subbed back in the < > for the &lt; and &gt; as it's easier to type, but I did it after your code ran. I'll PM you with the final result if you're interested in it.
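
One way the depth tracking might start is sketched below. It is only a rough idea, assuming well-formed input that contains nothing but the four container tags, and it produces a flat list annotated with depth rather than the nested array itself. The names $tokens, $rows and $depth are hypothetical.

```php
<?php
// Hypothetical input: container tags only, as left by the stripping step.
$html = '<div><table><tr><td>A Cell</td></tr></table></div><div>Second</div>';

// Split into tags and text, keeping the tags and dropping empty pieces.
$tokens = preg_split('%(</?(?:div|table|tr|td)\b[^>]*>)%i', $html, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$depth = 0;
$rows  = array();
foreach ($tokens as $token) {
    if (strpos($token, '</') === 0) {
        // Closing tag: step back out one level first, then record it.
        $depth = max(0, $depth - 1);
        $rows[] = array('depth' => $depth, 'token' => $token);
    } elseif ($token[0] === '<') {
        // Opening tag: record it at the current level, then step in.
        $rows[] = array('depth' => $depth, 'token' => $token);
        $depth++;
    } else {
        // Text content sits one level inside the most recently opened tag.
        $rows[] = array('depth' => $depth, 'token' => $token);
    }
}

// Print each token indented by its depth.
foreach ($rows as $row) {
    echo str_repeat('  ', $row['depth']), $row['token'], "\n";
}
```

From a list like this, folding entries into nested arrays by comparing each depth with the previous one should be fairly mechanical.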

effigy · November 13, 2007

No need to PM, post it here; others may be interested in the solution.
