
Strip Tags INSIDE Another (allowed) Tag


bschultz
Solved by bschultz


I need to strip some HTML tags out of an uploaded string of code.  I need to keep the <td> tags...but some of the code being uploaded includes <p> tags INSIDE the <td> tag.

 

How would I go about stripping ALL other tags inside these allowed tags: <table> <tr> <td>?


Do not use strip_tags(). This function mangles the user input based on a very primitive mechanism. If you're lucky, it will only remove the parts you want to remove. But chances are it will cut off the input somewhere, either because the markup is invalid, or because the function is simply too stupid to understand the markup.

 

Why strip_tags() is still around and still gets recommended is beyond me. Do yourself a favor and use a proper filter like HTML Purifier.


I copy and paste the roster from the school's website into a WYSIWYG text box on my site.  That entry runs through HTML Purifier to clean up formatting other than the table, tr and td tags...and is then put into an array.  The array is then split on <td> to find each individual roster entry (number, name, height, weight etc.).

 

I was using strip_tags to get rid of everything other than table tr and td...but a new school had embedded p tags inside the td tag...which killed the import.  HTML Purifier is cleaning up the entered code...but I don't know where the extra whitespace is coming from.
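
One plausible source of the stray whitespace, offered as a guess since the pasted markup is not shown: when a <p> sits inside a <td>, the line breaks and indentation around it stay in the cell's text after the tags are gone, and WYSIWYG editors often add non-breaking spaces. A minimal sketch using PHP's built-in DOM extension; the sample markup and the helper name cleanCellText() are hypothetical:

```php
<?php
// Hypothetical sample: a cell whose content is wrapped in an indented <p>,
// with a non-breaking space entity of the kind WYSIWYG editors tend to add.
$html = "<table><tr><td>\n    <p> John&nbsp;Smith </p>\n  </td></tr></table>";

// Hypothetical helper: collapse every run of whitespace (including
// non-breaking spaces) to a single space, then trim both ends.
function cleanCellText($text)
{
    // \x{00A0} is the non-breaking space; the u modifier treats input as UTF-8
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    return trim($collapsed);
}

$doc = new DOMDocument();
// The @ suppresses warnings about sloppy real-world markup.
@$doc->loadHTML($html);

$cell  = $doc->getElementsByTagName('td')->item(0);
$raw   = $cell->textContent;    // text only, but surrounding whitespace survives
$clean = cleanCellText($raw);   // whitespace collapsed and trimmed

var_dump($raw === $clean);      // bool(false): the raw text carries extra whitespace
echo $clean, "\n";
```

If this is the cause, a plain trim() would not be enough on its own, because by default it does not remove non-breaking spaces.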

 

Still haven't had a chance to play with typecast, but will soon.


No two schools put things in the same order or with the same div or column names.  The method I am using lets the end user select the order of the roster they are loading at that time based on which column is which.

 

E.g.: column 1 is number, column 2 is name, column 3 is a picture of the player (which we don't use and ignore), column 4 is their height.

 

Next week, it might be column 1 name, column 2 number...etc.

 

One week the school may only provide a Word document in table format...

 

Like I said, there are no set formats of what is being copied and pasted in.  As long as the info being gathered is in table form (which is the case 95+ percent of the time), my system can adapt. 

 

If you can show me how DOM or simple_html_dom would work better...I'd love to hear it.
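
For pasted table markup, PHP's built-in DOM extension may be enough without any third-party library. The sketch below is only an illustration under assumptions: the sample markup, the $columnOrder array (standing in for the order the end user selects), and the function name parseRosterTable() are all hypothetical. Because textContent returns only the text of a cell, nested tags such as <p> need no separate stripping.

```php
<?php
// Hypothetical helper: turn pasted table markup into one associative
// array per row, using the column order chosen by the end user.
function parseRosterTable($html, array $columnOrder)
{
    $doc = new DOMDocument();
    // @ suppresses warnings about sloppy real-world markup
    @$doc->loadHTML($html);

    $players = array();
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $cells = array();
        foreach ($row->getElementsByTagName('td') as $cell) {
            // textContent drops nested tags; collapse whitespace and trim
            $cells[] = trim(preg_replace('/\s+/u', ' ', $cell->textContent));
        }

        $player = array();
        foreach ($columnOrder as $index => $field) {
            // null marks a column the user chose to ignore (e.g. a picture)
            if ($field !== null && isset($cells[$index])) {
                $player[$field] = $cells[$index];
            }
        }
        // rows without any <td> cells (e.g. header rows using <th>) are skipped
        if ($player) {
            $players[] = $player;
        }
    }
    return $players;
}

// Hypothetical pasted markup with <p> tags nested inside the <td> cells
$pasted = '<table>
  <tr><td><p>12</p></td><td><p> Sam Jones </p></td><td><img src="x.jpg"></td><td>6-1</td></tr>
  <tr><td><p>34</p></td><td><p> Lee Park </p></td><td><img src="y.jpg"></td><td>5-11</td></tr>
</table>';

// Column 1 is number, column 2 is name, column 3 is ignored, column 4 is height
$columnOrder = array('number', 'name', null, 'height');

$players = parseRosterTable($pasted, $columnOrder);
print_r($players);
```

When the order changes the next week, only $columnOrder needs to change. Note that this sketch does not handle tables nested inside cells, since getElementsByTagName() also returns the inner cells.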


A quick example (requires simple_html_dom.php found here)

<?php

// include simple_html_dom
require_once 'simple_html_dom.php';

// function scrapes data from $url and returns the data defined in $elements
function findPlayerInfoByElements($url, $elements = array())
{
	// load page into simpleHtmlDom; file_get_html() returns false on failure
	$html = file_get_html($url);
	if (!$html) {
		return array();
	}

	// for each alias => selector pair, collect that column's values
	$data = array();
	foreach($elements as $alias => $element)
	{
		// find all data by element (an empty array if nothing matches)
		$columnDataFound = $html->find($element);

		// store the value of each element as plain text - removes any HTML.
		// Keyed by alias so a selector with no matches cannot shift the other columns
		$data[$alias] = array_map(function($v) { return trim($v->plaintext); }, $columnDataFound);
	}

	// format the players array
	$players = array();
	// the longest column determines the number of rows
	$rows = $data ? max(array_map('count', $data)) : 0;
	// looping over the data, add each player's info into separate associative arrays
	for ($i = 0; $i < $rows; $i++) {
		$info = array();
		foreach($data as $alias => $column)
			$info[$alias] = isset($column[$i]) ? $column[$i] : null;

		$players[] = $info;
	}

	// return the players info
	return $players;
}

// url to scrape roster info from
$roster_url = 'http://www.seahawks.com/team/roster.html';

/* 
Provide the findPlayerInfoByElements() function with
 - the url to scrape the roster table from
 - an array of elements to get the data required, eg jersey no, player name, height and weight
*/
$elements = array(
	'no'     => 'td.col-jersey', // gets the jersey numbers from the <td> element with the class of col-jersey
	'name' 	 => 'td.col-name',   // gets the players name from the <td> element with the class of col-name
	'height' => 'td.col-height', // their height from the <td> element with the class of col-height
	'weight' => 'td.col-weight', // their weight from the <td> element with the class of col-weight
);

// returns each players info in an associative array
$players = findPlayerInfoByElements($roster_url, $elements);

printf('<pre>%s</pre>', print_r($players, true));

Change $roster_url to your school's roster page.

Modify the $elements array with the selectors for the HTML elements that hold the data you need.

 

$players will contain info for each player in the roster.


  • Solution

Typecast worked for the int values...but all the other fields had whitespace too.  I ran trim in the foreach over the array...and it didn't work.  I ran trim on each field in the insert into the MySQL table...and it worked.
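
A possible reason the trim() in the foreach had no effect, assuming the loop looked roughly like the first one below (the original loop is not shown in the thread): trim() returns a new string rather than changing its argument, and foreach works on a copy of each value unless the value is taken by reference. The $row data here is hypothetical.

```php
<?php
// Hypothetical roster row with stray whitespace around each value
$row = array('number' => " 12\n", 'name' => "  Sam Jones ", 'height' => "6-1\t");

// No effect: the trimmed string is discarded, and $value is only a copy
$unchanged = $row;
foreach ($unchanged as $value) {
    trim($value);
}

// Works: apply trim() to every value and keep the returned array
$trimmed = array_map('trim', $row);

// Also works: iterate by reference and assign the result back
$byReference = $row;
foreach ($byReference as &$value) {
    $value = trim($value);
}
unset($value); // break the reference so later code cannot overwrite the last element

var_dump($unchanged === $row);  // bool(true): nothing was trimmed
print_r($trimmed);
```

Trimming at insert time works for the same reason array_map() does: the returned, trimmed value is the one that actually gets used.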

 

Still don't know where the whitespace came from, but I got rid of it.

 

Thanks!

