string search and replace

gevans · December 22, 2008

Hey guys, I'm currently using this function;

    function input_rte($input, $title = "", $edit = FALSE){
        if(get_magic_quotes_gpc()) $input = stripslashes($input);
        $input = mysql_real_escape_string($input);
        $input = str_replace($this->old_string,'',$input);
        if(strpos($input,'<p>') !== FALSE && strpos($input,'</p>') !== FALSE) $input = str_replace($this->ie_fix1,$this->ie_fix2,$input);
        while(substr($input,-4) == '<br>' || substr($input,-5) == '<br/>' || substr($input,-6) == '<br />'){
            if(substr($input,-4) == '<br>') $input = substr($input,0,-4);
            elseif(substr($input,-5) == '<br/>') $input = substr($input,0,-5);
            elseif(substr($input,-6) == '<br />') $input = substr($input,0,-6);
        }
        $input = str_replace('<br><br><br><br>','<br><br>',$input);
        $input = str_replace('<br>','<br />',$input);
        if(strpos($input, '<img') !== FALSE && strpos($input, '<img alt') === FALSE && strpos($input, '<img  alt') === FALSE) $input = str_replace("<img","<img alt=\"$title - content image\"",$input);
        if(strpos($input, '.jpg"><img') !== FALSE) $input = str_replace(".jpg\"><img",".jpg\"><img class=\"second\"",$input);
        $input = trim($input);
        $input = urlencode($input);
        $input = ($input == "") ? NULL : $input;
        return $input; 
    }

to sort some POST data from a rich text editor.

The next addition to the function needs to search for a link to a PDF.

The html would look like this;

<a href="http://www.mydomain.com/dir/thefile.pdf">Read More</a>

What I'm trying to do (and can't manage at the moment) is find that bit of code and check whether the link has a .pdf extension. If it does, the HTML needs to be replaced with the following;

<div class="pdf">
<a href="the link in here" target="_blank" title="link text in here"><img class="left" src="images/pdf_download.png" alt="Download PDF" width="64" height="74" /></a>
<span class="title">title in here</span>
<span class="info">download pdf</span>
<a href="the link in here" target="_blank" title="link text in here" class="link">DOWNLOAD</a>
</div><div class="pdf-bot2"></div>

I'm not entirely sure of the best way to do this. I was going to use a regular expression and preg_replace. Has anyone got any ideas about what else I could do?

MadTechie · December 22, 2008

Do you mean something like this?

<?php
$html = 'this is some stuff <a href="http://www.mydomain.com/dir/thefile.pdf">Read More</a> for update dating the <a href="http://youdomain.com/another.pdf">Other Stuff</a>html';
$html = preg_replace('%<a href=(["\'])(.*?\.pdf)\1>(.*)</a>%sim', "<div class=\"pdf\">\r\n<a href=\"the link in here\" target=\"_blank\" title=\"link text in here\"><img class=\"left\" src=\"images/pdf_download.png\" alt=\"Download PDF\" width=\"64\" height=\"74\" /></a>\r\n<span class=\"title\">title in here</span>\r\n<span class=\"info\">download pdf</span>\r\n<a href=\"\2\" target=\"_blank\" title=\"\3\" class=\"link\">DOWNLOAD</a>\r\n</div><div class=\"pdf-bot2\"></div>", $html );

echo $html;

?>

gevans · December 23, 2008

That looks perfect. I haven't had a chance to test it yet, but I'll post an update in an hour or so.

Cheers

gevans · December 23, 2008

It's nearly there, it outputs the following

<div class="pdf">
<a href="the link in here" target="_blank" title="link text in here"><img class="left" src="images/pdf_download.png" alt="Download PDF" width="64" height="74" /></a>
<span class="title">title in here</span>
<span class="info">download pdf</span>
<a href="" target="_blank" title="" class="link">DOWNLOAD</a>

</div><div class="pdf-bot2"></div>

The first href needs to be the actual link (not just placeholder text), and the href and title of the last link contain strange characters in place of the text captured by the regular expression.

MadTechie · December 23, 2008

Here's a quick update:

<?php
$html = 'this is some stuff <a href="http://www.mydomain.com/dir/thefile.pdf">Read More</a> for update dating the <a href="http://youdomain.com/another.pdf">Other Stuff</a>html';
$html = preg_replace('%<a href=(["\'])(.*?\.pdf)\1>(.*)</a>%sim', "<div class=\"pdf\">\r\n<a href=\"\2\" target=\"_blank\" title=\"\3\"><img class=\"left\" src=\"images/pdf_download.png\" alt=\"Download PDF\" width=\"64\" height=\"74\" /></a>\r\n<span class=\"title\">\3</span>\r\n<span class=\"info\">download pdf</span>\r\n<a href=\"\2\" target=\"_blank\" title=\"\3\" class=\"link\">DOWNLOAD</a>\r\n</div><div class=\"pdf-bot2\"></div>", $html );

echo $html;

?>

What input are you using, and what do you expect to see?

gevans · December 23, 2008

I've got it sorted. The backreferences to the captured groups need a double backslash inside the double-quoted replacement string, so;

\\2

rather than

\2

Thanks for all your help.
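
For anyone hitting the same problem, here is a minimal sketch of why the single backslash failed. In a double-quoted PHP string, "\2" is parsed as an octal escape before preg_replace() ever sees it, which is where the strange characters came from. The pattern and input below are simplified stand-ins for illustration, not the ones from the posts above.

```php
<?php
// Simplified, made-up pattern and input just to show the escaping behaviour.
$pattern = '%<a href="(.*?)">(.*?)</a>%s';
$input   = '<a href="file.pdf">Read More</a>';

// Double-quoted "\2" is an octal escape: PHP turns it into chr(2),
// so preg_replace() receives a control character, not a backreference.
$broken = preg_replace($pattern, "[\2]", $input);

// Escaping the backslash passes a literal \2 through to preg_replace().
$escaped = preg_replace($pattern, "[\\2]", $input);

// Single quotes leave the backslash alone, so '\2' works as written.
$single = preg_replace($pattern, '[\2]', $input);

// The $n form avoids backslashes altogether.
$dollar = preg_replace($pattern, '[$2]', $input);

var_dump($broken === '[' . chr(2) . ']'); // bool(true)
var_dump($escaped, $single, $dollar);     // each is "[Read More]"
```

Writing the replacement in single quotes or with $2 is probably the least error-prone option, since neither depends on how double-quoted strings handle backslashes.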

effigy · December 23, 2008

<pre>
<?php
$html = 'this is some stuff <a href="http://www.mydomain.com/dir/thefile.pdf">Read More</a> for updating the <a href="http://youdomain.com/another.pdf">Other Stuff</a>html';

$replace = <<<REPLACE
<div class="pdf">
<a href="$2" target="_blank" title="$3">
	<img class="left" src="images/pdf_download.png" alt="Download PDF" width="64" height="74" />
</a>
<span class="title">$3</span>
<span class="info">download pdf</span>
<a href="$2" target="_blank" title="$3" class="link">DOWNLOAD</a>
</div>
<div class="pdf-bot2"></div>
REPLACE;

$html = preg_replace(
	'%<a href=([\'"])?((?(1).+?|[^\s>]+)\.pdf)(?(1)\1)>(.*?)</a>%si',
	$replace,
	$html
);

echo htmlspecialchars($html);
?>
</pre>

gevans · December 23, 2008

I tried it with more complex input and it didn't work as expected;

input html

<img alt="Pupil Launch - content image" src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/ks3_04.jpg">
<img src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/cbcv_03.jpg">
<div class="clearfix"></div>
<br />
<strong><span style="text-decoration: underline;">What the Challenge is all about</span></strong>
<br /><br />
The focus is on the website but the challenge involves a wide range of skills - research, presentation, innovative design, graphics, and project management skills are equally important. The Challenge can be used effectively to bring a work related learning element to many aspects of the curriculum including English (presentations); ICT; and Business Studies. It will also help to develop students' 'enterprise skills'.
<br /><br />
The <a href="http://www.portsmouthebp.co.uk">Education Business Partnership</a>
<br /><br />
<a href="http://thinking.uk.com/projects/ebp_cms/uploads/pdfs/autumn_07_1229439562_pdf.pdf">Website Challenge Entry Form</a>

output html

<img  alt="Pupil Launch - content image" src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/ks3_04.jpg"><img  src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/cbcv_03.jpg"><div class="clearfix"></div><br /><strong><span style="text-decoration: underline;">What the Challenge is all about</span></strong><br /><br />The focus is on the website but the challenge involves a wide range of skills - research, presentation, innovative design, graphics, and project management skills are equally important. The Challenge can be used effectively to bring a work related learning element to many aspects of the curriculum including English (presentations); ICT; and Business Studies. It will also help to develop students' 'enterprise skills'.<br /><br />The <div class="pdf">

   <a href="http://www.portsmouthebp.co.uk">Education Business Partnership</a><br /><br /><a href="http://thinking.uk.com/projects/ebp_cms/uploads/pdfs/autumn_07_1229439562_pdf.pdf" target="_blank" title="Website Challenge Entry Form">
      <img class="left" src="images/pdf_download.png" alt="Download PDF" width="64" height="74" />
   </a>
   <span class="title">Website Challenge Entry Form</span>
   <span class="info">download pdf</span>
   <a href="http://www.portsmouthebp.co.uk">Education Business Partnership</a><br /><br /><a href="http://thinking.uk.com/projects/ebp_cms/uploads/pdfs/autumn_07_1229439562_pdf.pdf" target="_blank" title="Website Challenge Entry Form" class="link">DOWNLOAD</a>

</div>
<div class="pdf-bot2"></div>

expected output

<img  alt="Pupil Launch - content image" src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/ks3_04.jpg"><img  src="http://thinking.uk.com/projects/ebp_cms/images/uploads/small/cbcv_03.jpg"><div class="clearfix"></div><br /><strong><span style="text-decoration: underline;">What the Challenge is all about</span></strong><br /><br />The focus is on the website but the challenge involves a wide range of skills - research, presentation, innovative design, graphics, and project management skills are equally important. The Challenge can be used effectively to bring a work related learning element to many aspects of the curriculum including English (presentations); ICT; and Business Studies. It will also help to develop students' 'enterprise skills'.<br /><br />

The <a href="http://www.portsmouthebp.co.uk">Education Business Partnership</a><br /><br />

<div class="pdf">

   <a href="http://thinking.uk.com/projects/ebp_cms/uploads/pdfs/autumn_07_1229439562_pdf.pdf" target="_blank" title="Website Challenge Entry Form">
      <img class="left" src="images/pdf_download.png" alt="Download PDF" width="64" height="74" />
   </a>
   <span class="title">Website Challenge Entry Form</span>
   <span class="info">download pdf</span>
   <a href="http://www.portsmouthebp.co.uk">Education Business Partnership</a><br /><br /><a href="http://thinking.uk.com/projects/ebp_cms/uploads/pdfs/autumn_07_1229439562_pdf.pdf" target="_blank" title="Website Challenge Entry Form" class="link">DOWNLOAD</a>

</div>
<div class="pdf-bot2"></div>

effigy · December 23, 2008

%<a href=([\'"])?((?:(?!\1).)+\.pdf)(?(1)\1)>(.*?)</a>%si
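
A rough sketch of what changed, as far as I can tell: in the earlier pattern the lazy ".+?" is free to run past the closing quote of a non-PDF link until it finds ".pdf" further along, whereas the revised pattern refuses to consume the quote character it opened with. The two-link input below is a shortened, made-up version of the failing HTML.

```php
<?php
// Shortened, hypothetical input: only the second link points at a PDF.
$html = 'The <a href="http://www.portsmouthebp.co.uk">Partnership</a> and '
      . '<a href="http://example.com/form.pdf">Entry Form</a>';

// Earlier pattern: ".+?" can cross the closing quote of the first href.
$loose = '%<a href=([\'"])?((?(1).+?|[^\s>]+)\.pdf)(?(1)\1)>(.*?)</a>%si';

// Revised pattern: a character is consumed only if it is not the opening quote.
$strict = '%<a href=([\'"])?((?:(?!\1).)+\.pdf)(?(1)\1)>(.*?)</a>%si';

preg_match($loose, $html, $m1);
preg_match($strict, $html, $m2);

echo $m1[2], "\n"; // begins in the first link and runs through to form.pdf
echo $m2[2], "\n"; // http://example.com/form.pdf
```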

gevans · December 23, 2008

That worked perfectly. Any chance of a brief breakdown of what the regular expression is doing?

Actually, I know what it is doing, but an explanation of what part does what...

effigy · December 23, 2008

Actually, it needs another tweak in case the attributes are not quoted:

%<a href=([\'"])?((?:(?!\1)[^>\s])+\.pdf)(?(1)\1)>(.*?)</a>%si

The first non-literal part of the regex looks for a single or double quote, which may not exist at all. Afterwards, it captures one character (that is not whitespace or ">") at a time, but only if it does not encounter the (optional) quote that it began with. In other words, if a single quote was found, match all of its contents up to the ending single quote; the same goes if a double quote was matched. If nothing was found, it stops at the end of the tag. It then backtracks to make sure the URL ends with ".pdf", matches the ending quote if one was found, the end of the tag, the rest of the content up to "</a>", and then "</a>" itself.

Keep in mind that this regex only works if no other attributes are present and the formatting is exact.
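
To make the behaviour described above concrete, here is a small sketch that runs the pattern against a few link shapes. The example.com URLs and the labels are made up for illustration; the last sample shows the limitation just mentioned, where an extra attribute before href prevents a match.

```php
<?php
$pattern = '%<a href=([\'"])?((?:(?!\1)[^>\s])+\.pdf)(?(1)\1)>(.*?)</a>%si';

// Hypothetical sample links covering each quoting style.
$samples = array(
    'double'   => '<a href="http://example.com/a.pdf">A</a>',
    'single'   => "<a href='http://example.com/b.pdf'>B</a>",
    'unquoted' => '<a href=http://example.com/c.pdf>C</a>',
    'not pdf'  => '<a href="http://example.com/page.html">Page</a>',
    'extra'    => '<a class="x" href="http://example.com/d.pdf">D</a>',
);

$results = array();
foreach ($samples as $label => $html) {
    // Store array(url, link text) on a match, or null when nothing matches.
    $results[$label] = preg_match($pattern, $html, $m)
        ? array($m[2], $m[3])
        : null;
}

foreach ($results as $label => $hit) {
    echo $label, ': ', $hit ? "url={$hit[0]} text={$hit[1]}" : 'no match', "\n";
}
```

The first three samples should report their URL and link text, while the last two should report no match.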

Here's a technical breakdown:

NODE                     EXPLANATION
----------------------------------------------------------------------
<a href=                 '<a href='
----------------------------------------------------------------------
(                        group and capture to \1 (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
  ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
)?                       end of \1 (NOTE: because you're using a
                         quantifier on this capture, only the LAST
                         repetition of the captured pattern will be
                         stored in \1)
----------------------------------------------------------------------
(                        group and capture to \2:
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more
                           times (matching the most amount
                           possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      \1                       what was matched by capture \1
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    [^>\s]                   any character except: '>', whitespace
                             (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  \.                       '.'
----------------------------------------------------------------------
  pdf                      'pdf'
----------------------------------------------------------------------
)                        end of \2
----------------------------------------------------------------------
(?(1)                    if back-reference \1 matched, then:
----------------------------------------------------------------------
  \1                       what was matched by capture \1
----------------------------------------------------------------------
 |                        else:
----------------------------------------------------------------------
                           succeed
----------------------------------------------------------------------
)                        end of conditional on \1
----------------------------------------------------------------------
>                        '>'
----------------------------------------------------------------------
(                        group and capture to \3:
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
)                        end of \3
----------------------------------------------------------------------
</a>                     '</a>'
----------------------------------------------------------------------
