[SOLVED] Using preg_quote in preg_replace crashes script.


dmarquard

I'm trying to write a script that temporarily renames all current links in the page (don't ask).  I'm using preg_quote within preg_replace to find and replace all instances because, for whatever reason, str_ireplace (which I'd honestly rather use) replaced only half of the instances in the code before giving up.  I think my code looks fine, but the page won't even load, so I guess I screwed up somewhere.  Let me know if you spot a flaw.  Alternatively, you can tell me why you think str_ireplace replaced only half of the instances.  :)

 

                    $url_source = preg_replace('/' . preg_quote('href="#') . '/', 'preg_replace_url_anchor', $url_source); // Encode anchors.
                    $url_source = preg_replace('/' . preg_quote('href=""') . '/', 'preg_replace_url_null', $url_source); // Encode null links.
                    $url_source = preg_replace('/' . preg_quote('href="http://') . '/', 'preg_replace_url_http', $url_source); // Encode existing HTTP links.
                    $url_source = preg_replace('/' . preg_quote('href="https://') . '/', 'preg_replace_url_https', $url_source); // Encode existing HTTPS links.
                    $url_source = preg_replace('/' . preg_quote('href="ftp://') . '/', 'preg_replace_url_ftp', $url_source); // Encode existing FTP links.

 

Thanks!

It looks OK to me. There's no need to use preg_quote when entering the pattern into your code--this is mainly used to sanitize user input and/or other variables that may change.
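
One thing that may be worth ruling out with patterns like these: preg_quote() only escapes the pattern delimiter when it is passed as the second argument, so the slashes in 'href="http://' would end a '/'-delimited pattern early. Below is a minimal sketch of the difference; the sample `$html` string and the `MASKED_HTTP` placeholder are made up for illustration.

```php
<?php
// preg_quote() escapes regex metacharacters, but it escapes the delimiter
// only when that delimiter is supplied as the second argument.
$needle = 'href="http://';

$unsafe = '/' . preg_quote($needle) . '/';      // slashes stay unescaped
$safe   = '/' . preg_quote($needle, '/') . '/'; // slashes become \/

$html = '<a href="http://example.com/">x</a>'; // hypothetical sample input

// The unsafe pattern ends at the first bare '/', so PHP reports
// "Unknown modifier '/'" and preg_replace() returns NULL.
// The @ only hides that warning for this demonstration.
$broken = @preg_replace($unsafe, 'MASKED_HTTP', $html);
$fixed  = preg_replace($safe, 'MASKED_HTTP', $html);

var_dump($broken);  // NULL
echo $fixed, "\n";  // <a MASKED_HTTPexample.com/">x</a>
```

Assigning that NULL back to the variable holding the page source would wipe it out, which could explain a page that no longer renders.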

 

What was leftover that str_ireplace missed?

 

I'm using it because it's cleaner than escaping EVERY special character by hand (VERY unclean).  str_replace would just crap out halfway through the replacements: it would stop, and the remaining links would be left unchanged.

 

I have yet to find a solution to this...

If you change the delimiter none of the characters need to be escaped.
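
As a rough illustration of the delimiter idea: any character that is not alphanumeric, a backslash, or whitespace can delimit a PCRE pattern in PHP, so picking one that does not occur in the search text means the slashes need no escaping. The sample `$html` string and the `MASKED_HTTPS` placeholder are assumptions for this sketch.

```php
<?php
$html = '<a href="https://example.com/">secure</a>'; // hypothetical sample input

// '~' does not appear in the pattern, so the slashes are plain literals.
// The trailing 'i' makes the match case-insensitive, like str_ireplace().
$masked = preg_replace('~href="https://~i', 'MASKED_HTTPS', $html);

echo $masked, "\n"; // <a MASKED_HTTPSexample.com/">secure</a>
```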

 

Please provide some code and data that shows where str_ireplace succeeded and failed.

 

I'm not sure how to change the delimiter, and I can't seem to find any sort of documentation referencing it.

 

One user on php.net commented that str_replace and str_ireplace seem to stop replacing at approximately 16K (with a string 35K or larger).  I'd prefer to stick with str_ireplace, but this is just getting ridiculous.

 

I can't find my original str_ireplace code, but this was what I threw together just now.  It's not replacing at all...

 

          // Grab the submission page's raw, unformatted source code.
          $url_source = file_get_contents($url_submission);

          // Parse the URL into a base URL (WITHOUT the trailing slash), just in case the webpage uses relative URL references.
          $url_submission_scheme = parse_url($url_submission, PHP_URL_SCHEME);
          $url_submission_host = parse_url($url_submission, PHP_URL_HOST);
          $url_submission_base = $url_submission_scheme . '://' . $url_submission_host;

               // Only the links on pages which DON'T define a base URL should be modified.
               if (!(preg_match('/' . preg_quote('<base href="') . '/', $url_source)))
               {
                    // Modifier to <a href="/path/to/dir.php">.
                    $url_source = str_ireplace(preg_quote('href="/'), 'href="' . $url_submission_base . '/', $url_source);

                    // OK, now comes the weird part...we actually have to exclude exceptions by temporarily masking them during the conversion.
                    $url_source = str_ireplace(preg_quote('href="#'), 'str_ireplace_url_anchor', $url_source); // Encode anchors.
                    $url_source = str_ireplace(preg_quote('href=""'), 'str_ireplace_url_null', $url_source); // Encode null links.
                    $url_source = str_ireplace(preg_quote('href="http://'), 'str_ireplace_url_http', $url_source); // Encode existing HTTP links.
                    $url_source = str_ireplace(preg_quote('href="https://'), 'str_ireplace_url_https', $url_source); // Encode existing HTTPS links.
                    $url_source = str_ireplace(preg_quote('href="ftp://'), 'str_ireplace_url_ftp', $url_source); // Encode existing FTP links.

                    // Mask all known program protocol links.
                    $url_source = str_ireplace(preg_quote('href="javascript:'), 'str_ireplace_url_js', $url_source); // Encode javascript links.
                    $url_source = str_ireplace(preg_quote('href="mailto:'), 'str_ireplace_url_mailto', $url_source); // Encode email links.
                    $url_source = str_ireplace(preg_quote('href="aim:'), 'str_ireplace_url_aim', $url_source); // Encode AIM links.
                    $url_source = str_ireplace(preg_quote('href="callto:'), 'str_ireplace_url_callto', $url_source); // Encode Skype links.

                    // Now that we've temporarily masked all link exceptions, we can rename ALL remaining links.
                    $url_source = str_ireplace(preg_quote('href="'), 'href="' . $url_submission_base . '/', $url_source);

                    // Time to unmask our temporarily renamed links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_anchor'), 'href="#', $url_source); // Decode anchors.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_null'), 'href=""', $url_source); // Decode null links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_http'), 'href="http://', $url_source); // Decode existing HTTP links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_https'), 'href="https://', $url_source); // Decode existing HTTPS links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_ftp'), 'href="ftp://', $url_source); // Decode existing FTP links.

                    // ...and all program protocol addresses.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_js'), 'href="javascript:', $url_source); // Decode javascript links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_mailto'), 'href="mailto:', $url_source); // Decode email links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_aim'), 'href="aim:', $url_source); // Decode AIM links.
                    $url_source = str_ireplace(preg_quote('str_ireplace_url_callto'), 'href="callto:', $url_source); // Decode Skype links.
               }

               // Since base URLs have no effect on other paths, make all calls absolute.
               // Correct all image paths.
               $url_source = str_ireplace(preg_quote('src="/'), 'src="' . $url_submission_base . '/', $url_source);

               // Encode all existing absolute image references.
               $url_source = str_ireplace(preg_quote('src="http://'), 'str_ireplace_img_http', $url_source); // Encode HTTP image references.
               $url_source = str_ireplace(preg_quote('src="https://'), 'str_ireplace_img_https', $url_source); // Encode HTTPS image references.

               // Now, make all relative image references NOT preceded by a '/' absolute.
               $url_source = str_ireplace(preg_quote('src="'), 'src="' . $url_submission_base . '/', $url_source);

               // Decode our masked image references.
               $url_source = str_ireplace(preg_quote('str_ireplace_img_http'), 'src="http://', $url_source); // Decode HTTP image references.
               $url_source = str_ireplace(preg_quote('str_ireplace_img_https'), 'src="https://', $url_source); // Decode HTTPS image references.

               // Horrible stylesheet include function. 
               $url_source = str_ireplace(preg_quote('@import "/'), '@import "' . $url_submission_base . '/', $url_source);

 

Maybe there's a more efficient way of doing this, but I don't know what it is.
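
One possible alternative to the mask/unmask sequence is a single regular expression with a negative lookahead, so that anchors, empty links, and anything starting with a scheme are never touched in the first place. This is only a sketch under assumptions: `absolutize_hrefs` is a hypothetical helper name, it handles double-quoted href values only, and like the code above it resolves every relative link against the site root.

```php
<?php
/**
 * Hypothetical helper: prefix relative, double-quoted href values with a
 * base URL in one pass instead of masking and unmasking the exceptions.
 */
function absolutize_hrefs(string $html, string $base): string
{
    // Skip "#anchor", empty values, protocol-relative "//host" links and
    // anything that starts with a scheme (http:, mailto:, javascript:, ...).
    // The optional "/" after the lookahead swallows a leading slash.
    $pattern = '~href="(?!#|"|//|[a-z][a-z0-9+.-]*:)/?~i';

    // A callback keeps "$" or "\" in $base from being read as a backreference.
    $result = preg_replace_callback($pattern, function () use ($base) {
        return 'href="' . $base . '/';
    }, $html);

    return $result ?? $html; // fall back to the input if PCRE reports an error
}

$sample = '<a href="/a">1</a><a href="b.html">2</a><a href="http://x.org/">3</a>'
        . '<a href="#t">4</a><a href="">5</a><a href="mailto:me@x.org">6</a>';

echo absolutize_hrefs($sample, 'http://example.com'), "\n";
```

In this sample only the first two links are rewritten; the other four are left as they were.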

Delimiters.

 

preg_quote should only be used in conjunction with preg_* functions. It's not needed for str_ireplace and will botch things up. I still haven't seen the data this is failing on...
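
To make the "botch things up" part concrete, here is a small sketch of what preg_quote() does to one of the search strings. The sample `$html` and the example.com base are assumptions for illustration.

```php
<?php
$html = '<a href="/about.html">About</a>'; // hypothetical sample input

// preg_quote() puts a backslash in front of regex metacharacters such as '='.
$quoted = preg_quote('href="/');
echo $quoted, "\n"; // href\="/

// str_ireplace() treats the needle literally, and the backslashed text
// never occurs in the HTML, so nothing is replaced.
$unchanged = str_ireplace($quoted, 'href="http://example.com/', $html);

// The plain needle matches as expected.
$changed = str_ireplace('href="/', 'href="http://example.com/', $html);

echo $unchanged, "\n"; // <a href="/about.html">About</a>
echo $changed, "\n";   // <a href="http://example.com/about.html">About</a>
```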

 

I reverted back to str_ireplace.  The code below is my script to convert relative URLs to absolute URLs:

 

          // Grab the submission page's raw, unformatted source code.
          $url_source = file_get_contents($url_submission);

          // Parse the URL into a base URL (WITHOUT the trailing slash), just in case the webpage uses relative URL references.
          $url_submission_scheme = parse_url($url_submission, PHP_URL_SCHEME);
          $url_submission_host = parse_url($url_submission, PHP_URL_HOST);
          $url_submission_base = $url_submission_scheme . '://' . $url_submission_host;

               // Only the links on pages which DON'T define a base URL should be modified.
               if (!(preg_match('/' . preg_quote('<base href="') . '/', $url_source)))
               {
                    // Modifier to <a href="/path/to/dir.php">.
                    $url_source = str_ireplace('href="/', 'href="' . $url_submission_base . '/', $url_source);

                    // OK, now comes the weird part...we actually have to exclude exceptions by temporarily masking them during the conversion.
                    $url_source = str_ireplace('href="#', 'str_ireplace_url_anchor', $url_source); // Encode anchors.
                    $url_source = str_ireplace('href=""', 'str_ireplace_url_null', $url_source); // Encode null links.
                    $url_source = str_ireplace('href="http://', 'str_ireplace_url_http', $url_source); // Encode existing HTTP links.
                    $url_source = str_ireplace('href="https://', 'str_ireplace_url_https', $url_source); // Encode existing HTTPS links.
                    $url_source = str_ireplace('href="ftp://', 'str_ireplace_url_ftp', $url_source); // Encode existing FTP links.

                    // Mask all known program protocol links.
                    $url_source = str_ireplace('href="javascript:', 'str_ireplace_url_js', $url_source); // Encode javascript links.
                    $url_source = str_ireplace('href="mailto:', 'str_ireplace_url_mailto', $url_source); // Encode email links.
                    $url_source = str_ireplace('href="aim:', 'str_ireplace_url_aim', $url_source); // Encode AIM links.
                    $url_source = str_ireplace('href="callto:', 'str_ireplace_url_callto', $url_source); // Encode Skype links.

                    // Now that we've temporarily masked all link exceptions, we can rename ALL remaining links.
                    $url_source = str_ireplace('href="', 'href="' . $url_submission_base . '/', $url_source);

                    // Time to unmask our temporarily renamed links.
                    $url_source = str_ireplace('str_ireplace_url_anchor', 'href="#', $url_source); // Decode anchors.
                    $url_source = str_ireplace('str_ireplace_url_null', 'href=""', $url_source); // Decode null links.
                    $url_source = str_ireplace('str_ireplace_url_http', 'href="http://', $url_source); // Decode existing HTTP links.
                    $url_source = str_ireplace('str_ireplace_url_https', 'href="https://', $url_source); // Decode existing HTTPS links.
                    $url_source = str_ireplace('str_ireplace_url_ftp', 'href="ftp://', $url_source); // Decode existing FTP links.

                    // ...and all program protocol addresses.
                    $url_source = str_ireplace('str_ireplace_url_js', 'href="javascript:', $url_source); // Decode javascript links.
                    $url_source = str_ireplace('str_ireplace_url_mailto', 'href="mailto:', $url_source); // Decode email links.
                    $url_source = str_ireplace('str_ireplace_url_aim', 'href="aim:', $url_source); // Decode AIM links.
                    $url_source = str_ireplace('str_ireplace_url_callto', 'href="callto:', $url_source); // Decode Skype links.
               }

               // Since base URLs have no effect on other paths, make all calls absolute.
               // Correct all image paths.
               $url_source = str_ireplace('src="/', 'src="' . $url_submission_base . '/', $url_source);

               // Encode all existing absolute image references.
               $url_source = str_ireplace('src="http://', 'str_ireplace_img_http', $url_source); // Encode HTTP image references.
               $url_source = str_ireplace('src="https://', 'str_ireplace_img_https', $url_source); // Encode HTTPS image references.

               // Now, make all relative image references NOT preceded by a '/' absolute.
               $url_source = str_ireplace('src="', 'src="' . $url_submission_base . '/', $url_source);

               // Decode our masked image references.
               $url_source = str_ireplace('str_ireplace_img_http', 'src="http://', $url_source); // Decode HTTP image references.
               $url_source = str_ireplace('str_ireplace_img_https', 'src="https://', $url_source); // Decode HTTPS image references.

               // Horrible stylesheet include function. 
               $url_source = str_ireplace('@import "/', '@import "' . $url_submission_base . '/', $url_source);

          // Escape both values before they go into the query; the submitted URL is user input too.
          // (The mysql_* functions were removed in PHP 7; mysqli or PDO prepared statements replace them.)
          $url_submission_formatted = mysql_real_escape_string($url_submission);
          $url_source_formatted = mysql_real_escape_string($url_source);

          // Insert all data into a new record.
          $url_sql = 'INSERT INTO `webpages` (`id`, `url`, `source`, `creation`)
                         VALUES (NULL, \'' . $url_submission_formatted . '\', \'' . $url_source_formatted . '\', NOW());';
          mysql_query($url_sql)
               or die('We were unable to process your link. Please try <a href="' . $abs_url . '/mirror/submit.php">resubmitting</a>. (Error: ' . mysql_error() . ')');
          echo $url_source;
     }
}

 

What happens is that only random links are modified (very frustrating).  I'll use Google's main page as an example:

 

<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><title>Google</title><style>body,td,a,p,.h{font-family:arial,sans-serif}.h{font-size:20px}.h{color:#3366cc}.q{color:#00c}.ts td{padding:0}.ts{border-collapse:collapse}.lnc:link,.lnc:visited{color:#00c}.pgtab,.pgtab:hover,.pgtabselected,.pgtabside{text-align:center;text-decoration:none;color:#00c;display:block;height:27px;float:left;overflow:hidden;background:url(/intl/ja/images/productlinktabs.png) no-repeat;padding-top:8px}.pgtab{width:130px;background-position:-274px 0}.pgtab:hover{width:130px;background-position:-144px 0}.pgtabselected{width:144px}.pgtabside{width:3px;background-position:-404px 0}.ptr{cursor:pointer;cursor:hand}.iconl{background:url() no-repeat;overflow:hidden;height:px;width:px}#gbar{float:left;height:22px;padding-left:2px}.gbh,.gb2 div{border-top:1px solid #c9d7f1;font-size:0;height:0}.gbh{position:absolute;top:24px;width:100%}.gb2 div{margin:5px}#gbi{background:#fff;border:1px solid;border-color:#c9d7f1 #36c #36c #a2bae7;font-size:13px;top:24px;z-index:1000}#guser{padding-bottom:7px !important}#gbar,#guser{font-size:13px;padding-top:1px !important}@media all{.gb1,.gb3{height:22px;margin-right:.73em;vertical-align:top}.gb2 a,.gb2 b{display:block;padding:.2em .5em}}#gbi,.gb2{display:none;position:absolute;width:8em}.gb2{z-index:1001}#gbar a{color:#00c}.gb2 a,.gb3 a{text-decoration:none}#gbar .gb2 a:hover{background:#36c;color:#fff;display:block}</style><script>window.google={kEI:"ZdkGSPT0AZjgggTswOiaCQ",kEXPI:"17259,17735",kHL:"en"};
function sf(){document.f.q.focus()}
window.clk=function(b,c,d,e,f,g){if(document.images){var a=encodeURIComponent||escape;(new Image).src="http://www.google.com/url?sa=T"+(c?"&oi="+a(c):"")+(d?"&cad="+a(d):"")+"&ct="+a(e)+"&cd="+a(f)+(b?"&url="+a(b.replace(/#.*/,"")).replace(/\+/g,"%2B"):"")+"&ei=ZdkGSPT0AZjgggTswOiaCQ"+g}return true};
window.gbar={};(function(){var a=window.gbar,b,g,h;function l(c,f,e){c.display=h?"none":"block";c.left=f+"px";c.top=e+"px"}a.tg=function(c){var f=0,e=0,d,m=0,n,j=window.navExtra,k,i=document;g=g||i.getElementById("gbar").getElementsByTagName("span");(c||window.event).cancelBubble=!m;if(!b){b=i.createElement(Array.every||window.createPopup?"iframe":"DIV");b.frameBorder="0";b.scrolling="no";b.src="http://www.google.com/#";g[7].parentNode.appendChild(b).id="gbi";if(j&&g[7])for(n in j){k=i.createElement("span");k.appendChild(j[n]);g[7].parentNode.insertBefore(k,g[7]).className="gb2"}i.onclick=a.close}while(d=g[++m]){if(e){l(d.style,e+1,f+25);f+=d.firstChild.tagName=="DIV"?9:20}if(d.className=="gb3"){do e+=d.offsetLeft;while(d=d.offsetParent)}}b.style.height=f+"px";l(b.style,e,24);h=!h};a.close=function(c){h&&a.tg(c)}})();</script></head><body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onload="sf();if(document.images){new Image().src='/images/nav_logo3.png'}" topmargin=3 marginheight=3><div id=gbar><nobr><span class=gb1><b>Web</b></span> <span class=gb1><a href="http://images.google.com/imghp?hl=en&tab=wi">Images</a></span> <span class=gb1><a href="http://maps.google.com/maps?hl=en&tab=wl">Maps</a></span> <span class=gb1><a href="http://news.google.com/nwshp?hl=en&tab=wn">News</a></span> <span class=gb1><a href="http://www.google.com/prdhp?hl=en&tab=wf">Shopping</a></span> <span class=gb1><a href="http://mail.google.com/mail/?hl=en&tab=wm">Gmail</a></span> <span class=gb3><a href="http://www.google.com/intl/en/options/" onclick="this.blur();gbar.tg(event);return !1"><u>more</u> <small>&#9660;</small></a></span> <span class=gb2><a href="http://video.google.com/?hl=en&tab=wv">Video</a></span> <span class=gb2><a href="http://groups.google.com/grphp?hl=en&tab=wg">Groups</a></span> <span class=gb2><a href="http://books.google.com/bkshp?hl=en&tab=wp">Books</a></span> <span class=gb2><a 
href="http://scholar.google.com/schhp?hl=en&tab=ws">Scholar</a></span> <span class=gb2><a href="http://finance.google.com/finance?hl=en&tab=we">Finance</a></span> <span class=gb2><a href="http://blogsearch.google.com/?hl=en&tab=wb">Blogs</a></span> <span class=gb2><div></div></a></span> <span class=gb2><a href="http://www.youtube.com/?hl=en&tab=w1">YouTube</a></span> <span class=gb2><a href="http://www.google.com/calendar/render?hl=en&tab=wc">Calendar</a></span> <span class=gb2><a href="http://picasaweb.google.com/home?hl=en&tab=wq">Photos</a></span> <span class=gb2><a href="http://docs.google.com/?hl=en&tab=wo">Documents</a></span> <span class=gb2><a href="http://www.google.com/reader/view/?hl=en&tab=wy">Reader</a></span> <span class=gb2><div></div></a></span> <span class=gb2><a href="http://www.google.com/intl/en/options/">even more »</a></span> </nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div><div align=right id=guser style="font-size:84%;padding:0 0 4px" width=100%><nobr><a href="http://www.google.com/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg">iGoogle</a> | <a href="http://swww.google.com/accounts/Login?continue=http://www.google.com/&hl=en">Sign in</a></nobr></div><center><br clear=all id=lgpd><img alt="Google" height=110 src="http://www.google.com/intl/en_ALL/images/logo.gif" width=276><br><br><form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top><td width=25%> </td><td align=center nowrap><input name=hl type=hidden value=en><input maxlength=2048 name=q size=55 title="Google Search" value=""><br><input name=btnG type=submit value="Google Search"><input name=btnI type=submit value="I'm Feeling Lucky"></td><td nowrap width=25%><font size=-2>  <a href=/advanced_search?hl=en>Advanced Search</a><br>  <a href=/preferences?hl=en>Preferences</a><br>  <a href=/language_tools?hl=en>Language 
Tools</a></font></td></tr></table></form><br><br><font size=-1><a href="http://www.google.com/intl/en/ads/">Advertising Programs</a> - <a href="http://www.google.com/services/">Business Solutions</a> - <a href="http://www.google.com/intl/en/about.html">About Google</a></font><p><font size=-2>©2008 Google</font></p></center></body></html>

 

Note that half of the relative links are properly converted, but then it just seems to drop off...

I think I got it fixed.  It turns out that the Google code that wouldn't change was formatted as <a href=page.html>.  Baaaad Google.  >:(
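
For anyone hitting the same unquoted-attribute problem, a pattern with an optional quote can cover both `href="page.html"` and `href=page.html`. The sketch below is one way to do it, not the exact fix used here: `absolutize_any_href` is a hypothetical helper name, and it resolves relative links against the site root like the script above.

```php
<?php
/**
 * Hypothetical helper: make relative href values absolute whether or not
 * the attribute value is quoted.
 */
function absolutize_any_href(string $html, string $base): string
{
    // Group 1: optional opening quote. Group 2: the start of the value,
    // up to the next quote, whitespace or '>'.
    $pattern = '~(?<![\w-])href\s*=\s*(["\']?)([^"\'\s>]*)~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($base) {
        [$whole, $quote, $value] = $m;

        // Leave empty values, anchors, protocol-relative links and anything
        // that already carries a scheme exactly as they were.
        if ($value === '' || preg_match('~^(#|//|[a-z][a-z0-9+.-]*:)~i', $value)) {
            return $whole;
        }

        return 'href=' . $quote . $base . '/' . ltrim($value, '/');
    }, $html);

    return $result ?? $html; // fall back to the input if PCRE reports an error
}

$sample = '<a href=/preferences?hl=en>P</a> <a href=page.html>Q</a> '
        . '<a href="/x">R</a> <a href="http://example.org/">S</a>';

echo absolutize_any_href($sample, 'http://www.google.com'), "\n";
```

A real HTML parser would be more robust than any regular expression for this, but the pattern above at least stops depending on the page author quoting attributes.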

 

I also fixed the complete halt of replacements by upping my php_value memory_limit to 36M.
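
`php_value memory_limit` is the Apache/.htaccess form of the setting. Where the host allows it, the same limit can usually be raised from inside the script, and checking peak usage shows how close the script gets to it. This is a small sketch that assumes runtime changes to `memory_limit` are permitted.

```php
<?php
// Runtime counterpart of the .htaccess line "php_value memory_limit 36M".
ini_set('memory_limit', '36M');

$limit     = ini_get('memory_limit');    // the limit now in effect
$peakBytes = memory_get_peak_usage(true); // peak memory taken from the system

printf("limit=%s, peak=%.1f MiB\n", $limit, $peakBytes / 1048576);
```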

 

Thanks for the help.  :)

Archived

This topic is now archived and is closed to further replies.
