thebadbad

Everything posted by thebadbad

  1. I would probably parse each URL with parse_url(), for more reliable results: <?php $str = '<a href="http://www.mediafire.com/something">http://www.mediafire.com/something</a> text text text text text text text text text text text <a href="http://www.somelink.com"></a> text text text text text text text <a href="http://megaupload.com/something/blabla">Some link without WWW</a> text text text text text text text text text text text text text text <a title="somename" href="http://www.rapidshare.com/download" style="color:#000000" target="_blank">some confusing link</a> ...........more text here.......... <a href="http://www.microsoft.com"></a> ....more text........... <a href="http://www.4shared.com/download.php?file=myfile"><img src="a_link_with_an_image.gif"></a>'; function _callback($matches) { $domains = array('mediafire.com', 'megaupload.com', 'rapidshare.com', '4shared.com'); $domain = parse_url($matches[2], PHP_URL_HOST); //remove any sub domains $parts = array_reverse(explode('.', $domain)); $domain = "{$parts[1]}.{$parts[0]}"; if (in_array($domain, $domains)) { $matches[0] = "[DL]{$matches[0]}[/DL]"; } return $matches[0]; } $str = preg_replace_callback('~<a\b[^>]+\bhref\s?=\s?([\'"])(.+?)\1[^>]*>.*?</a>~is', '_callback', $str); echo $str; ?> If you add a domain with a double TLD (e.g. .co.uk) to the $domains array, you would have to rewrite the code.
  2. If you're looking for a more robust way of translating relative paths to absolute paths, there's a function at http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/. A way to use it: <?php function relative2absolute($absolute, $relative) { $p = @parse_url($relative); if(!$p) { //$relative is a seriously malformed URL return false; } if(isset($p["scheme"])) return $relative; $parts=(parse_url($absolute)); if(substr($relative,0,1)=='/') { $cparts = (explode("/", $relative)); array_shift($cparts); } else { if(isset($parts['path'])){ $aparts=explode('/',$parts['path']); array_pop($aparts); $aparts=array_filter($aparts); } else { $aparts=array(); } $rparts = (explode("/", $relative)); $cparts = array_merge($aparts, $rparts); foreach($cparts as $i => $part) { if($part == '.') { unset($cparts[$i]); } else if($part == '..') { unset($cparts[$i]); unset($cparts[$i-1]); } } } $path = implode("/", $cparts); $url = ''; if($parts['scheme']) { $url = "$parts[scheme]://"; } if(isset($parts['user'])) { $url .= $parts['user']; if(isset($parts['pass'])) { $url .= ":".$parts['pass']; } $url .= "@"; } if(isset($parts['host'])) { $url .= $parts['host']."/"; } $url .= $path; return $url; } $raw = preg_replace_callback( '~\b(href|src)\s?=\s?([\'"])(.+?)\2~is', create_function( '$matches', 'return $matches[1] . \'=\' . $matches[2] . relative2absolute(\'http://www.domain.com/\', $matches[3]) . $matches[2];' ), $raw ); ?>
  3. Small alteration of thorpe's code: <?php $urls = array(); foreach ($array as $arr) { if ($arr['name'] == 'indie') { $urls[] = $arr['url']; } } //see contents of the $urls array echo '<pre>' . print_r($urls, true) . '</pre>'; ?>
  4. If we first split the string at the Vegetables header, we can then grab each vegetable and amount with a regular expression and then do whatever we want with the data: <?php $html = <<<HTML <HTML> <HEAD> <TITLE>Inventory</TITLE> </HEAD> <BODY> <H2>Inventory</H2> for <B>Monday, December 5, 2009</B><BR> <BR> <A NAME="I1"> <B>Fruits</B><BR> <FONT SIZE="-1"><A NAME="F1"> <B>Apples</B><BR> 10<BR> <B>Pears</B><BR> 5<BR> </FONT> <FONT SIZE="-2"><A HREF="index.html">Return to home...</A></FONT><BR CLEAR="LEFT"> <HR> <B>Vegetables</B> <BR> <FONT SIZE="-1"><A NAME="V1"> <B>Corn Cobs</B><BR> 3<BR> <A NAME="S5795_3"><B>Lettuce Heads</B><BR> 10<BR> <A NAME="S5795_5"><B>Potatoes</B><BR> 3<BR> </FONT> <FONT SIZE="-2"><A HREF="index.html">Return to home...</A></FONT><BR CLEAR="LEFT"> <HR> <BR> </BODY> </HTML> HTML; list(, $html) = explode('<B>Vegetables</B> <BR>', $html, 2); preg_match_all('~<B>([^<]+)</B><BR>\s*([0-9]+)<BR>~i', $html, $matches, PREG_SET_ORDER); //print structured data echo '<table>'; foreach ($matches as $match) { echo "\n\t<tr><td>{$match[1]}</td><td>{$match[2]}</td></tr>"; } echo "\n</table>"; ?>
  5. I did see your PM, but the problem is that the script doesn't work for this season, and I haven't had the time to fix or rewrite it yet. I may get around to it at some point.
  6. Where did your delimiters go in the site4, site5 and site6 prefixes? Your code will fail to work when you mess up the delimiters. To include both single and double quotes in a string, you either have to escape the one used as string delimiter (not to be confused with regex delimiter) or e.g. use the heredoc syntax: $str = 'resultTitle\' id=\'infopei\'><a href="'; $str = "resultTitle' id='infopei'><a href=\""; or $str = <<<HTML resultTitle' id='infopei'><a href=" HTML; And I would incorporate the use of preg_quote() instead of your approach, to separate literal text from the regex pattern: <?php define('REGEX', '([^\s]*?)'); //quite important to make the quantifier lazy in your case, to end the match at the first occurrence of the suffixes define('DELIMITER', '~'); define('MODIFIERS', 'i'); $parts = array( array('<p class="g"><font size="-2"><b></b></font> <a href="', '">'), array('<span class=title><a href="', '">'), array('<p class="g"><font size="-2"><b></b></font> <a href="', ''), array('NONE', ''), array('<p class=g><!--m--><a href=', '>'), array('<h2 class=r><a class=l href="', '">') ); //test with first prefix and suffix $pattern = DELIMITER . preg_quote($parts[0][0], DELIMITER) . REGEX . preg_quote($parts[0][1], DELIMITER) . DELIMITER . MODIFIERS; preg_match($pattern, $data, $match); echo $match[1]; ?>
  7. Because there's a line break between the two divs. Try adding \s* between them, and you should also make your quantifier lazy by adding a question mark after .* (so the match stops at the first </div> it encounters, not the last).
  8. Just run $name through rawurlencode() (as it seems their system doesn't like plus chars). Otherwise, str_replace(' ', '%20', $name) should work fine.
  9. That would be the source code you're retrieving. If you 'trust' the URL you're grabbing, you could simply do <?php //$data holds the source code of the remote page preg_match('~<p class="g"><font size="-2"><b></b></font> <a href="([^"]*)">~i', $data, $match); echo $match[1]; ?> Assuming the prefix and suffix actually match with the source code. Else if you want to keep your URL pattern, try this, using a modified version of the pattern you provided (it had some errors/opportunities for improvement): <?php preg_match('~<p class="g"><font size="-2"><b></b></font> <a href="(https?://[a-z0-9]+(?:[-.][a-z0-9]+)*\.[a-z]{2,6}(?::[0-9]{1,5})?(?:/.*?)?)">~is', $data, $match); echo $match[1]; ?> @salathe You forgot to add the delimiter as the second parameter to preg_quote().
  10. Sorry, forgot the rest of the expression. And I just realized that there's no need to run strip_tags() with the second parameter before translating the tags in question to BBCode. Updated code: <?php $content = '<div> <p>This is an image </p> <img src="http://image.info/200910/186336.jpg" border="0" alt="" /><br /><br /> </div>'; $replace = array( '~<img\b[^>]+\bsrc\s?=\s?([\'"])(.*?)\1[^>]*>~is' => '[img=$2]', '~<b\b[^>]*>(.*?)</b>~is' => '[b]$1[/b]' ); $content = preg_replace(array_keys($replace), $replace, $content); $content = strip_tags($content); ?>
  11. < and > function as pattern delimiters in your pattern <img>, thus only the literal img is replaced. That probably doesn't make sense to you yet, but here's how you could do it: <?php $content = '<div> <p>This is an image </p> <img src="http://image.info/200910/186336.jpg" border="0" alt="" /><br /><br /> </div>'; $content = strip_tags($content, '<img>'); $content = preg_replace('~<img\b[^>]+\bsrc\s?=\s?([\'"])(.*?)\1~is', '[img=$2]', $content); ?> Just ask if you need something explained, and I'm sure a kind soul (if not me) will help you understand.
  12. If the syntax is strictly as in your sample (i.e. with no single quotes in the random text), you could also use a single preg_match_all() call: <?php $source = file_get_contents('filename.txt'); preg_match_all('~\'([^\']+)\'~', $source, $matches); $data = implode('', $matches[1]); ?>
  13. Google doesn't allow automated searches. But apart from that, you could scrape the result pages (it may be a good idea to set the user agent string before you load the pages with either file_get_contents() or cURL), put the page links into an array and store it, then repeat some other day and compare the arrays with some of the array functions (or MySQL functions if you store the information in a database). The hardest part would be grabbing the result links from the source code. It should be doable with the proper regular expression, or maybe with PHP DOM (but I doubt it, looking at Google's source code). E.g. I first thought that the result link anchors had the exclusive class l (lowercase L), but book search and news results have it too.
  14. Here's an idea: <?php $string = 'http://site1.com/file.php http://site5.com/file.php http://www.site2.com/file.php'; //grab every URL preg_match_all('~https?://[^" ]+~i', $string, $matches); //filter out the domains not on our whitelist function _callback($url) { $whitelist = array( 'site1.com', 'www.site1.com', 'site2.com', 'www.site2.com', 'site3.com', 'www.site3.com' ); return in_array(parse_url($url, PHP_URL_HOST), $whitelist); } $urls = array_filter($matches[0], '_callback'); echo '<pre>' . print_r($urls, true) . '</pre>'; ?> But there are a few problems here. Firstly, the regular expression isn't perfect (mainly because it's also supposed to grab 'plain' URLs that aren't part of an HTML tag with delimiting quotes), and secondly, the whitelist currently must contain all variants of the host names, i.e. including subdomains. But I'm sure you can find a function to return the pure domain (it's a bit tricky because you have to take into account 'double TLDs' like .co.uk). If you don't need to extract 'plain' URLs (see above) from the page, but only URLs from href (and possibly src) attributes, you can use this safer regular expression instead: '~\b(?:href|src)\s?=\s?([\'"])(.+?)\1~is' and then feed $matches[2] to array_filter().
  15. If you want to run the function and use its output in the replacement, you would have to use preg_replace_callback() e.g.: <?php $message = preg_replace_callback( '#\[code=(.*?)\](.*?)\[/code\]#i', create_function( '$matches', 'return \'<div class="codeblock">\' . geshify($matches[1], $matches[2]) . \'</div>\';' ), $message ); ?>
  16. Addition: Forgot to grab the titles. Although my for loop isn't that elegant. <?php $page = 1; $data = array(); while (true) { $html = file_get_contents('http://www.mytinyphone.com/ringtones/classical/?page_ring=' . $page++); $match_count = preg_match_all('~href="/ringtone/([0-9]+)/"><img[^>]*>(.*?)</a>~is', $html, $matches); if ($match_count > 0) { for ($i = 0; $i < $match_count; $i++) { $data[] = array($matches[1][$i], $matches[2][$i]); } } else { //page doesn't exist break; } } echo '<pre>' . print_r($data, true) . '</pre>'; ?>
  17. A way of doing it: <?php $page = 1; $ids = array(); while (true) { $html = file_get_contents('http://www.mytinyphone.com/ringtones/classical/?page_ring=' . $page++); $match_count = preg_match_all('~href="/ringtone/([0-9]+)/~i', $html, $matches); if ($match_count > 0) { $ids = array_merge($ids, $matches[1]); } else { //page doesn't exist break; } } echo '<pre>' . print_r($ids, true) . '</pre>'; ?> This will load all pages though, and thus probably time out, but it should work with the appropriate settings (assuming the website in question doesn't cut you off due to too many requests). I'm not too sure about this, but maybe it could be optimized by loading all the page sources into a single string first and then running a single preg_match_all() on the huge string. I don't know if it would be more efficient.
  18. Simple. $_GET['ID'] will contain the id, and janusmccarthy already pointed you to the manual page for the setcookie() function.
  19. Or alternatively <?php $balls = range(1, 36); shuffle($balls); echo implode('<br />', array_slice($balls, 0, 3)); //or simply access the random numbers via $balls[0], $balls[1] and $balls[2] ?>
  20. That's a capital O in his sample, not a zero. Easy mistake to make though.
  21. I would probably go with preg_replace_callback() (with nested preg_replace() calls): <?php $str = '(S2O3)-2'; $str = preg_replace_callback( '~\([^)]*\)~', create_function( '$matches', 'return preg_replace(\'~[0-9]+~\', \'<sub>$0</sub>\', $matches[0]);' ), $str ); echo $str; ?>
  22. The simple way: <?php if (isset($_SERVER['HTTP_REFERER'])) { $info = parse_url($_SERVER['HTTP_REFERER']); if (isset($info['host'])) { echo 'Welcome ' . htmlentities($info['host']) . ' Members.'; } else { //invalid referrer } } else { //referrer not set } ?>
  23. Sure. It's always good to point things out, even if they're slightly off topic. People might actually learn something!
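
Posts 1 and 14 above both note that 'double TLDs' such as .co.uk break the approach of simply keeping the last two labels of the host name. Here is a minimal sketch of one way around that, assuming a hand-maintained list of double TLDs. The function name registrable_domain() and the $doubleTlds list are hypothetical and not part of the original posts, and the list is deliberately incomplete (a complete solution would probably consult something like the Public Suffix List).

```php
<?php
// Hypothetical helper: reduce a host name to its registrable domain.
// $doubleTlds is a tiny illustrative list, NOT a complete one.
function registrable_domain(string $host, array $doubleTlds = ['co.uk', 'org.uk', 'com.au', 'co.jp']): string
{
    $parts = explode('.', strtolower($host));
    $count = count($parts);

    // Hosts with one or two labels have nothing to strip.
    if ($count <= 2) {
        return implode('.', $parts);
    }

    // Keep three labels when the last two form a known double TLD,
    // otherwise keep two.
    $lastTwo = $parts[$count - 2] . '.' . $parts[$count - 1];
    $keep = in_array($lastTwo, $doubleTlds, true) ? 3 : 2;

    return implode('.', array_slice($parts, -$keep));
}
```

With something like this, the in_array() checks in those posts could compare registrable_domain($host) against the domain list, after first checking that parse_url() actually returned a host.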
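
To illustrate the difference behind the rawurlencode() suggestion in post 8, here is a small comparison. The sample value of $name is made up for the example; the three functions are standard PHP.

```php
<?php
// Hypothetical sample value; any string with spaces or plus signs works.
$name = 'Final Fantasy + OST';

$raw   = rawurlencode($name);            // spaces become %20, "+" becomes %2B
$plain = urlencode($name);               // spaces become "+", "+" becomes %2B
$naive = str_replace(' ', '%20', $name); // only spaces are touched
```

If the remote system chokes on plus characters, $raw is the safest of the three, since $plain introduces pluses and $naive leaves any existing ones in place.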
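
Several of the posts above (2, 15 and 21) pass create_function() to preg_replace_callback(). That function was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions an anonymous function would be used instead. Here is a sketch of the chemical-formula example from post 21 rewritten that way; the wrapper name subscript_digits_in_parens() is hypothetical and only there to make the snippet easy to call.

```php
<?php
// Same idea as post 21, with an anonymous function in place of
// create_function(). The wrapper function name is an assumption.
function subscript_digits_in_parens(string $str): string
{
    return preg_replace_callback(
        '~\([^)]*\)~', // each parenthesised group
        function (array $matches): string {
            // wrap every run of digits inside the group in <sub> tags
            return preg_replace('~[0-9]+~', '<sub>$0</sub>', $matches[0]);
        },
        $str
    );
}
```

Calling subscript_digits_in_parens('(S2O3)-2') leaves the trailing -2 untouched, because only text inside the parentheses is handed to the callback.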