JBud Posted March 15, 2010

Hey guys, I'm trying out website scraping for the first time (in particular on Craigslist, getting a list of jobs from certain areas), and I keep getting this error about the memory size. I've tried setting up methods with pure string manipulation:

// <p> Mar 11 - <a href="/van/sof/1639429290.html">Sr. Graphics SW Architect.</a> - </p>
function extractNextLink(&$code) {
    $keyBeginning = '<p>';
    $keyEnd = '</p>';
    $code = stristr($code, $keyBeginning); // ****** error is reported on this line
    if ($code === FALSE) return "a";
    $hitEnd = strpos($code, $keyEnd);
    if ($hitEnd === FALSE) return "b";
    $nextLink = substr($code, 0, $hitEnd);
    // Note: we don't trim $code here, since the next search is for '<p><a href' and the
    // leading <p> was already removed anyway (means less large-string parsing).
    return $nextLink;
}

and regex:

function getFirstElement(&$code) {
    $pattern = '/<p><a href="http:\/\/[a-zA-Z]{1,15}\.craigslist\.[a-z]{1,5}\/[a-z]{1,5}\/[a-z]{1,5}\/[0-9]{1,10}\.html">/';
    preg_match($pattern, $code, $matches, PREG_OFFSET_CAPTURE); // ****** error is reported on this line
    $beginning = $matches[0][1] + strlen($matches[0][0]);

    $pattern = '/<\/a> - <font size="-1">/';
    preg_match($pattern, $code, $matches, PREG_OFFSET_CAPTURE);

    $end = -1;
    $endLen = 0;
    foreach ($matches as $match) {
        $end = $match[1];
        if ($end > $beginning) {
            $endLen = strlen($match[0]);
            break;
        }
    }

    if ($end == -1) return "";

    $returnable = substr($code, $beginning, $end - $beginning);
    $code = substr($code, $end + $endLen);
    return $returnable;
}

The ******'s mark the lines where the error comes up in both methods. What am I doing wrong here? How can I go about extracting these links from the code more efficiently? Any ideas? Thanks =]
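For anyone hitting the same wall: rather than slicing the page up with repeated stristr()/substr() calls (each of which copies the remaining string), PHP's DOM extension can parse the page once and walk the nodes. Here is a minimal sketch along those lines, assuming a listing page shaped like the snippet in the post; the URL and the XPath query are illustrative guesses, not verified against Craigslist's actual markup:

<?php
// Sketch: parse the listing page once with DOMDocument instead of repeated string copies.
// The URL and XPath expression below are assumptions for illustration only.

$html = file_get_contents('http://vancouver.craigslist.org/sof/'); // hypothetical listing URL

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // listing pages are rarely valid XML; suppress parse warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();

// Grab every anchor inside a <p>, matching rows like "<p> Mar 11 - <a href=...>Title</a> - </p>"
foreach ($xpath->query('//p/a[@href]') as $a) {
    $links[] = array(
        'title' => trim($a->nodeValue),
        'href'  => $a->getAttribute('href'),
    );
}

print_r($links);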
schilly Posted March 15, 2010

you can try upping the memory limit of PHP:

ini_set('memory_limit', '32M');
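If you do raise the limit, it can also help to log how much memory the script actually uses, to see whether the limit or the parsing approach is the real problem. A quick check (memory_get_peak_usage() needs PHP 5.2+):

// Rough check of where the memory goes (values are in bytes).
echo memory_get_usage() . "\n";        // current usage
echo memory_get_peak_usage() . "\n";   // high-water mark so far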
JBud Posted March 15, 2010 (Author)

Hey Schilly, thanks for the tip. I should have mentioned it in my post, but I already came across this solution, and I'd really rather find a more efficient way to parse the HTML. Increasing the memory limit is only really a temporary solution =[
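For what it's worth, one way to cut down on the copying is to collect all the matches in a single pass with preg_match_all() and never reassign $code at all, instead of shrinking it after every hit. A rough sketch; the pattern is a loosened, untested adaptation of the one in the first post:

// Collect every job link in one pass instead of repeatedly shrinking $code.
function extractAllLinks($code) {
    // Loose illustrative pattern: any <p><a href="...">title</a> row.
    $pattern = '/<p><a href="([^"]+)">(.*?)<\/a>/i';

    if (!preg_match_all($pattern, $code, $matches, PREG_SET_ORDER)) {
        return array();
    }

    $links = array();
    foreach ($matches as $m) {
        $links[] = array('href' => $m[1], 'title' => $m[2]);
    }
    return $links;
}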