savagenoob Posted October 7, 2011

I have this code:

$proname1 = preg_match_all('/<div class=("|\')agentContainer("|\')>(\n\s)<div class="strong">(\n\s)(.*?)(\n\s)<\/div>/', $html, $name1);

It puts everything between these tags into an array, but the content contains newlines and whitespace, so the array ends up with empty entries. How do I strip the whitespace and newlines before the data gets into the array? The data I'm getting looks like this:

<div class="agentContainer">
<div class="strong">
Blah Blah Company
</div>

"Blah Blah Company" isn't showing up in the array, but I know the regex is working.

https://forums.phpfreaks.com/topic/248640-preg_match_all-help-with-whitespace-and-newlines/

premiso Posted October 7, 2011

If you have valid HTML (and if it isn't valid, make it valid), use the DOM. It makes parsing HTML and pulling out items much easier than regex, and it does not require attributes to appear in a specific position or order.

premiso Posted October 7, 2011

Since I was interested in making it work for my own purposes, here is how it would be done:

<?php
$string = '<html><head></head><body>
<div class="anotherClass">
</div>
<div class="agentContainer">
<div class="strong">
Blah Blah Company
</div>
</div>
<div class="moreDiv">
</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);

// Select every div.strong nested inside a div.agentContainer.
$query = '//div[@class="agentContainer"]//div[@class="strong"]';
$entries = $xpath->query($query);

foreach ($entries as $entry) {
    // trim() strips the surrounding newlines and spaces.
    echo trim($entry->nodeValue) . "\n\n";
}

savagenoob Posted October 7, 2011 (Author)

I get the DOM reply every time I ask a regex question, lol. I know DOM is better for straight scraping, but I don't think GoDaddy has DOM enabled on my server, for some very odd and dumb reason. I was trying to get simple_html_dom.php working yesterday and even the examples wouldn't output anything. Plus, some of the sites I scrape use POST to get data, so I use cURL a lot.

premiso Posted October 7, 2011

"but I dont think godaddy has DOM enabled on my server"

Whelp, there's your problem... GoDaddy. Anyhow, that's beside the point.

$proname1 = preg_match_all('/<div class=["\']agentContainer["\']>.*<div class="strong">(.*?)<\/div>/s', $html, $name1);

The 's' modifier makes the . match newline characters as well, so the pattern can match across lines. Since I highly doubt you need to capture which quote type was used, put both quotes in a character class (square brackets); that matches either one without capturing it. After you get the matches out, just run trim on the values in $name1 to remove the line breaks and extra spaces, and you should be good.

FYI, post regex questions in the Regex forum. The m tags are used to link to the PHP Manual; use code or php tags for code.
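
A minimal sketch of the match-then-trim approach described above, run against made-up sample markup ($html and the second company name are assumptions for illustration, not the real page). It departs from the pattern as posted in one respect: the greedy .* is swapped for the lazy .*? so that each agentContainer block produces its own match instead of one match spanning several blocks.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = <<<HTML
<div class="agentContainer">
    <div class="strong">
        Blah Blah Company
    </div>
</div>
<div class='agentContainer'>
    <div class="strong">
        Another Company
    </div>
</div>
HTML;

// s (PCRE_DOTALL) lets . match newlines; .*? keeps each match inside one block.
$pattern = '/<div class=["\']agentContainer["\']>.*?<div class="strong">(.*?)<\/div>/s';

$count = preg_match_all($pattern, $html, $matches);

// $matches[1] holds the first capture group of every match.
$names = array_map('trim', $matches[1]);

print_r($names);
```

Note that preg_match_all returns the number of matches, so $proname1 in the posts above is a count; the captured text lives in the third argument.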

savagenoob Posted October 7, 2011 (Author)

Thanks premiso, and sorry for the wrong tags and the placement of the post, but I hear *crickets* in the regex forum. Now this is a PHP question: your code almost got me there, I think. It put all the information into one element of one array. That actually makes it easy to strip and trim, but I still have one long friggin' string with all the info I need. It looks like this:

Business 1 1111 Something Blvd #5Sacramento, CA 11111(916) 111-1111 View Agency Details & Map / E-mail this Agency Business 2 2222 Sierra Gate PlazaSomewhere, CA 22222(916) 222-2222 View Agency Details & Map / E-mail this Agency

and so on. Is there a way to chop this up? I would imagine explode would be a nightmare...
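
One possible way to chop up a string like that, sketched under two assumptions that come only from the sample in the post: every record ends with the same "View Agency Details & Map / E-mail this Agency" text, and phone numbers look like (916) 111-1111. The variable names ($blob, $trailer, $records) are made up for illustration, and real data may need looser patterns.

```php
<?php
// Sample copied from the post above; real data may vary in ways this does not handle.
$blob = 'Business 1 1111 Something Blvd #5Sacramento, CA 11111(916) 111-1111 View Agency Details & Map / E-mail this Agency Business 2 2222 Sierra Gate PlazaSomewhere, CA 22222(916) 222-2222 View Agency Details & Map / E-mail this Agency';

// Assumption: every record ends with this exact trailer text.
$trailer = 'View Agency Details & Map / E-mail this Agency';

$records = array();
foreach (explode($trailer, $blob) as $chunk) {
    $chunk = trim($chunk);
    if ($chunk === '') {
        continue; // skip the empty piece after the last trailer
    }

    // Assumption: phone numbers look like (916) 111-1111.
    $phone = null;
    if (preg_match('/\(\d{3}\)\s*\d{3}-\d{4}/', $chunk, $m)) {
        $phone = $m[0];
        // Everything before the phone number is the name plus address.
        $chunk = trim(substr($chunk, 0, strpos($chunk, $phone)));
    }

    $records[] = array('details' => $chunk, 'phone' => $phone);
}

print_r($records);
```

The name and address stay fused in 'details' because the sample gives no reliable delimiter between them; capturing each field with its own group at the regex (or DOM) stage would avoid the fused string in the first place.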