
preg_match_all Help with Whitespace and Newlines


savagenoob


I have code:

$proname1 = preg_match_all('/<div class=("|\')agentContainer("|\')>(\n\s)<div class="strong">(\n\s)(.*?)(\n\s)<\/div>/', $html, $name1);

This puts everything between these tags into an array, but the info contains newlines and whitespace, so the array ends up with empty entries. How do I strip the whitespace and newlines before the data gets into the array? The data I'm getting looks like...

 

<div class="agentContainer">
<div class="strong">
	Blah Blah Company
</div>

And Blah Blah Company isn't showing up in the array, but I know the regex is working.

Since I was interested in making it work for my own purposes, here is how it would be done with DOM:

 

<?php
$string = '<html><head></head><body>
<div class="anotherClass">
</div>
<div class="agentContainer">
        <div class="strong">
                Blah Blah Company
        </div>
</div>
<div class="moreDiv">

</div></body></html>';

// Parse the markup into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);

// Select every div.strong nested anywhere inside a div.agentContainer.
$query = '//div[@class="agentContainer"]//div[@class="strong"]';

$entries = $xpath->query($query);
foreach ($entries as $entry) {
        // nodeValue includes the surrounding whitespace, so trim it.
        echo trim($entry->nodeValue) . "\n\n";
}

I get the DOM reply every time I ask a regex question lol. I know DOM is better for straight scraping, but I don't think GoDaddy has DOM enabled on my server, for a very odd and dumb reason. I was trying to get simple_html_dom.php working yesterday and even the examples wouldn't output. Plus, some of the sites I scrape use POST to get data, so I use cURL a lot.
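Whether DOM is actually available on a given host can be checked before ruling it out. A minimal sketch, assuming nothing beyond the standard extension names `dom` and `curl`; the variable names are only for illustration:

```php
<?php
// extension_loaded() reports whether the named extension is available in this PHP build.
$domAvailable  = extension_loaded('dom') && class_exists('DOMDocument');
$curlAvailable = extension_loaded('curl');

echo 'DOM:  ' . ($domAvailable ? 'available' : 'missing') . "\n";
echo 'cURL: ' . ($curlAvailable ? 'available' : 'missing') . "\n";
```

If DOM comes back missing, the regex route discussed in this thread is the fallback.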

but I don't think GoDaddy has DOM enabled on my server

 

Welp, there's your problem... GoDaddy. Anyhow, that's beside the point.

 

$proname1 = preg_match_all('/<div class=["\']agentContainer["\']>.*<div class="strong">(.*?)<\/div>/s', $html, $name1);

 

The 's' modifier makes the . match newline characters as well, so the pattern can span multiple lines. Since I highly doubt you want to keep the matched quote type, use a character class with both quotes in it; this will match either one. After you get the matches out, you just need to run trim on the values in $name1 to remove the line breaks and extra spaces, and you should be good.
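To see the match-then-trim step end to end, here is a small sketch. It is a variant, not the exact pattern above: it uses `\s*` between the two opening tags instead of a greedy `.*`, so each container produces its own match. The sample `$html` and the names `$pattern`, `$matches`, and `$names` are made up for illustration.

```php
<?php
// Sample markup modeled on the snippet from the original question (assumed, not real data).
$html = <<<HTML
<div class="agentContainer">
<div class="strong">
    Blah Blah Company
</div>
</div>
<div class='agentContainer'>
<div class="strong">
    Another Company
</div>
</div>
HTML;

// \s* tolerates any run of spaces, tabs, and newlines between the two opening tags.
// The s modifier (PCRE_DOTALL) lets . match the newlines inside the capture group.
$pattern = '/<div class=["\']agentContainer["\']>\s*<div class="strong">(.*?)<\/div>/s';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the first capture group; trim() strips the surrounding whitespace.
$names = array_map('trim', $matches[1]);

print_r($names);
```

The lazy `(.*?)` stops at the first closing `</div>`, which is what keeps one company per array entry.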

 

FYI, post regex questions in the Regex forum. The m tags are used to link to the PHP Manual; use code or php tags for putting code in.

Thanks primiso, sorry for the wrong tags and placement of the post, but I hear *crickets* in the regex forum. Now this is a PHP question: your code almost got me there, I think. It put all the information into one element of one array. That actually makes it easy to strip and trim, but I still have one long friggin string with all the info I need. It looks like...

 Business 1 1111 Something Blvd #5Sacramento, CA 11111(916) 111-1111 View Agency Details & Map / E-mail this Agency Business 2 2222 Sierra Gate PlazaSomewhere, CA 22222(916) 222-2222 View Agency Details & Map / E-mail this Agency

and so on... Is there a way to chop this up? I would imagine explode would be a nightmare...
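One possible way to chop up a string like this, sketched under the assumption that every listing ends with the same "View Agency Details & Map / E-mail this Agency" link text, is to split on that text with preg_split. The variable names here are hypothetical, and the phone pattern only assumes the "(916) 111-1111" shape seen in the sample.

```php
<?php
// Sample string copied from the post above (two listings run together).
$blob = ' Business 1 1111 Something Blvd #5Sacramento, CA 11111(916) 111-1111 View Agency Details & Map / E-mail this Agency Business 2 2222 Sierra Gate PlazaSomewhere, CA 22222(916) 222-2222 View Agency Details & Map / E-mail this Agency';

// Assumption: every listing ends with this exact link text.
$delimiter = '/\s*View Agency Details & Map \/ E-mail this Agency\s*/';

// PREG_SPLIT_NO_EMPTY drops the empty piece left after the final delimiter.
$listings = preg_split($delimiter, $blob, -1, PREG_SPLIT_NO_EMPTY);
$listings = array_map('trim', $listings);

// Pull the trailing phone number out of each listing.
$phones = [];
foreach ($listings as $listing) {
    if (preg_match('/\(\d{3}\) \d{3}-\d{4}$/', $listing, $m)) {
        $phones[] = $m[0];
    }
}

print_r($listings);
print_r($phones);
```

The street and city still run together ("#5Sacramento") because the tags between them were already gone by the time the text was captured, so capturing each field with its own group at the regex stage would probably be cleaner than splitting afterwards.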

