I have code:

$proname1 = preg_match_all('/<div class=("|\')agentContainer("|\')>(\n\s)<div class="strong">(\n\s)(.*?)(\n\s)<\/div>/', $html, $name1);

This puts everything between these tags into an array, but the content contains newlines and whitespace, so the array ends up with empty entries. How do I strip the whitespace and newlines before the data gets to the array? The data I'm getting looks like...

 

<div class="agentContainer">
<div class="strong">
	Blah Blah Company
</div>

And "Blah Blah Company" isn't showing up in the array, but I know the regex is working.

If you have valid HTML (if not, make it valid), use the DOM. It makes parsing HTML and pulling out items much easier than using regex, and it does not require attributes to appear in a specific order or position.

Since I was interested in making it work for my own purposes, here is how it could be done:

 

<?php
$string = '<html><head></head><body>
<div class="anotherClass">
</div>
<div class="agentContainer">
        <div class="strong">
                Blah Blah Company
        </div>
</div>
<div class="moreDiv">

</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);

//$divs = $doc->getElementsByTagName('div');
// Match any "strong" div nested anywhere inside an "agentContainer" div.
$query = '//div[@class="agentContainer"]//div[@class="strong"]';

$entries = $xpath->query($query);
foreach ($entries as $entry) {
        echo trim($entry->nodeValue) . "\n\n";
}

I get the DOM reply every time I ask a regex question lol. I know DOM is better for straight scraping, but I don't think GoDaddy has DOM enabled on my server, for some very odd and dumb reason. I was trying to get simple_html_dom.php working yesterday and even the examples wouldn't output anything. Plus, some of the sites I scrape use POST to get data, so I use cURL a lot.

but I don't think GoDaddy has DOM enabled on my server

 

Welp, there's your problem... GoDaddy. Anyhow, that's beside the point.

 

$proname1 = preg_match_all('/<div class=["\']agentContainer["\']>.*<div class="strong">(.*?)<\/div>/s', $html, $name1);

 

The 's' modifier makes the . match newline characters as well, which it does not do by default, so the pattern can match across the line breaks. Since I highly doubt you want to capture which quote type was matched, use a character class (square brackets) with both quotes in it; this matches either one. After you get the matches out, you just need to run trim on the values in $name1 to remove the line breaks and extra spaces, and you should be good.
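
For reference, here is a minimal sketch of that approach, assuming the page contains several agentContainer blocks. The sample $html and the $names variable are made up for illustration. Note one deliberate change from the pattern above: the first .* is written as a lazy .*? so that each match stays inside one block instead of running on to the last "strong" div. Treat it as a starting point rather than a tested solution for the real page.

```php
<?php
// Hypothetical sample input; the real $html would come from the scraped page.
$html = '<div class="agentContainer">
<div class="strong">
	Blah Blah Company
</div>
</div>
<div class=\'agentContainer\'>
<div class="strong">
	Another Company
</div>
</div>';

// The "s" modifier lets . match newlines. The lazy .*? (an assumption, not in
// the pattern above) stops at the first "strong" div after each container.
preg_match_all(
    '/<div class=["\']agentContainer["\']>.*?<div class="strong">(.*?)<\/div>/s',
    $html,
    $name1
);

// $name1[1] holds the text captured by the only capturing group;
// trim() strips the leading and trailing newlines, tabs and spaces.
$names = array_map('trim', $name1[1]);

print_r($names);
```

With the sample input, $names should end up holding the two company names with the surrounding whitespace removed.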

 

FYI, post regex questions in the Regex forum. The m tags are used to link to the PHP Manual; use code or php tags for posting code.

Thanks primiso, sorry for the wrong tags and placement of the post, but I hear *crickets* in the regex forum. Now this is a PHP question. Your code almost got me there, I think. It put all the information into one element of one array. This actually makes it easy to strip and trim, but I still have one long friggin string with all the info I need. It looks like...

 Business 1 1111 Something Blvd #5Sacramento, CA 11111(916) 111-1111 View Agency Details & Map / E-mail this Agency Business 2 2222 Sierra Gate PlazaSomewhere, CA 22222(916) 222-2222 View Agency Details & Map / E-mail this Agency

and so on... Is there a way to chop this up? I would imagine explode would be a nightmare...
