Help with creating regexp

Guldstrand · December 11, 2009

Can someone please help me create a regexp for the following HTML code?

    <div class="item1"><b class="txt_grey">Svensk titel:</b></div>
    <div class="item2">Familjen Macahan</div>

    <div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
	<div class="item1"><b class="txt_grey">Originaltitel:</b></div>
	<div class="item2">How the West Was Won</div>
	<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">

    <div class="item1"><b class="txt_grey">Genre:</b></div>
    <div class="item2"><a href="ds.php?red=prod_category.php&&arg=genre@@@tvserie,,,lang@@@se,,,subsite@@@movies,,,">TV-serie</a><a href="ds.php?red=prod_category.php&&arg=genre@@@,,,lang@@@se,,,subsite@@@movies,,,"></a></div>
    <div class="clfix"></div>
</div>		

    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Underkategori:</b></div>
        <div class="item2">
            Äventyr<br>Kult (60-80-tal)<br>        </div>

        <div class="clfix"></div>
    </div>	
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Produktionsland:</b></div>
        <div class="item2">
                            <a href="ds.php?red=prod_category.php&&arg=genre@@@world,,,cont@@@land_USA,,,lang@@@se,,,subsite@@@movies,,,">USA</a>
                        </div>
        <div class="clfix"></div>

    </div>
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Inspelningsår:</b></div>
        <div class="item2">1977-1979</div>
        <div class="clfix"></div>
    </div>
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Skådespelare:</b></div>

        <div class="item2">
		<a href="ds.php?red=ds_person.php&&arg=id@@@19550,,,lang@@@se,,,subsite@@@movies,,,">James Arness</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@6256,,,lang@@@se,,,subsite@@@movies,,,">Bruce Boxleitner</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@4393,,,lang@@@se,,,subsite@@@movies,,,">Eva Marie Saint</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29166,,,lang@@@se,,,subsite@@@movies,,,">Kathryn Holcomb</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29167,,,lang@@@se,,,subsite@@@movies,,,">William Kirby Cullen</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29168,,,lang@@@se,,,subsite@@@movies,,,">Vicki Schreck</a>    	</div>
        <div class="clfix"></div>
    </div>
    <div class="ds_spec_380_2">

        <div class="item1"><b class="txt_grey">Åldersgräns:</b></div>
        <div class="item2">
		15 år.<br>        </div>

I need to get the following:

# Svensk titel

# Originaltitel

# Genre

# Underkategori

# Produktionsland

# Inspelningsår

# Skådespelare

# Åldersgräns

Thanks in advance...

salathe · December 11, 2009

Does this really need to be a job for regex, or would you be open to considering other methods of retrieving that information?

Guldstrand · December 11, 2009

No, it doesn't have to be regexp. If there is a faster and/or better way to parse the info, I will go for that.

nozai · December 11, 2009

I would look into an XML parser--simplexml_load_file, perhaps?

Guldstrand · December 11, 2009

This isn't an XML document.

cags · December 11, 2009

Is it or is it not HTML? If it is then it's a markup language and as such can be parsed like one. Certainly if there's anything more you need to do then using some kind of document model would be the way to go, but since you asked, a simple Regular Expression for the pattern would be...

preg_match_all('#<div class="item1"><b class="txt_grey">([^:]+):</b></div>#u', $input, $out);

But I'm not saying that's necessarily the right way to go.

nozai · December 11, 2009

Touché, simplexml won't read it directly, but try:

$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true); // keeps loadHTML() from printing warnings about sloppy markup
$doc->loadHTML($text);
$xml = simplexml_import_dom($doc);

where $text contains your HTML document.
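
To make the DOM suggestion a bit more concrete, here is a minimal sketch that reads the label/value pairs with DOMXPath instead of SimpleXML. The function name parseSpecRows() is made up for this example, and it assumes that each `item1` div is followed by a sibling `item2` div, as in the HTML from the first post. The prepended meta tag is only a hint so that UTF-8 input is not read as Latin-1; it assumes $text really is UTF-8.

```php
<?php
// Sketch only: parseSpecRows() is a hypothetical helper, not code from the thread.
function parseSpecRows($text)
{
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    // Assumption: $text is UTF-8. The meta tag tells the HTML parser so.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $data = array();
    foreach ($xpath->query('//div[@class="item1"]') as $label) {
        // Assumption: the value lives in the next sibling <div class="item2">.
        $value = $xpath->query('following-sibling::div[@class="item2"][1]', $label)->item(0);
        if ($value === null) {
            continue;
        }
        // "Genre:" becomes "Genre"
        $key = rtrim(trim($label->textContent), ':');
        $items = array();
        // Every non-empty text node is one value: bare text between <br> tags or a link caption.
        foreach ($xpath->query('.//text()', $value) as $node) {
            $item = trim($node->nodeValue);
            if ($item !== '') {
                $items[] = $item;
            }
        }
        $data[$key] = $items;
    }
    return $data;
}
```

Calling print_r(parseSpecRows($text)) should give one array of values per label. Since the values come from text nodes, it makes no difference whether the page writes `<br>` or `<br/>`.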

Guldstrand · December 11, 2009

Thanks, but I need to get the values that belong to the labels listed in my first post.

cags · December 11, 2009

It returns these...

Svensk titel
Originaltitel
Genre
Underkategori
Produktionsland
Inspelningsår
Skådespelare
Åldersgräns

...which as far as I can tell is exactly what you asked for.

thebadbad · December 12, 2009

Assuming the source looks more or less strictly like that for every film/series, here's a way to do it:

<?php
// $html holds the page source
preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
	// $arr[1] is the label, $arr[2] the raw markup of the value(s), separated by <br>
	$data[$arr[1]] = explode('<br>', $arr[2]);
	foreach ($data[$arr[1]] as $key => &$value) {
		// keep only the text, and drop entries that end up empty
		$value = trim(strip_tags($value));
		if ($value == '') {
			unset($data[$arr[1]][$key]);
		}
	}
	unset($value); // break the reference left over from the by-reference loop
}
echo '<pre>' . print_r($data, true) . '</pre>';
?>

Output:

Array
(
    [Svensk titel] => Array
        (
            [0] => Familjen Macahan
        )

    [Originaltitel] => Array
        (
            [0] => How the West Was Won
        )

    [Genre] => Array
        (
            [0] => TV-serie
        )

    [Underkategori] => Array
        (
            [0] => Äventyr
            [1] => Kult (60-80-tal)
        )

    [Produktionsland] => Array
        (
            [0] => USA
        )

    [Inspelningsår] => Array
        (
            [0] => 1977-1979
        )

    [Skådespelare] => Array
        (
            [0] => James Arness
            [1] => Bruce Boxleitner
            [2] => Eva Marie Saint
            [3] => Kathryn Holcomb
            [4] => William Kirby Cullen
            [5] => Vicki Schreck
        )

    [Åldersgräns] => Array
        (
            [0] => 15 år.
        )

)

Guldstrand · December 13, 2009

I'm really grateful for your help.

I can't seem to get your code to work, though. I'm only getting this:

Array
(

)

This is the code I'm using:

$html = 'http://www.discshop.se/shop/ds_produkt.php?lang=&id=76317&lang=se&subsite=movies&&ref='; 

preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
   $data[$arr[1]] = explode('<br>', $arr[2]);
   foreach ($data[$arr[1]] as $key => &$value) {
      $value = trim(strip_tags($value));
      if ($value == '') {
         unset($data[$arr[1]][$key]);
      }
   }
}
echo '<pre>' . print_r($data, true) . '</pre>';

The info I'm after is at the bottom of the page above.

thebadbad · December 13, 2009

You need to read the contents of the file into the variable. Either use file_get_contents() or cURL. And I noticed that the site uses <br/> instead of <br> for line breaks, so you'll have to change that in the code (the parameter for explode() inside the foreach loop).
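
A rough sketch of both changes together, in case it helps. The function name extractSpecs() is made up for this example. Instead of swapping the explode() parameter, it splits on a small pattern that accepts `<br>`, `<br/>` and `<br />`, so it should keep working whichever form the page uses. The fetch at the bottom is left commented out; it assumes $url holds the product page address and that allow_url_fopen is enabled.

```php
<?php
// Sketch only: extractSpecs() is a hypothetical name; the main pattern is the one from the thread.
function extractSpecs($html)
{
    preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
    $data = array();
    foreach ($matches as $arr) {
        // Split on <br>, <br/> and <br /> alike.
        $parts = preg_split('~<br\s*/?>~i', $arr[2]);
        // Strip the remaining tags (links etc.) and surrounding whitespace.
        $parts = array_map('trim', array_map('strip_tags', $parts));
        // Drop empty entries and renumber the rest from 0.
        $parts = array_filter($parts, function ($s) {
            return $s !== '';
        });
        $data[$arr[1]] = array_values($parts);
    }
    return $data;
}

// Read the page itself into $html, not just its address:
// $html = file_get_contents($url);
// if ($html !== false) {
//     echo '<pre>' . print_r(extractSpecs($html), true) . '</pre>';
// }
```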

Guldstrand · December 13, 2009

Now it works much better.

Thanks!

Guldstrand · December 16, 2009

Hi again...

Is there a quick and easy way to show the output in a nicer/better way? :shy:

thebadbad · December 16, 2009

Sure, you can e.g. output the data in a table (in its most simple form in this example):

<?php
echo '<table>';
foreach ($data as $key => $val) {
	echo "\n\t<tr>\n\t\t<td>$key</td><td>" . implode('<br />', $val) . "</td>\n\t</tr>";
}
echo "\n</table>";
?>

Guldstrand · December 16, 2009

Woow.. thanks.

I can't thank you enough.

One last thing though.. :-[

How do I get the image (both thumb and full size) and the price (299:-) WITHOUT the ":-" sign, from the same URL?

And is it possible to have more than one regexp in the same "instance", or do I need to create a whole new one?

Here is the html:

<img style="vertical-align: bottom;" src="http://www.discshop.se/shop/img/omslag/front_normal/7/76317.jpg" class="reflected" alt="Familjen Macahan - Säsong 1 (4-disc)" border="0" height="170" hspace="0" vspace="0" width="120"><canvas width="120" height="34" style="height: 34px; width: 120px;"></canvas></div></a></div>
<a href="javascript:void(0);" onclick="window.open('coverview.php?id=76317&side=front','bakom','height=350,width=500,status=no,toolbar=no,directories=no,menubar=no,location=no,resizable=yes,scrollbars=no');">Visa stor framsida</a><br>

</div>

<div style="margin-bottom: 10px;">

<div style="margin-bottom: 5px;">

<span class="price " style="margin-bottom: 5px;"><span class="price_normal">299:-</span>

thebadbad · December 16, 2009

I would probably do that with two separate patterns:

<?php
//get image link(s); [^>]* allows other attributes (e.g. style) before src
preg_match('~<img\b[^>]*\bsrc="http://www\.discshop\.se/shop/img/omslag/front_normal/([^"]+)"~i', $html, $match);
//build image URLs
$thumb = 'http://www.discshop.se/shop/img/omslag/front_normal/' . $match[1];
$full = 'http://www.discshop.se/shop/img/omslag/front_large/' . $match[1];
//get price (everything before the ":-")
preg_match('~<span class="price "[^>]*><span\b[^>]*>([^:]+):~i', $html, $match);
$price = $match[1];
?>

Guldstrand · December 26, 2009

Hi again...

What if I want to show (parse) search results from multiple sites? :shy:

Site 1

HTML-output:

<th colspan="5">Filmtitel - 10 träffar</th>

Site 2

HTML-output:

<div style="margin: 8px; float: right; color: rgb(102, 102, 102);"><b>10 träffar</b></div>

Site 3

HTML-output:

<span id="ctl00_ContentPlaceHolder1_m_nbrofHitsOnWhat">Din sökning på "scrubs" resulterade i 8 träffar</span>

Site 4

HTML-output:

Visar 1 - 2 av <strong>2</strong> annonser

Site 5

HTML-output:

<strong>Dvd- & vhs-filmer (1)</strong>

Site 6

HTML-output:

<span id="SearchResultMessage">
9

objekt hittade

för "scrubs" i kategorin DVD & Videofilmer</span>

Guldstrand · December 29, 2009

*bump*

cags · December 29, 2009

It's impossible unless you provide a specific set of rules for each individual site. There is no possible way a computer could work out which part is the correct number from a single pattern.
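
As a rough illustration of what "a set of rules for each site" could look like: one pattern per site, each capturing the number of hits in a group named `hits`. The keys site1 to site6 are placeholders, countHits() is a made-up helper, and the patterns are guesses based only on the six snippets posted above, so they would need checking against the real pages.

```php
<?php
// Sketch only: keys and patterns are assumptions derived from the posted snippets.
$patterns = array(
    'site1' => '~Filmtitel - (?<hits>\d+) träffar~u',
    'site2' => '~<b>(?<hits>\d+) träffar</b>~u',
    'site3' => '~resulterade i (?<hits>\d+) träffar~u',
    'site4' => '~av <strong>(?<hits>\d+)</strong> annonser~',
    'site5' => '~<strong>[^<(]*\((?<hits>\d+)\)</strong>~',
    'site6' => '~<span id="SearchResultMessage">\s*(?<hits>\d+)\s+objekt hittade~',
);

// Returns the hit count as an integer, or null if the site is unknown or nothing matched.
function countHits(array $patterns, $site, $html)
{
    if (!isset($patterns[$site]) || !preg_match($patterns[$site], $html, $m)) {
        return null;
    }
    return (int) $m['hits'];
}
```

Because every pattern uses the same group name, the calling code does not need to know which site a result came from.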

Guldstrand · December 29, 2009

Yes, I know that I somehow need to write a regexp for each site.

But after that, how can I show all the results in the best/easiest way?

cags · December 29, 2009

The same way you're currently doing it, only adding all the results in. Regardless of the pattern used, I assume you are going to be collecting the same pieces of information from each one. Therefore you will have the same number of capture groups. You can use array_merge to combine all the result sets into one large array, then loop through that array in the same manner you are using now. If you find that the capture groups aren't in the same place, you can use named capture groups so that the information can still be easily iterated through.
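
A small sketch of that idea, with invented data: the two patterns, the sample markup and the helper name collectRows() are all made up for illustration. Site A puts the title before the price and site B the other way round, but since both patterns name their groups `title` and `price`, the merged array can be looped through in one go.

```php
<?php
// Sketch only: patterns, markup and collectRows() are hypothetical.
function collectRows($pattern, $html, $site)
{
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $rows = array();
    foreach ($matches as $m) {
        // Named groups are read by name, so their position in the pattern does not matter.
        $rows[] = array(
            'site'  => $site,
            'title' => trim($m['title']),
            'price' => (int) $m['price'],
        );
    }
    return $rows;
}

$htmlA = '<li><b>Scrubs 1</b> 199:-</li><li><b>Scrubs 2</b> 249:-</li>';
$htmlB = '<td>299:-</td><td>Scrubs 3</td>';

$rowsA = collectRows('~<b>(?<title>[^<]+)</b>\s*(?<price>\d+):-~', $htmlA, 'A');
$rowsB = collectRows('~<td>(?<price>\d+):-</td><td>(?<title>[^<]+)</td>~', $htmlB, 'B');

// array_merge() appends numerically indexed arrays and renumbers the keys.
$all = array_merge($rowsA, $rowsB);

foreach ($all as $row) {
    echo $row['site'] . ': ' . $row['title'] . ' (' . $row['price'] . ")\n";
}
```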
