Can someone please help me create a regexp for the following HTML code?

 

    <div class="item1"><b class="txt_grey">Svensk titel:</b></div>
    <div class="item2">Familjen Macahan</div>

    <div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
	<div class="item1"><b class="txt_grey">Originaltitel:</b></div>
	<div class="item2">How the West Was Won</div>
	<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">

    <div class="item1"><b class="txt_grey">Genre:</b></div>
    <div class="item2"><a href="ds.php?red=prod_category.php&&arg=genre@@@tvserie,,,lang@@@se,,,subsite@@@movies,,,">TV-serie</a><a href="ds.php?red=prod_category.php&&arg=genre@@@,,,lang@@@se,,,subsite@@@movies,,,"></a></div>
    <div class="clfix"></div>
</div>		

    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Underkategori:</b></div>
        <div class="item2">
            Äventyr<br>Kult (60-80-tal)<br>        </div>

        <div class="clfix"></div>
    </div>	
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Produktionsland:</b></div>
        <div class="item2">
                            <a href="ds.php?red=prod_category.php&&arg=genre@@@world,,,cont@@@land_USA,,,lang@@@se,,,subsite@@@movies,,,">USA</a>
                        </div>
        <div class="clfix"></div>

    </div>
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Inspelningsår:</b></div>
        <div class="item2">1977-1979</div>
        <div class="clfix"></div>
    </div>
    <div class="ds_spec_380_2">
        <div class="item1"><b class="txt_grey">Skådespelare:</b></div>

        <div class="item2">
		<a href="ds.php?red=ds_person.php&&arg=id@@@19550,,,lang@@@se,,,subsite@@@movies,,,">James Arness</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@6256,,,lang@@@se,,,subsite@@@movies,,,">Bruce Boxleitner</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@4393,,,lang@@@se,,,subsite@@@movies,,,">Eva Marie Saint</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29166,,,lang@@@se,,,subsite@@@movies,,,">Kathryn Holcomb</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29167,,,lang@@@se,,,subsite@@@movies,,,">William Kirby Cullen</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29168,,,lang@@@se,,,subsite@@@movies,,,">Vicki Schreck</a>    	</div>
        <div class="clfix"></div>
    </div>
    <div class="ds_spec_380_2">

        <div class="item1"><b class="txt_grey">Åldersgräns:</b></div>
        <div class="item2">
		15 år.<br>        </div>

 

I need to get the values for the following labels:

# Svensk titel

# Originaltitel

# Genre

# Underkategori

# Produktionsland

# Inspelningsår

# Skådespelare

# Åldersgräns

 

Thanks in advance...

https://forums.phpfreaks.com/topic/184794-help-with-creating-regexp/

Does this really need to be a job for regex, or would you be open to considering other methods of retrieving that information?

No, it doesn't have to be a regexp. If there is a faster and/or better way to parse the info, I will go for that.

Is it or is it not HTML? If it is then it's a markup language and as such can be parsed like one. Certainly if there's anything more you need to do then using some kind of document model would be the way to go, but since you asked, a simple Regular Expression for the pattern would be...

 

preg_match_all('#<div class="item1"><b class="txt_grey">([^:]+):</b></div>#u', $input, $out);

But I'm not saying that's necessarily the right way to go.
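
For comparison, here is a minimal sketch of the "document model" route mentioned above, using PHP's DOMDocument and DOMXPath. The sample input is a shortened, hypothetical copy of the HTML from the first post, and the XML prolog is a commonly used hint to make libxml read the input as UTF-8. Multi-valued fields separated by <br> would come back as one concatenated string here, so treat this as a starting point only.

```php
<?php
// Hypothetical sample input, shortened from the HTML in the first post.
$html = <<<'HTML'
<div class="ds_spec_380_2">
    <div class="item1"><b class="txt_grey">Originaltitel:</b></div>
    <div class="item2">How the West Was Won</div>
    <div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
    <div class="item1"><b class="txt_grey">Inspelningsår:</b></div>
    <div class="item2">1977-1979</div>
    <div class="clfix"></div>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
// The prolog is only an encoding hint for libxml (assumption: input is UTF-8).
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$data = array();
foreach ($xpath->query('//div[@class="item1"]') as $label) {
    // The value sits in the next sibling <div class="item2">.
    $value = $xpath->query('following-sibling::div[@class="item2"][1]', $label)->item(0);
    if ($value !== null) {
        // Strip surrounding whitespace and the trailing colon from the label.
        $data[rtrim(trim($label->textContent), ':')] = trim($value->textContent);
    }
}
print_r($data);
```

The XPath expressions match the class names exactly, so they would need adjusting if the site adds more classes to those elements.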

Thanks, but I need to get the values that belong to the labels listed in my first post.

Assuming the source looks more or less strictly like that for every film/series, here's a way to do it:

 

<?php
// $html holds the page source
preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
	// $arr[1] is the label, $arr[2] the raw markup of the value
	$data[$arr[1]] = explode('<br>', $arr[2]);
	foreach ($data[$arr[1]] as $key => &$value) {
		$value = trim(strip_tags($value));
		if ($value == '') {
			unset($data[$arr[1]][$key]);
		}
	}
	unset($value); // break the reference left over from the by-reference loop
}
echo '<pre>' . print_r($data, true) . '</pre>';
?>

 

Output:

Array
(
    [Svensk titel] => Array
        (
            [0] => Familjen Macahan
        )

    [Originaltitel] => Array
        (
            [0] => How the West Was Won
        )

    [Genre] => Array
        (
            [0] => TV-serie
        )

    [Underkategori] => Array
        (
            [0] => Äventyr
            [1] => Kult (60-80-tal)
        )

    [Produktionsland] => Array
        (
            [0] => USA
        )

    [Inspelningsår] => Array
        (
            [0] => 1977-1979
        )

    [Skådespelare] => Array
        (
            [0] => James Arness
            [1] => Bruce Boxleitner
            [2] => Eva Marie Saint
            [3] => Kathryn Holcomb
            [4] => William Kirby Cullen
            [5] => Vicki Schreck
        )

    [Åldersgräns] => Array
        (
            [0] => 15 år.
        )

)

I'm really grateful for your help.  :D

It seems that I can't get your code to work. I'm only getting this:

Array
(
)

 

This is the code I'm using:

$html = 'http://www.discshop.se/shop/ds_produkt.php?lang=&id=76317&lang=se&subsite=movies&&ref='; 

preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
   $data[$arr[1]] = explode('<br>', $arr[2]);
   foreach ($data[$arr[1]] as $key => &$value) {
      $value = trim(strip_tags($value));
      if ($value == '') {
         unset($data[$arr[1]][$key]);
      }
   }
}
echo '<pre>' . print_r($data, true) . '</pre>';

 

The info I'm after is at the bottom of the page above (see screen).

 

[attachment deleted by admin]

You need to read the contents of the file into the variable. Either use file_get_contents() or cURL. And I noticed that the site uses <br/> instead of <br> for line breaks, so you'll have to change that in the code (the parameter for explode() inside the foreach loop).
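
A rough sketch of that advice, under a few assumptions: parse_specs() is a hypothetical helper name, the live fetch is left commented out because it needs allow_url_fopen and network access, and the split uses preg_split() instead of explode() so that <br>, <br/> and <br /> are all handled without editing the code again.

```php
<?php
// Hypothetical helper: extracts label => list-of-values pairs from the page source.
function parse_specs($html)
{
    preg_match_all(
        '~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    $data = array();
    foreach ($matches as $m) {
        // Split on <br>, <br/> and <br /> alike.
        $parts = preg_split('~<br\s*/?>~i', $m[2]);
        $parts = array_map(function ($v) {
            return trim(strip_tags($v));
        }, $parts);
        // Drop empty entries and reindex from 0.
        $data[$m[1]] = array_values(array_filter($parts, 'strlen'));
    }
    return $data;
}

// Fetching the live page (requires allow_url_fopen); $url is the product page address:
// $html = file_get_contents($url);
// if ($html === false) { die('Could not fetch the page'); }

// Hypothetical sample input using the <br/> style the live site is said to use.
$sample = '<div class="item1"><b class="txt_grey">Underkategori:</b></div>
<div class="item2">Äventyr<br/>Kult (60-80-tal)<br />   </div>';
print_r(parse_specs($sample));
```

cURL would work just as well for the fetch; only the way $html gets filled changes.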

Now it works much better. ;)

Thanks!

Hi again...

Is there a quick and easy way to show the output in a nicer/better way?  :shy:

Sure, you can e.g. output the data in a table (in its most simple form in this example):

 

<?php
echo '<table>';
foreach ($data as $key => $val) {
	// one row per label; multiple values are joined with line breaks
	echo "\n\t<tr>\n\t\t<td>$key</td><td>" . implode('<br />', $val) . "</td>\n\t</tr>";
}
echo "\n</table>";
?>

Woow.. thanks.

I can't thank you enough.

 

One last thing tho..  :-[

How do I get the image (both thumbnail and full size) and the price (299:-) WITHOUT the ":-" sign, from the same URL?

And is it possible to have several regexps in the same "instance", or do I need to create a whole new one?

 

Here is the html:

<img style="vertical-align: bottom;" src="http://www.discshop.se/shop/img/omslag/front_normal/7/76317.jpg" class="reflected" alt="Familjen Macahan - Säsong 1 (4-disc)" border="0" height="170" hspace="0" vspace="0" width="120"><canvas width="120" height="34" style="height: 34px; width: 120px;"></canvas></div></a></div>

<a href="javascript:void(0);" onclick="window.open('coverview.php?id=76317&side=front','bakom','height=350,width=500,status=no,toolbar=no,directories=no,menubar=no,location=no,resizable=yes,scrollbars=no');">Visa stor framsida</a><br>

                    </div>

<div style="margin-bottom: 10px;">

<div style="margin-bottom: 5px;">

<span class="price " style="margin-bottom: 5px;"><span class="price_normal">299:-</span>

I would probably do that with two separate patterns:

 

<?php
// get the image file name (the <img> tag may carry other attributes before src)
if (preg_match('~<img\b[^>]*\bsrc="http://www\.discshop\.se/shop/img/omslag/front_normal/([^"]+)"~i', $html, $match)) {
	// build image URLs
	$thumb = 'http://www.discshop.se/shop/img/omslag/front_normal/' . $match[1];
	$full = 'http://www.discshop.se/shop/img/omslag/front_large/' . $match[1];
}
// get price (everything before the ":-" suffix)
if (preg_match('~<span class="price "[^>]*><span\b[^>]*>([^:]+):~i', $html, $match)) {
	$price = $match[1];
}
?>
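
If the price is going to be compared or sorted later, it may be handier as a number. A small sketch, where extract_price() is a hypothetical helper and the pattern assumes the price always sits in a span with class "price_normal" as in the snippet posted above:

```php
<?php
// Hypothetical helper: returns the price as an integer, or null if no price markup is found.
function extract_price($html)
{
    // Capture only the digits in front of the ":-" suffix.
    if (preg_match('~<span class="price_normal">\s*(\d+)\s*:-~', $html, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Sample input copied from the HTML posted in this thread.
$sample = '<span class="price " style="margin-bottom: 5px;"><span class="price_normal">299:-</span>';
var_dump(extract_price($sample));
```

Returning null instead of an empty string makes a missing price easy to tell apart from a real value.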

  • 2 weeks later...

Hi again...

 

What if I want to show (parse) search results from multiple sites?  :shy:

 

Site 1

HTML-output:

<th colspan="5">Filmtitel - 10 träffar</th>

 

Site 2

HTML-output:

<div style="margin: 8px; float: right; color: rgb(102, 102, 102);"><b>10 träffar</b></div>

 

Site 3

HTML-output:

<span id="ctl00_ContentPlaceHolder1_m_nbrofHitsOnWhat">Din sökning på "scrubs" resulterade i 8 träffar</span>

 

Site 4

HTML-output:

Visar 1 - 2 av <strong>2</strong> annonser

 

Site 5

HTML-output:

<strong>Dvd- & vhs-filmer (1)</strong>

 

Site 6

HTML-output:

<span id="SearchResultMessage">

                  9

                    objekt hittade

                    för "scrubs" i kategorin DVD & Videofilmer</span>

It's impossible unless you provide a specific set of rules for each individual site. There is no possible way a computer could work out which part is the correct number from a single pattern.
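
To illustrate what "a specific set of rules for each individual site" could look like, here is a sketch with one pattern per site. The keys site1 to site6 are hypothetical and simply follow the numbering of the snippets above, each pattern is written against the single snippet posted for that site, and the script is assumed to be saved in the same encoding as the pages it reads (otherwise "träffar" will not match).

```php
<?php
// Hypothetical site keys; each pattern targets the snippet posted for that site.
$patterns = array(
    'site1' => '~<th colspan="5">[^<]*?(\d+) träffar</th>~',
    'site2' => '~<b>(\d+) träffar</b>~',
    'site3' => '~resulterade i (\d+) träffar~',
    'site4' => '~av <strong>(\d+)</strong> annonser~',
    'site5' => '~<strong>[^<(]*\((\d+)\)</strong>~',
    'site6' => '~<span id="SearchResultMessage">\s*(\d+)\s+objekt hittade~',
);

// Returns the number of hits for a site, or null if the site is unknown or nothing matched.
function hit_count($site, $html, array $patterns)
{
    if (isset($patterns[$site]) && preg_match($patterns[$site], $html, $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(hit_count('site4', 'Visar 1 - 2 av <strong>2</strong> annonser', $patterns));
```

Keeping the patterns in one array means a new site only needs one new entry, not a new code path.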

Yes, I know that I somehow need to write a regexp for each site.

But after that, what is the best/easiest way to show all the results?

The same way you're currently doing it, only adding all the results in. Regardless of the pattern used, I assume you are going to be collecting the same information from each one (i.e. bus number, place, etc.), so you will have the same number of capture groups. You can simply use array_merge to combine all the result sets into one large array, then loop through that array in the same manner you are currently using. If you find that the capture groups aren't in the same position for every site, you can use named capture groups so that the information can still be iterated through easily.
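
A minimal sketch of that idea. The two sites, their HTML and the group names title and price are all made up for illustration; the point is only that the groups appear in a different order per site, yet the named groups let both result sets be merged and looped over the same way.

```php
<?php
// Hypothetical sites with made-up markup; both patterns use the same group names.
$sites = array(
    'siteA' => array(
        'html'    => '<li><b>Scrubs 1</b> 199:-</li><li><b>Scrubs 2</b> 249:-</li>',
        'pattern' => '~<b>(?P<title>[^<]+)</b>\s*(?P<price>\d+):-~',
    ),
    'siteB' => array(
        'html'    => '<td class="p">149 kr</td><td class="t">Scrubs 3</td>',
        'pattern' => '~<td class="p">(?P<price>\d+) kr</td><td class="t">(?P<title>[^<]+)</td>~',
    ),
);

$all = array();
foreach ($sites as $name => $site) {
    preg_match_all($site['pattern'], $site['html'], $matches, PREG_SET_ORDER);
    $rows = array();
    foreach ($matches as $m) {
        // Named groups give the same keys no matter where the group sits in the pattern.
        $rows[] = array('site' => $name, 'title' => $m['title'], 'price' => (int) $m['price']);
    }
    // Append this site's rows to the combined result set.
    $all = array_merge($all, $rows);
}
print_r($all);
```

The combined $all array can then be printed with the same kind of table loop used earlier in the thread.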
