How to find multiple things with preg_match_all


Lukeidiot


<tr><td></td><td colspan=2>
<table border=0 cellpadding=0 cellspacing=0 width="100%"><tr><td class="rules"><LI>Supported file types are: GIF, JPG, PNG 
<LI>Maximum file size allowed is 2048 KB. 
<LI>Images greater than 250x250 pixels will be thumbnailed. 
<LI>Read the <a href="http://www.4chan.org/rules#b">rules</a> and <a href="http://www.4chan.org/faq">FAQ</a> before posting. 

<LI><img src="http://static.4chan.org/image/jpn-flag.jpg" width="17" height="11"> <a href="http://www.4chan.org/japanese">このサイトについて</a> - 
<a href="http://www.nifty.com/globalgate/">翻訳</a><LI>Currently <b>2786</b> unique user posts.</td><td align="right" valign="center"></td></tr></table></td></tr></table></form></div><hr>
<script>with(document.post) {name.value=get_cookie("4chan_name"); email.value=get_cookie("4chan_email"); pwd.value=get_pass("4chan_pass"); }</script>
<form name="delform" action="http://sys.4chan.org/b/imgboard.php" method=POST><span class="filesize">File : <a href="http://images.4chan.org/b/src/1326189762932.jpg" target="_blank">1326189762.jpg</a>-(32 KB, 555x691)</span><br><a href="http://images.4chan.org/b/src/1326189762932.jpg" target=_blank><img src=http://1.thumbs.4chan.org/b/thumb/1326189762932s.jpg border=0 align=left width=202 height=251 hspace=20 alt="32 KB" md5="VNdi/JU72ZjPDqPFj8GimQ=="></a><a name="0"></a>
<input type=checkbox name="373301167" value=delete><span class="filetitle"></span> 

<span class="postername">Anonymous</span> <span class="posttime">01/10/12(Tue)05:02:42</span> <span id="nothread373301167"><a href="res/373301167#373301167" class="quotejs">No.</a><a href="res/373301167#q373301167" class="quotejs">373301167</a>    [<a href="res/373301167">Reply</a>]</span>
<blockquote>so i just lost my virginity and i didnt last very long at all&#44; like maybe a minute. is it just because its my first time&#44; or do i have a problem.</blockquote><a name="373301257"></a>

<table><tr><td nowrap class="doubledash">>></td><td id="373301257" class="reply">
<input type=checkbox name="373301257" value=delete><span class="replytitle"></span> 
<span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:26 <span id="norep373301257"><a href="res/373301167#373301257" class="quotejs">No.</a><a href="res/373301167#q373301257" class="quotejs">373301257</a></span><br>     <span class="filesize">File<a href="http://images.4chan.org/b/src/1326189806323.jpg" target="_blank">1326189806.jpg</a>-(10 KB, 320x240)</span><br><a href="http://images.4chan.org/b/src/1326189806323.jpg" target=_blank><img src=http://0.thumbs.4chan.org/b/thumb/1326189806323s.jpg border=0 align=left width=126 height=95 hspace=20 alt="10 KB" md5="Iy9EbXzWXglnvuLciY0jUg=="></a><blockquote>http://golink.us/mll/vkap</blockquote></td></tr></table>
<a name="373301298"></a>
<table><tr><td nowrap class="doubledash">>></td><td id="373301298" class="reply">
<input type=checkbox name="373301298" value=delete><span class="replytitle"></span> 
<span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:52 <span id="norep373301298"><a href="res/373301167#373301298" class="quotejs">No.</a><a href="res/373301167#q373301298" class="quotejs">373301298</a></span><blockquote>what are you&#44; 14?...<br />YES it's normal.</blockquote></td></tr></table>

<a name="373301334"></a>
<table><tr><td nowrap class="doubledash">>></td><td id="373301334" class="reply">
<input type=checkbox name="373301334" value=delete><span class="replytitle"></span> 
<span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:04:08 <span id="norep373301334"><a href="res/373301167#373301334" class="quotejs">No.</a><a href="res/373301167#q373301334" class="quotejs">373301334</a></span><blockquote>maybe that guy in ur ass was the one having a problem</blockquote></td></tr></table>
<br clear=left><hr>
<span class="filesize">File : <a href="http://images.4chan.org/b/src/1326189185164.png" target="_blank">1326189185.png</a>-(60 KB, 591x638)</span><br><a href="http://images.4chan.org/b/src/1326189185164.png" target=_blank><img src=http://1.thumbs.4chan.org/b/thumb/1326189185164s.jpg border=0 align=left width=233 height=251 hspace=20 alt="60 KB" md5="46ODgCAcN48NJ6Nvh+I7gg=="></a><a name="0"></a>
<input type=checkbox name="373300183" value=delete><span class="filetitle"></span> 
<span class="postername">Anonymous</span> <span class="posttime">01/10/12(Tue)04:53:05</span> <span id="nothread373300183"><a href="res/373300183#373300183" class="quotejs">No.</a><a href="res/373300183#q373300183" class="quotejs">373300183</a>    [<a href="res/373300183">Reply</a>]</span>

<blockquote>Get in here!<br /><br />Let's chill and have fun.<br /><br />Don't come in if you're going to be boringly quiet. <br /><br />Be lively!</blockquote><span class="omittedposts">15 posts and 8 image replies omitted. Click Reply to view.</span>
<a name="373301286"></a>
<table><tr><td nowrap class="doubledash">>></td><td id="373301286" class="reply">
<input type=checkbox name="373301286" value=delete><span class="replytitle"></span> 
<span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:48 <span id="norep373301286"><a href="res/373300183#373301286" class="quotejs">No.</a><a href="res/373300183#q373301286" class="quotejs">373301286</a></span><blockquote>anthonyismelo</blockquote></td></tr></table>
<a name="373301297"></a>

<table><tr><td nowrap class="doubledash">>></td><td id="373301297" class="reply">
<input type=checkbox name="373301297" value=delete><span class="replytitle"></span> 
<span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:51 <span id="norep373301297"><a href="res/373300183#373301297" class="quotejs">No.</a><a href="res/373300183#q373301297" class="quotejs">373301297</a></span><blockquote>puppiie1<br />add me</blockquote></td></tr></table>
<a name="373301329"></a>

 

Basically, when you search for "boringly" (or any other word or words in the thread), I need it to pull the ID of the thread it was posted in. Here is a visual as well (the source code of the webpage I am searching is above).

 

[Image: 9Qj7e.png]


Because I don't know how you are handling the keyword(s) to look for, or how you want to handle the results, you may need to tweak this to suit your needs, but here is a function to get you started, based on the content you provided.

 

/**
* get_ids_by_keyword returns a list of post IDs (thread starters and replies)
* based on the given keywords
*
* @param string $content   is the content to scrape
* @param mixed  $keyWords  is a string (for a single keyword) or an array of 
*                          keywords to search for. 
* @param bool   $pMatch    is a flag to determine whether to match partial
*                          words. Default is full-word match
* @param bool   $cs        is a flag to determine whether the search should
*                          be case-sensitive. Default is case-insensitive
* @return mixed If results are found, a multi-dim associative array is returned,
*               following this format: keyword => (id => instances).
*               NOTE: if $cs==false, keyword is returned lowercased.
*               If no keywords were matched, false is returned. 
*/
function get_ids_by_keyword ($content, $keyWords, $pMatch = false, $cs = false) {
  if (!is_array($keyWords)) $keyWords = array($keyWords);
  // escape regex metacharacters (and the ~ delimiter) so keywords match literally
  foreach ($keyWords as $k => $kw) $keyWords[$k] = preg_quote($kw, '~');
  $keyWords = implode('|', $keyWords);
  $results = array();
  $b = ($pMatch == false) ? '\b' : '';
  $c = ($cs == false) ? 'i' : '';

  // $posts[1] holds each post ID, $posts[2] the matching post body
  preg_match_all('~<span class="(?:comment)?postername">.*?<span id="no(?:thread|rep)(\d+)">.*?<blockquote>(.*?)</blockquote>~s',$content,$posts);

  foreach ($posts[2] as $i => $post) {
    if ( preg_match_all('~'.$b.'('.$keyWords.')'.$b.'~'.$c, $post, $keyWordFound) ) {
      $cid = $posts[1][$i];
      foreach ($keyWordFound[1] as $kwf) {
        if ($c) $kwf = strtolower($kwf);
        // count instances of this keyword within this post
        $results[$kwf][$cid] = ( isset($results[$kwf][$cid]) ) ? $results[$kwf][$cid] + 1 : 1;
      }
    }
  }
  return ( count($results)>0 ) ? $results : false;
} // end get_ids_by_keyword

 

Basically, you call the function, passing the content to be scraped and the keyword(s) you want to look for. It returns a multi-dimensional associative array of the keyword(s) found, the ID of each post (including replies) the keyword was found in, and how many instances of the keyword were found in that post.
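As a rough sketch of how that return value might be consumed: the helper name summarize_results and the hard-coded sample array below are made up for illustration, and the array simply mirrors the keyword => (id => instances) shape described above.

```php
<?php
// Hypothetical helper: flattens the keyword => (id => instances) array into
// printable lines. Accepts false, since the function returns false on no match.
function summarize_results($results) {
    $lines = array();
    if ($results === false) {
        return $lines; // nothing matched
    }
    foreach ($results as $keyword => $posts) {
        foreach ($posts as $postId => $instances) {
            $lines[] = "$keyword: post $postId ($instances)";
        }
    }
    return $lines;
}

// Sample data in the same shape the function returns (values are illustrative).
$sample = array(
    'problem' => array('373301167' => 1, '373301334' => 1),
    'let'     => array('373300183' => 1),
);
echo implode("\n", summarize_results($sample)), "\n";
```

Checking for false first matters because a foreach over false would raise a warning.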

 

By default the function performs a case-insensitive full-word search. There are optional arguments to make it case-sensitive and to match partial words.
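To see what those two options do at the regex level, here is a minimal sketch. count_keyword is a hypothetical helper (not part of the function above) that toggles the \b word boundary and the i modifier on a single keyword, using a sentence taken from the sample thread.

```php
<?php
// Counts matches of one keyword, toggling the two options described above.
// count_keyword is a hypothetical helper, not part of the posted function.
function count_keyword($text, $keyword, $partial = false, $caseSensitive = false) {
    $b = $partial ? '' : '\b';       // \b anchors the match at word boundaries
    $c = $caseSensitive ? '' : 'i';  // i modifier = case-insensitive
    return preg_match_all('~' . $b . preg_quote($keyword, '~') . $b . '~' . $c, $text);
}

$text = "Let's chill and have fun. Don't come in if you're going to be boringly quiet.";

var_dump(count_keyword($text, 'boring'));            // int(0): "boringly" is not the whole word "boring"
var_dump(count_keyword($text, 'boring', true));      // int(1): partial match inside "boringly"
var_dump(count_keyword($text, 'let'));               // int(1): "Let" matches case-insensitively
var_dump(count_keyword($text, 'let', false, true));  // int(0): case-sensitive, only "Let" is present
```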

 

Example 1: case-insensitive full-word search for 4 words within the content of the OP's example. Notice that "boring" is not returned: the only similar instance in the content is "boringly", and this search only matches full words.

$keyWords = array('a','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords));

Output

Array
(
    [a] => Array
        (
            [373301167] => 2
            [373301334] => 1
        )

    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )

    [let] => Array
        (
            [373300183] => 1
        )

)

 

 

Example 2: case-insensitive partial-word search for the same 4 words. Now "boring" is matched because we flag it for partial matches. We also get a lot of matches on "a" because it appears inside many words.

$keyWords = array('a','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords, true));

Output

Array
(
    [a] => Array
        (
            [373301167] => 9
            [373301257] => 1
            [373301298] => 3
            [373301334] => 6
            [373300183] => 2
            [373301286] => 1
            [373301297] => 1
        )

    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )

    [let] => Array
        (
            [373300183] => 1
        )

    [boring] => Array
        (
            [373300183] => 1
        )

)

 

 

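Example 3: case-sensitive full-word search. The first keyword is now capitalized. Only "problem" is returned: there is no standalone capital "A" in the content, "boring" only appears inside "boringly", and "let" only appears as "Let".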

$keyWords = array('A','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords, false, true));

Output:

Array
(
    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )

)
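
The results above are keyed by post ID, which for a reply is not the ID of the thread it sits in. If the thread ID is what you need, one possible approach (a sketch only, assuming the markup keeps the res/<thread>#<post> link format seen in the posted source) is to build a post-to-thread lookup. map_posts_to_threads is a hypothetical helper name.

```php
<?php
// Hypothetical helper: maps every post ID on the page to the ID of the thread
// it belongs to, using the res/<thread>#<post> links found in the sample markup.
function map_posts_to_threads($content) {
    $map = array();
    preg_match_all('~<span id="no(?:thread|rep)(\d+)"><a href="res/(\d+)#~', $content, $m, PREG_SET_ORDER);
    foreach ($m as $hit) {
        $map[$hit[1]] = $hit[2]; // post ID => thread ID
    }
    return $map;
}

// Trimmed-down markup in the same shape as the page source in the question.
$content = '<span id="nothread373301167"><a href="res/373301167#373301167" class="quotejs">No.</a></span>'
         . '<span id="norep373301334"><a href="res/373301167#373301334" class="quotejs">No.</a></span>';

$map = map_posts_to_threads($content);
echo $map['373301334'], "\n"; // 373301167: the reply belongs to this thread
```

Each post ID returned by the keyword search could then be looked up in this map before building a res/ link.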

Thank you kind sir.

 

Here is what I came up with (even though it kinda sucks).

 

<?php
// Assumes get_ids_by_keyword() from the reply above is already defined or included,
// and that $db is an open mysqli connection (a placeholder name; the original
// post used the old mysql_* functions, which were removed in PHP 7).
if (isset($_POST['Search'])) {
    $board  = isset($_POST['board']) ? trim($_POST['board']) : '';
    $search = isset($_POST['txtSearch']) ? trim($_POST['txtSearch']) : '';

    if ($search === '') {
        echo "Please enter a search term.";
    } elseif (!preg_match('~^[a-z0-9]+$~i', $board)) {
        // the board name goes straight into a URL, so only allow plain names
        echo "Please choose a valid board.";
    } else {
        set_time_limit(0);
        $key       = strtolower($search); // results are keyed by the lowercased keyword
        $safeTerm  = htmlspecialchars($search, ENT_QUOTES);
        $ipaddress = $_SERVER['REMOTE_ADDR'];
        // prepared statement instead of building the SQL from raw user input
        $stmt = $db->prepare("INSERT INTO searches (search, ipaddress, timestamp, found, chan) VALUES (?, ?, ?, ?, ?)");

        // The original loop started at -1; this starts at 0 on the assumption
        // that page numbering begins at 0.
        for ($page = 0; $page < 15; $page++) {
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, 0);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.44 Safari/534.7");
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_COOKIEFILE, "b.txt");
            curl_setopt($curl, CURLOPT_COOKIEJAR, "b.txt");
            curl_setopt($curl, CURLOPT_URL, "http://boards.4chan.org/$board/$page");
            $html = curl_exec($curl);
            curl_close($curl);

            // partial-word, case-insensitive search; false means nothing matched
            $ids   = ($html !== false) ? get_ids_by_keyword($html, $search, true) : false;
            $found = ($ids !== false && isset($ids[$key])) ? '1' : '0';

            if ($found === '1') {
                foreach ($ids[$key] as $postId => $count) {
                    $url = "http://boards.4chan.org/$board/res/$postId";
                    echo "(<strong>$safeTerm</strong>) Found: <a href='$url'>$url</a> on <strong>$board</strong><br>";
                }
            } else {
                echo "(Page $page) No matches found for <strong>$safeTerm</strong> on <strong>/$board/</strong><br>";
            }

            // log the search, one row per page as in the original
            $timestamp = (string) time();
            $stmt->bind_param('sssss', $search, $ipaddress, $timestamp, $found, $board);
            $stmt->execute();
        }
    }
}
?>

 

Site: http://searchchan.com

 

If you feel like chatting, I'd love to learn a bit on Skype: Rider1337

 

Thanks.
