Lukeidiot Posted January 10, 2012 Share Posted January 10, 2012 <tr><td></td><td colspan=2> <table border=0 cellpadding=0 cellspacing=0 width="100%"><tr><td class="rules"><LI>Supported file types are: GIF, JPG, PNG <LI>Maximum file size allowed is 2048 KB. <LI>Images greater than 250x250 pixels will be thumbnailed. <LI>Read the <a href="http://www.4chan.org/rules#b">rules</a> and <a href="http://www.4chan.org/faq">FAQ</a> before posting. <LI><img src="http://static.4chan.org/image/jpn-flag.jpg" width="17" height="11"> <a href="http://www.4chan.org/japanese">このサイトについて</a> - <a href="http://www.nifty.com/globalgate/">翻訳</a><LI>Currently <b>2786</b> unique user posts.</td><td align="right" valign="center"></td></tr></table></td></tr></table></form></div><hr> <script>with(document.post) {name.value=get_cookie("4chan_name"); email.value=get_cookie("4chan_email"); pwd.value=get_pass("4chan_pass"); }</script> <form name="delform" action="http://sys.4chan.org/b/imgboard.php" method=POST><span class="filesize">File : <a href="http://images.4chan.org/b/src/1326189762932.jpg" target="_blank">1326189762.jpg</a>-(32 KB, 555x691)</span><br><a href="http://images.4chan.org/b/src/1326189762932.jpg" target=_blank><img src=http://1.thumbs.4chan.org/b/thumb/1326189762932s.jpg border=0 align=left width=202 height=251 hspace=20 alt="32 KB" md5="VNdi/JU72ZjPDqPFj8GimQ=="></a><a name="0"></a> <input type=checkbox name="373301167" value=delete><span class="filetitle"></span> <span class="postername">Anonymous</span> <span class="posttime">01/10/12(Tue)05:02:42</span> <span id="nothread373301167"><a href="res/373301167#373301167" class="quotejs">No.</a><a href="res/373301167#q373301167" class="quotejs">373301167</a> [<a href="res/373301167">Reply</a>]</span> <blockquote>so i just lost my virginity and i didnt last very long at all, like maybe a minute. 
is it just because its my first time, or do i have a problem.</blockquote><a name="373301257"></a> <table><tr><td nowrap class="doubledash">>></td><td id="373301257" class="reply"> <input type=checkbox name="373301257" value=delete><span class="replytitle"></span> <span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:26 <span id="norep373301257"><a href="res/373301167#373301257" class="quotejs">No.</a><a href="res/373301167#q373301257" class="quotejs">373301257</a></span><br> <span class="filesize">File<a href="http://images.4chan.org/b/src/1326189806323.jpg" target="_blank">1326189806.jpg</a>-(10 KB, 320x240)</span><br><a href="http://images.4chan.org/b/src/1326189806323.jpg" target=_blank><img src=http://0.thumbs.4chan.org/b/thumb/1326189806323s.jpg border=0 align=left width=126 height=95 hspace=20 alt="10 KB" md5="Iy9EbXzWXglnvuLciY0jUg=="></a><blockquote>http://golink.us/mll/vkap</blockquote></td></tr></table> <a name="373301298"></a> <table><tr><td nowrap class="doubledash">>></td><td id="373301298" class="reply"> <input type=checkbox name="373301298" value=delete><span class="replytitle"></span> <span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:52 <span id="norep373301298"><a href="res/373301167#373301298" class="quotejs">No.</a><a href="res/373301167#q373301298" class="quotejs">373301298</a></span><blockquote>what are you, 14?...<br />YES it's normal.</blockquote></td></tr></table> <a name="373301334"></a> <table><tr><td nowrap class="doubledash">>></td><td id="373301334" class="reply"> <input type=checkbox name="373301334" value=delete><span class="replytitle"></span> <span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:04:08 <span id="norep373301334"><a href="res/373301167#373301334" class="quotejs">No.</a><a href="res/373301167#q373301334" class="quotejs">373301334</a></span><blockquote>maybe that guy in ur ass was the one having a problem</blockquote></td></tr></table> <br clear=left><hr> <span 
class="filesize">File : <a href="http://images.4chan.org/b/src/1326189185164.png" target="_blank">1326189185.png</a>-(60 KB, 591x638)</span><br><a href="http://images.4chan.org/b/src/1326189185164.png" target=_blank><img src=http://1.thumbs.4chan.org/b/thumb/1326189185164s.jpg border=0 align=left width=233 height=251 hspace=20 alt="60 KB" md5="46ODgCAcN48NJ6Nvh+I7gg=="></a><a name="0"></a> <input type=checkbox name="373300183" value=delete><span class="filetitle"></span> <span class="postername">Anonymous</span> <span class="posttime">01/10/12(Tue)04:53:05</span> <span id="nothread373300183"><a href="res/373300183#373300183" class="quotejs">No.</a><a href="res/373300183#q373300183" class="quotejs">373300183</a> [<a href="res/373300183">Reply</a>]</span> <blockquote>Get in here!<br /><br />Let's chill and have fun.<br /><br />Don't come in if you're going to be boringly quiet. <br /><br />Be lively!</blockquote><span class="omittedposts">15 posts and 8 image replies omitted. Click Reply to view.</span> <a name="373301286"></a> <table><tr><td nowrap class="doubledash">>></td><td id="373301286" class="reply"> <input type=checkbox name="373301286" value=delete><span class="replytitle"></span> <span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:48 <span id="norep373301286"><a href="res/373300183#373301286" class="quotejs">No.</a><a href="res/373300183#q373301286" class="quotejs">373301286</a></span><blockquote>anthonyismelo</blockquote></td></tr></table> <a name="373301297"></a> <table><tr><td nowrap class="doubledash">>></td><td id="373301297" class="reply"> <input type=checkbox name="373301297" value=delete><span class="replytitle"></span> <span class="commentpostername">Anonymous</span> 01/10/12(Tue)05:03:51 <span id="norep373301297"><a href="res/373300183#373301297" class="quotejs">No.</a><a href="res/373300183#q373301297" class="quotejs">373301297</a></span><blockquote>puppiie1<br />add me</blockquote></td></tr></table> <a name="373301329"></a> 
Basically, when I search for "boringly" (or any word(s) in the thread), I need it to pull the ID of the thread in which it was posted. Here is a visual as well (the source code of the webpage I am searching is above).
.josh Posted January 10, 2012

Because I don't know how you are handling the keyword(s) to look for, or how you want to handle the results, you may need to tweak this to suit your needs, but here is a function to get you started, based on the content you provided.

/**
 * get_ids_by_keyword returns a list of post IDs (thread starters and
 * replies) based on given keywords
 * @param string $content   is the content to scrape
 * @param mixed  $keyWords  is a string (for a single keyword) or an array of
 *                          keywords to search.
 * @param bool   $pMatch    is a flag to determine whether or not to match full
 *                          or partial words. Default is full match
 * @param bool   $cs        is a flag to determine whether or not search should
 *                          be case-sensitive. Default is case-insensitive
 * @return mixed If results found, a multi-dim associative array is returned,
 *               following this format: keyword => (id => instances).
 *               NOTE: if $cs==false, keyword is returned lowercased.
 *               If no keywords were matched, return false.
 */
function get_ids_by_keyword ($content, $keyWords, $pMatch = false, $cs = false) {
  if (!is_array($keyWords)) $keyWords = array($keyWords);
  // escape regex metacharacters (and the ~ delimiter) so keywords match literally
  $keyWords = array_map(function ($kw) { return preg_quote($kw, '~'); }, $keyWords);
  $keyWords = implode('|', $keyWords);
  $results = array();
  $b = ($pMatch == false) ? '\b' : '';  // word boundaries for full-word matching
  $c = ($cs == false) ? 'i' : '';       // "i" modifier for case-insensitive matching
  // $posts[1] holds the post IDs, $posts[2] the matching post bodies
  preg_match_all('~<span class="(?:comment)?postername">.*?<span id="no(?:thread|rep)(\d+)">.*?<blockquote>(.*?)</blockquote>~s',$content,$posts);
  foreach ($posts[2] as $i => $post) {
    if ( preg_match_all('~'.$b.'('.$keyWords.')'.$b.'~'.$c, $post, $keyWordFound) ) {
      foreach ($keyWordFound[1] as $kwf) {
        if ($c) $kwf = strtolower($kwf);
        $cid = $posts[1][$i];
        $results[$kwf][$cid] = ( isset($results[$kwf][$cid]) ) ? ++$results[$kwf][$cid] : 1;
      }
    }
  }
  return ( count($results)>0 ) ? $results : false;
} // end get_ids_by_keyword

Basically, you call the function, passing the content to be scraped and the keyword(s) you want to look for. It will then return a multi-dim associative array of the keyword(s) found, the ID of the post (including replies to posts) each keyword was found in, and how many instances of the keyword were found in that post. By default the function performs a case-insensitive full word search. There are optional arguments to make it case-sensitive and to match partial words.

Example 1: case-insensitive full word search of 4 words within the content of the OP example. Notice how "boring" is not returned, because the only similar instance within the content is "boringly" and this search is flagged as a full word match.

$keyWords = array('a','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords));

Output

Array
(
    [a] => Array
        (
            [373301167] => 2
            [373301334] => 1
        )
    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )
    [let] => Array
        (
            [373300183] => 1
        )
)

Example 2: case-insensitive partial word search of the same 4 words. Now "boring" is matched because we flag it for partial matches. We also get a lot of matches on "a" because it is found inside many other words.

$keyWords = array('a','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords, true));

Output

Array
(
    [a] => Array
        (
            [373301167] => 9
            [373301257] => 1
            [373301298] => 3
            [373301334] => 6
            [373300183] => 2
            [373301286] => 1
            [373301297] => 1
        )
    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )
    [let] => Array
        (
            [373300183] => 1
        )
    [boring] => Array
        (
            [373300183] => 1
        )
)

Example 3: case-sensitive full word search ($pMatch is false, $cs is true). The first keyword is now capitalized, so it no longer matches the lowercase "a", and "let" no longer matches "Let's".

$keyWords = array('A','problem','boring','let');
print_r(get_ids_by_keyword($content, $keyWords, false, true));

Output:

Array
(
    [problem] => Array
        (
            [373301167] => 1
            [373301334] => 1
        )
)
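The original question asks for the thread ID, while the function above returns the ID of every matching post, replies included. Below is a minimal sketch of one way to collapse those post IDs to thread IDs. It assumes every post on the page carries a quote link of the form href="res/<thread>#<post>", as in the pasted markup. The helper names map_posts_to_threads and results_by_thread are hypothetical and not part of the function above.

```php
<?php
/**
 * Map every post ID on a board page to the ID of the thread it belongs to.
 * Relies on quote links such as href="res/373301167#373301257"
 * (thread ID before the '#', post ID after it).
 */
function map_posts_to_threads(string $content): array
{
    $map = array();
    // The "#q<post>" variant of the link is skipped because \d+ cannot
    // match the leading "q".
    if (preg_match_all('~href="res/(\d+)#(\d+)"~', $content, $m, PREG_SET_ORDER)) {
        foreach ($m as $link) {
            $map[$link[2]] = $link[1]; // post ID => thread ID
        }
    }
    return $map;
}

/**
 * Collapse post-level results (keyword => (postId => count)) into
 * thread-level results (keyword => (threadId => count)).
 */
function results_by_thread(array $results, array $postToThread): array
{
    $byThread = array();
    foreach ($results as $keyword => $posts) {
        foreach ($posts as $postId => $count) {
            // Fall back to the post ID itself if no quote link was found.
            $threadId = $postToThread[$postId] ?? $postId;
            $byThread[$keyword][$threadId] = ($byThread[$keyword][$threadId] ?? 0) + $count;
        }
    }
    return $byThread;
}

// Usage with a fragment shaped like the pasted source: the thread starter
// 373301167 and its reply 373301334 both collapse to thread 373301167.
$html = '<a href="res/373301167#373301167" class="quotejs">No.</a>'
      . '<a href="res/373301167#q373301167" class="quotejs">373301167</a>'
      . '<a href="res/373301167#373301334" class="quotejs">No.</a>';
$results = array('problem' => array(373301167 => 1, 373301334 => 1));
print_r(results_by_thread($results, map_posts_to_threads($html)));
```

With thread-level IDs, a link of the form res/<id> always points at a real thread page rather than at a reply ID.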
Lukeidiot Posted January 11, 2012

Thank you kind sir. Here is what I came up with (even though it kinda sucks).
<?php
error_reporting(0);

if (isset($_POST['Search'])) {
    $board  = $_POST['board'];
    $search = $_POST['txtSearch'];

    if ($search == '') {
        echo "Please enter a search term.";
    } else {
        set_time_limit(0);

        // escape user input once, for HTML output and for the SQL below
        $searchHtml = htmlspecialchars($search);
        $boardHtml  = htmlspecialchars($board);
        // NOTE: the mysql_* extension was removed in PHP 7; on current PHP
        // use mysqli or PDO prepared statements instead.
        $searchSql  = mysql_real_escape_string($search);
        $boardSql   = mysql_real_escape_string($board);
        $ipaddress  = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);

        $i = -1; // page bounds kept from the original post
        while ($i < 15) {
            // fetch one board page
            $curlKey = curl_init();
            curl_setopt($curlKey, CURLOPT_SSL_VERIFYPEER, FALSE);
            curl_setopt($curlKey, CURLOPT_HEADER, 0);
            curl_setopt($curlKey, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curlKey, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.44 Safari/534.7");
            curl_setopt($curlKey, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curlKey, CURLOPT_COOKIEFILE, "b.txt");
            curl_setopt($curlKey, CURLOPT_COOKIEJAR, "b.txt");
            curl_setopt($curlKey, CURLOPT_URL, "http://boards.4chan.org/$board/$i");
            $loginKey = curl_exec($curlKey);
            curl_close($curlKey);
            $i++;

            // partial, case-insensitive search; result keys come back lowercased
            $thread_id = get_ids_by_keyword($loginKey, $search, true);
            $key       = strtolower($search);
            $found     = ($thread_id !== false && isset($thread_id[$key])) ? 1 : 0;
            $timestamp = time();

            if ($found) {
                foreach ($thread_id[$key] as $ss => $s) {
                    echo "(<strong>$searchHtml</strong>) Found: <a href='http://boards.4chan.org/$boardHtml/res/$ss'>http://boards.4chan.org/$boardHtml/res/$ss</a> on <strong>$boardHtml</strong><br>";
                }
            } else {
                echo "(Page $i) No matches found for <strong>$searchHtml</strong> on <strong>/$boardHtml/</strong><br>";
            }

            // log the search and whether it matched on this page
            mysql_query("INSERT INTO searches (search, ipaddress, timestamp, found, chan) VALUES ('$searchSql','$ipaddress','$timestamp','$found','$boardSql')");
        }
    }
}
?>

Site: http://searchchan.com

If you feel like chatting, I'd love to learn a bit on Skype: Rider1337

Thanks.
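The script above builds its INSERT by placing the search term inside the SQL string, so a search containing a quote character can break or alter the query. Below is a hedged sketch of the same logging done with a PDO prepared statement. It assumes a PDO connection is available. The names log_search and $db are hypothetical, and the column list is copied from the query above.

```php
<?php
/**
 * Record one search in the `searches` table with a prepared statement,
 * so user input is bound as data and never spliced into the SQL text.
 * $db is a hypothetical, already-connected PDO handle.
 */
function log_search(PDO $db, string $search, string $ip, bool $found, string $board): bool
{
    $stmt = $db->prepare(
        'INSERT INTO searches (search, ipaddress, timestamp, found, chan)
         VALUES (:search, :ip, :ts, :found, :chan)'
    );
    return $stmt->execute(array(
        ':search' => $search,
        ':ip'     => $ip,
        ':ts'     => time(),
        ':found'  => $found ? 1 : 0,
        ':chan'   => $board,
    ));
}
```

A call would look like log_search($db, $search, $_SERVER['REMOTE_ADDR'], $found, $board), where $db comes from something like new PDO('mysql:host=...;dbname=...', $user, $pass) with your own connection details. Setting PDO::ATTR_ERRMODE to PDO::ERRMODE_EXCEPTION makes failures visible instead of silent.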