n8frog
Posted September 27, 2010

You have all been so helpful with my other questions that I thought I would ask for some advice on how to create my web scraper.

I would like to create a scraper that fetches the HTML contents of a remote website, strips all tags, punctuation, and line breaks, and then formats the text as single words separated by spaces, so that the text can be stored in a db and used to produce search results.

I have tried the following code, which I found in a tutorial. To get the page content I used:

    function get_web_page( $url )
    {
        $options = array(
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_USERAGENT      => "spider", // who am i
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Then to strip the tags I used:

    function strip_html_tags( $text )
    {
        $text = preg_replace(
            array(
                // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',

                // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
                "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            ),
            $text );

        return strip_tags( $text );
    }

The problem is that I don't understand the tag-stripping code, and it does not format the data quite the way I want.

Also, when I try to insert the result of the strip_html_tags function into my db, all that is stored is the word "Array". I echoed the value returned by strip_html_tags and the text of the target website was displayed as expected, but looking at the source of the page I found that the line breaks were still there, as well as "&quot;" and "&nbsp;".

If someone could show me another way to do this so the code is a little easier to understand, or explain what the confusing strip_html_tags function is doing, I would be grateful.
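For reference, the goal described above (strip tags, punctuation, and line breaks, then leave single words separated by a space) can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not the tutorial's method: html_to_words is a hypothetical name, the input is assumed to be UTF-8, and only script and style blocks are dropped before the tags are stripped.

```php
<?php
// Hypothetical helper (not part of the tutorial code): reduce an HTML string
// to plain words separated by single spaces. Assumes UTF-8 input.
function html_to_words(string $html): string
{
    // Drop <script> and <style> blocks together with their contents.
    $text = preg_replace('@<(script|style)\b[^>]*>.*?</\1>@si', ' ', $html) ?? '';

    // Put a space in front of every tag so neighbouring blocks do not fuse
    // into one word, then remove the tags themselves.
    $text = strip_tags(str_replace('<', ' <', $text));

    // Turn entities such as &quot; and &nbsp; into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Replace every run of characters that are not letters or digits
    // (punctuation, line breaks, non-breaking spaces) with a single space.
    $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text) ?? '';

    return trim($text);
}

// Example input; with the quoted get_web_page() this would be
// $result['content'], because that function returns an array.
echo html_to_words('<p>Hello,&nbsp;world!</p><script>var x = 1;</script><p>Line one<br>line&quot;two&quot;</p>');
```

Since get_web_page() as quoted returns an array (the cURL info plus the 'errno', 'errmsg', and 'content' keys), it is the 'content' element, a string, that would be passed to a function like this.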