Everything posted by twittoris

  1. I don't think that worked. Can you explain it to me?
  2. I have a DOMDocument with which I inspect an HTML page for links (href). That part works; however, I am not sure how to use preg_match if I only want to display links in this layout (see the regex sketch after this list):

     <tr>
     <td headers="c1"><a title="Link to entity information." tabindex="1" href="CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476&p_entity_name=%41%72%77%65%6E%20%45%71%75%69%74%69%65%73&p_name_type=%41&p_search_type=%42%45%47%49%4E%53&p_srch_results_page=0">ABC LLC</a></td>
     </tr>

     I hope someone who is really good at preg_match can help me. Thanks.
  3. Here I have edited it a little and put the script online, but it is still spitting out every link on the page: http://empirebuildingsestate.com/table.php. I just want to grab links similar to this layout only:

     CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476&p_entity_name=%41%72%77%65%6E%20%45%71%75%69%74%69%65%73&p_name_type=%41&p_search_type=%42%45%47%49%4E%53&p_srch_results_page=0

     $dom = new DOMDocument();
     @$dom->loadHTML($html);

     // grab all the <a> elements on the page
     $xpath = new DOMXPath($dom);
     $hrefs = $xpath->evaluate("/html/body//a");

     for ($i = 0; $i < $hrefs->length; $i++) {
         $href = $hrefs->item($i);
         $url  = $href->getAttribute('href');

         // keep only the entity-information links
         if (!preg_match('/CORPSEARCH\.ENTITY_INFORMATION\?p_nameid=/', $url)) {
             continue;
         }

         $sql = "INSERT INTO links (cid, nlink) VALUES ('$i', '$url')";
         $result = mysql_query($sql);
         echo $url;

         // if data was successfully inserted into the database, display "Successful"
         if ($result) {
             echo "Successful";
             echo "<BR>";
         } else {
             echo "ERROR";
         }
         echo "<br />Link stored: $url";
     }
     ?>
  4. What if I implement preg_match somewhere in the code? Will it pull only the URLs containing that pattern?
  5. I am trying to take specific links from my site and place them into my database. I only want links that start with CORPSEARCH.ENTITY_INFORMATION?p_nameid= (see the filter sketch after this list). Can someone point me in the right direction here? The code is below:

     // make the cURL request to $target_url
     $html = curl_exec($ch);
     if (!$html) {
         echo "<br />cURL error number:" . curl_errno($ch);
         echo "<br />cURL error:" . curl_error($ch);
         exit;
     }

     // parse the html into a DOMDocument
     $dom = new DOMDocument();
     @$dom->loadHTML($html);

     // grab all the <a> elements on the page
     $xpath = new DOMXPath($dom);
     $hrefs = $xpath->evaluate("/html/body//a");

     for ($i = 0; $i < $hrefs->length; $i++) {
         $href = $hrefs->item($i);
         $url  = $href->getAttribute('href');
         $sql  = "INSERT INTO links (cid, nlink) VALUES ('$i', '$url')";
         $result = mysql_query($sql);
         echo $result;
         echo $url;
  6. It's a public database and there is a 500-result limit per search. I think it has to do with the site's robots file.
  7. I have a list of companies I have to look up on the state's entity search page. Here is what I have so far, except there seems to be a verification error. Does anyone know how to fix this so the page will display results? (See the cookie sketch after this list.)

     <?php
     // INIT CURL
     $ch = curl_init();

     // SET URL FOR THE POST FORM LOGIN
     curl_setopt($ch, CURLOPT_URL, 'http://appext9.dos.state.ny.us/corp_public/CORPSEARCH.SELECT_ENTITY');

     // ENABLE HTTP POST
     curl_setopt($ch, CURLOPT_POST, 1);

     // set curl options
     curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)");
     curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
     curl_setopt($ch, CURLOPT_COOKIEJAR, 'CURLCOOKIE');
     curl_setopt($ch, CURLOPT_HEADER, true);
     curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
     curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);

     // visit page to get cookies
     //$strGet_page_contents = curl_exec($ch);

     // SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
     curl_setopt($ch, CURLOPT_POSTFIELDS, 'p_entity_name=Apple LLC&p_name_type=Arwen&p_search_type=BEGINS&submit=Search!');

     // EXECUTE
     $store = curl_exec($ch);
     echo $store;

     // CLOSE CURL
     curl_close($ch);
     ?>
  8. How do I make a database for this? (See the schema sketch after this list.)

     $result = mysql_query("SELECT url FROM links WHERE visited != 1");
     if ($result) {
         while ($row = mysql_fetch_array($result)) {
             $target_url = $row['url'];
             $userAgent  = 'ScraperBot';
  9. Yeah, I know. I don't know what that means. I think it is a validation error. Or maybe my form names are wrong, but I don't know how to figure out the hidden form values from the BBL.js file they are passed to.
  10. $post_fields = array(
          'g.hid_borough.value' => $_POST['1'],
          'g.hid_block.value'   => $_POST['995'],
          'g.hid_lot.value'     => $_POST['1'],
      );

      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, 'http://a836-acris.nyc.gov/Scripts/DocSearch.dll/BBLResult'); // set the remote url
      curl_setopt($ch, CURLOPT_POST, 1); // yes, we are posting
      curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields); // this is our POST data
      curl_setopt($ch, CURLOPT_HEADER, 0); // no headers in output
      curl_setopt($ch, CURLOPT_VERBOSE, 1); // verbose output, good for debugging
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // $ch will return the results of the POST when executed
      curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"); // remote sites often check for a known user agent

      $result = curl_exec($ch);
      curl_close($ch);

      print $result;
      ?>
  11. Now I changed to cURL and receive "HTTP 405 - Resource not allowed". Anyone?
  12. Forgot to attach the code:

      <?php
      $URL = 'http://a836-acris.nyc.gov/Scripts/DocSearch.dll/BBL';

      $postdata = http_build_query(
          array(
              'hid_borough' => '1',
              'hid_block'   => '995',
              'hid_lot'     => '1',
          )
      );

      $opts = array('http' =>
          array(
              'method'  => 'POST',
              'header'  => 'Content-type: application/x-www-form-urlencoded',
              'content' => $postdata,
          )
      );

      $context = stream_context_create($opts);
      $result  = file_get_contents($URL, false, $context);
      ?>
  13. I am trying to grab the contents of a form result. However, I keep getting the following message:

      Warning: file_get_contents(http://a836-acris.nyc.gov/docsearch.dll/BBL) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 405 Method not allowed in /home/content/e/m/p/empireestate/html/acris.php on line 23

      This is my first draft, so I know it's not working, but I was wondering if someone could give me a few pointers.
  14. Here is my index.php file:

      <?php
      require_once('Smarty.class.php');
      require_once('config.php');

      if (!is_writable('smarty_cache') && !is_writable('smarty_compile')) {
          if (function_exists('chmod')) {
              if (!(chmod('smarty_cache', 0777) && chmod('smarty_compile', 0777)))
                  print 'The directories <code>smarty_cache</code> and <code>smarty_compile</code> are not writable and this problem could not be fixed automatically. Please chmod these directories to 777.';
          } else
              print 'The directories <code>smarty_cache</code> and <code>smarty_compile</code> are not writable and this problem could not be fixed automatically. Please chmod these directories to 777.';
      }

      require_once('rss_fetch.inc');

      $smarty = new Smarty;
      $smarty->assign('site_name', $config['site_name']);
      $smarty->assign('adsense_id', $config['adsense_id']);
      $smarty->assign('disable_tooltips', $config['disable_tooltips']);
      $smarty->assign('disable_search', $config['disable_search']);
      $smarty->assign('links_new_window', $config['links_new_window']);
      $smarty->template_dir = 'template';
      $smarty->compile_dir = 'smarty_compile';
      $smarty->cache_dir = 'smarty_cache';

      $feeds = file('feeds.txt');
      $data = array();
      $id = 0;
      error_reporting(0);

      foreach ($feeds as $feed) {
          $i = 0;
          $title = '';
          // each line of feeds.txt is a feed URL, optionally followed by a title
          if (preg_match('/^([^ ]*) (.*)$/', $feed, $m)) {
              $feed  = $m[1];
              $title = $m[2];
          } else
              $feed = trim($feed);

          $fetcher = fetch_rss($feed);
          if ($fetcher->items) {
              array_push($data, array('title' => ($title ? $title : $fetcher->channel['title'])));
              $data[count($data)-1]['link']  = $fetcher->channel['link'];
              $data[count($data)-1]['links'] = array();
              foreach ($fetcher->items as $item) {
                  if ($i++ < $config['items_per_feed'])
                      array_push($data[count($data)-1]['links'], array(
                          'title' => ($config['max_headline_length'] && strlen($item['title']) > $config['max_headline_length'])
                                     ? (substr($item['title'], 0, $config['max_headline_length']-3) . '...')
                                     : $item['title'],
                          'link'  => $item['link'],
                          'desc'  => preg_replace('/<([^>]*)>/', '',
                                     $item['summary'] ? $item['summary']
                                     : ($item['atom_content'] ? $item['atom_content'] : $item['description'])),
                          'id'    => $id++
                      ));
              }
          }
      }

      $smarty->assign('data', $data);
      $smarty->display('index.tpl');
      ?>
  15. I have a page which formats a list of RSS feeds I have put together. Is there a way I can program a PHP script to read each RSS feed and output each feed into its own HTML file? So if I have 15 feeds being read and formatted, I should have 15 HTML files created. (See the per-feed sketch after this list.)
  16. Hi guys, I have hit a roadblock in my PHP script, which scrapes websites and saves them as HTML files. So far I have created a scraper which grabs the links and content from any given website. I can save the result as an HTML file; however, I want to automate the process to go through the links from the initially scraped page. I think this can be done by searching for the name of the last HTML file that was created and adding a number to the file name, then accessing the array to scrape the additional links (see the filename sketch after this list). I do not know how to code that part, though, as my knowledge of arrays is limited. Any help would be great.
  17. I have a scraper that takes links off a designated webpage and then lists the URLs from that page. Can someone help me with the next part I want to do, which is to have the script grab the contents of each link and save it as an HTML file for each page? (See the savePages sketch after this list.)

      <form name='getPageForm' action='' method='post'>
      Domain (example: http://www.mysite.com/ note: currently only works with root domain name (no starting at xyz folder): <br/>
      <input type='text' name='pageName' size='50' /><br />
      Number of links <input type='text' name='numLinks' size='2' value='50' /> (will not be exact; will return # + whatever extra is on the current page iteration)<br />
      <input type='submit' name='getPage' value='load' />
      </form>

      <?php
      class scraper {
          var $linkList;       // list of data scraped for current page
          var $rootURL;        // root domain entered in from form
          var $maxLinks;       // max links from form
          var $masterLinkList; // master list of links scraped

          /* function __construct: constructor, used to do initial property assignments,
             based on form input */
          function __construct($rootURL, $max) {
              $this->rootURL = $rootURL;
              $this->maxLinks = $max;
              $this->masterLinkList[] = $this->rootURL;
          } // end function __construct

          /* function scrapePage: goal is to scrape the page content of the url passed to it
             and return all potential links. Problem is that not all links are neatly placed
             inside a href tags, so using the php DOM will not always return all the links
             on the page. Solution so far is to assume that regardless of where the actual
             link resides, chances are it's within quotes, so the idea is to grab all things
             that are wrapped in quotes. */
          function scrapePage($url) {
              $linkList = array();
              $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

              // make the cURL request to $url
              $ch = curl_init();
              curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
              curl_setopt($ch, CURLOPT_URL, $url);
              curl_setopt($ch, CURLOPT_FAILONERROR, true);
              curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
              curl_setopt($ch, CURLOPT_AUTOREFERER, true);
              curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
              curl_setopt($ch, CURLOPT_TIMEOUT, 10);
              $file = @curl_exec($ch);
              if (!$file) {
                  $this->linkList = $linkList;
              } else {
                  // assume everything inside quotes is a possible link
                  preg_match_all('~[\'"]([^\'"]*)[\'"]~', $file, $links);
                  // assign results to linkList
                  $this->linkList = $links[1];
              }
          } // end function scrapePage

          /* function filterLinks: goal is to go through each item in linkList (stuff pulled
             from scrapePage) and try to validate it as an internal link. So far we use
             basename to look for a valid page extension (specified in $validPageExt). Need
             to add the ability for the user to enter valid page extensions for the target
             domain. We also attempt to filter out external links by checking whether an
             element starts with 'http' and, if it does, whether it starts with rootURL. We
             also assume that if there's a space somewhere in the element it's not valid.
             Yes, that's not 100%, because you can technically have a link with spaces in
             it, but most people don't actually code that way, and it filters out a lot of
             stuff between quotes, so the benefits far outweigh the cost. */
          function filterLinks() {
              // remove all elements that do not have valid basename extensions
              $validPageExt = array('htm','html','php','php4','php5','asp','aspx','cfm');
              foreach ($this->linkList as $k => $v) {
                  $v = basename($v);
                  $v = explode('.', $v);
                  if (!in_array($v[1], $validPageExt))
                      unset($this->linkList[$k]);
              } // end foreach linkList

              // remove external links, convert relatives to absolute
              foreach ($this->linkList as $k => $v) {
                  // if $v starts with http...
                  if (substr($v, 0, 4) == "http") {
                      // if absolute link is not from domain, delete it
                      if (!preg_match('~^' . rtrim($this->rootURL, '/') . '~i', $v))
                          unset($this->linkList[$k]);
                  } else {
                      // if it doesn't start with http, assume it is relative; add rootURL to it
                      $this->linkList[$k] = rtrim($this->rootURL, '/') . '/' . ltrim($this->linkList[$k], '/');
                  } // end else
              } // end foreach linkList

              // assume that if there's a space in there, it's not a valid link
              foreach ($this->linkList as $k => $v) {
                  if (strpos($v, ' '))
                      unset($this->linkList[$k]);
              } // end foreach linkList

              // filter out duplicates and reset keys
              $this->linkList = array_unique($this->linkList);
              $this->linkList = array_values($this->linkList);
          } // end function filterLinks

          /* function addLinksToMasterLinkList: goal here is, once data is retrieved from
             the current link and filtered, to add the links to the master link list. Also
             we remove dupes from the master list and reset keys. This function could
             probably be put inside filterLinks (and it was initially); I couldn't decide
             whether it deserved its own function or not, so I ended up going for it. */
          function addLinksToMasterLinkList() {
              // add each link to master link list
              foreach ($this->linkList as $v)
                  $this->masterLinkList[] = $v;

              // filter out duplicates on master link list and reset keys
              $this->masterLinkList = array_unique($this->masterLinkList);
              $this->masterLinkList = array_values($this->masterLinkList);
          } // end function addLinksToMasterLinkList

          /* function getLinks: basically the main engine of this bot. Goal is to go down
             the master link list and call each of the other functions until we've passed
             the max links specified. It's not coded to stop at exactly maxLinks; it's coded
             so that if the count is less than max, it scrapes another page. So the end
             result will be the count before the last iteration, plus whatever is on the
             last page. For example, if max is 50 and so far we're at 45 links, another page
             gets scraped. If that page has 10 links on it, then the end result will be 55,
             not 50 links. Also, we make sure to break out of the while loop if there are no
             more links on the master link list to grab data from. This is for when the site
             only has a total of, say, 20 links and you set the number of links to 100: it
             will break out of the loop. */
          function getLinks() {
              // start at first element
              $x = 0;
              // while there are fewer links in the master link list than the max allowed...
              while (count($this->masterLinkList) < $this->maxLinks) {
                  // break out of loop and end scraping if there are no more links on the master list
                  if (!$this->masterLinkList[$x]) break;
                  // scrape current page in the master link list
                  $this->scrapePage($this->masterLinkList[$x]);
                  // filter results from the scrape
                  $this->filterLinks();
                  // add filtered results to master list
                  $this->addLinksToMasterLinkList();
                  // move to next link in master link list
                  $x++;
              } // end while count < max
          } // end function getLinks

          /* function dumpLinkList: simple function to dump out results.
             Mostly a debugging thing. */
          function dumpLinkList() {
              echo "<pre>"; print_r($this->masterLinkList); echo "</pre>";
          } // end function dumpLinkList

      } // *** end class scraper

      // if user enters url...
      if ($_POST['pageName']) {
          // create object
          $scraper = new scraper($_POST['pageName'], $_POST['numLinks']);
          // grab links
          $scraper->getLinks();
          // dump out results
          $scraper->dumpLinkList();
      } // end if $_POST
      ?>
  18. Yes, exactly. I don't know how to open the file and set each array entry as a target to scrape with a loop.
  19. I have created a small scraper which saves content to HTML and saves the links from the scrape, as an array, in a text file. I want the script to run for each link that is in the text document. The text document looks like this (see the loop sketch after this list):

      Array
      (
          [0] => id="cnn_switchEdition_intl" href="http://edition.cnn.com/?cnn_shwEDDH=1" title="CNN INTERNATIONAL"
          [1] => href="javascript:void(0)" onclick="showOverlay('profile_signup_overlay');return false;" title=""
          [2] => href="javascript:void(0)" onclick="showOverlay('profile_signin_overlay');return false;" title=""
          [3] => id="nav-home" class="nav-media no-border nav-on" href="/" title="Breaking News, U.S., World Weather Entertainment and Video News from CNN.com"
          [4] => id="nav-video" class="nav-media no-border" href="/video/" title="Video Breaking News Videos from CNN.com"
          [5] => id="nav-newspulse" class="nav-media" href="http://newspulse.cnn.com/" title="NewsPulse from CNN.com"
          [6] => id="nav-us" href="/US/" title="U.S. News Headlines Stories and Video from CNN.com"
          [7] => id="nav-world" href="/WORLD/" title="World News International Headlines Stories and Video from CNN.com"
  20. Yeah, that does work a lot better. However, I am inserting AdSense automatically next to each link, and if I am not mistaken the AdSense robots work better on hard-coded pages.
  21. Is there a way to detect the last file name created, so that when I reference the pages I can make a script automatically load the correct file when the "next page" button is hit on the front-end page? (See the glob sketch after this list.)
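
A minimal preg_match_all sketch for posts 2 and 3, assuming the page source is already in $html; the pattern is my reading of the sample table row, not the poster's final code:

<?php
// $html holds the fetched page source (assumption: already retrieved via cURL)
$pattern = '~href="(CORPSEARCH\.ENTITY_INFORMATION\?p_nameid=[^"]*)"[^>]*>([^<]+)</a>~i';

if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $url  = $m[1]; // the entity-information link
        $name = $m[2]; // the link text, e.g. "ABC LLC"
        echo $name . ' => ' . $url . "<br />\n";
    }
}
?>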
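For post 5, one way to keep only the wanted links is a strpos() check inside the existing DOM loop before the INSERT; the escaping call is an addition of mine, not in the original:

<?php
// sketch of the loop body from post 5, assuming $hrefs came from the same XPath query
for ($i = 0; $i < $hrefs->length; $i++) {
    $url = $hrefs->item($i)->getAttribute('href');

    // only store links that start with the entity-information prefix
    if (strpos($url, 'CORPSEARCH.ENTITY_INFORMATION?p_nameid=') !== 0) {
        continue;
    }

    $sql = "INSERT INTO links (cid, nlink) VALUES ('$i', '" . mysql_real_escape_string($url) . "')";
    mysql_query($sql);
    echo "Link stored: $url<br />";
}
?>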
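For post 7, the "verification error" is often a missing session cookie; a sketch of a two-step request that GETs the form page first and replays the cookies on the POST. The form-page URL below is a placeholder, and p_name_type 'A' follows the encoded sample link in post 2; both are assumptions:

<?php
$cookieFile = tempnam(sys_get_temp_dir(), 'curlcookie');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // and send them back on later requests

// step 1: GET the search form so the server sets its session cookies
// (placeholder URL: replace with the actual search form page)
curl_setopt($ch, CURLOPT_URL, 'http://appext9.dos.state.ny.us/corp_public/CORPSEARCH_FORM_PAGE');
curl_exec($ch);

// step 2: POST the search on the same handle, so the cookies are replayed
curl_setopt($ch, CURLOPT_URL, 'http://appext9.dos.state.ny.us/corp_public/CORPSEARCH.SELECT_ENTITY');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'p_entity_name' => 'Apple LLC',
    'p_name_type'   => 'A',       // the sample link in post 2 decodes to 'A'
    'p_search_type' => 'BEGINS',
    'submit'        => 'Search!',
)));
$results = curl_exec($ch);

curl_close($ch);
echo $results;
?>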
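For post 8, a guess at the minimal schema the SELECT implies (a url column plus a visited flag); the column names come from the query, but the types are assumptions, and an open MySQL connection is assumed as in the other posts:

<?php
// hypothetical schema matching "select url from links where visited != 1"
mysql_query("
    CREATE TABLE links (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url     VARCHAR(255) NOT NULL,
        visited TINYINT(1)   NOT NULL DEFAULT 0
    )
");
?>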
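For post 15, a sketch that reuses the feeds.txt / fetch_rss() setup from the index.php in post 14 and writes one HTML file per feed; the feed_N.html naming and the markup are assumptions:

<?php
require_once('rss_fetch.inc'); // MagpieRSS, as in the index.php from post 14

$feeds = file('feeds.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($feeds as $n => $line) {
    // feeds.txt lines may be "URL optional-title", as in post 14
    $parts = explode(' ', trim($line), 2);
    $rss = fetch_rss($parts[0]);
    if (!$rss || !$rss->items) continue;

    // build a simple page listing the feed's entries
    $html = '<html><head><title>' . htmlspecialchars($rss->channel['title']) . '</title></head><body>';
    foreach ($rss->items as $item) {
        $html .= '<p><a href="' . htmlspecialchars($item['link']) . '">'
               . htmlspecialchars($item['title']) . '</a></p>';
    }
    $html .= '</body></html>';

    // one file per feed: feed_0.html, feed_1.html, ... (naming is hypothetical)
    file_put_contents('feed_' . $n . '.html', $html);
}
?>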
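For post 17, a hypothetical savePages() helper that could sit alongside the scraper class: after getLinks() fills masterLinkList, it fetches each URL and writes it to a numbered file. The function name and the page_N.html scheme are mine, not the poster's:

<?php
// hypothetical helper: fetch every link in $masterLinkList and save each
// page to a numbered HTML file (the page_N.html scheme is an assumption)
function savePages($masterLinkList) {
    foreach ($masterLinkList as $n => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $page = curl_exec($ch);
        curl_close($ch);

        if ($page !== false) {
            file_put_contents('page_' . $n . '.html', $page);
        }
    }
}

// usage with the scraper from post 17:
// $scraper->getLinks();
// savePages($scraper->masterLinkList);
?>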
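For posts 18 and 19, print_r output is awkward to parse back, so this sketch swaps in a plainer format: save one href per line, then loop over the lines. The links.txt name and the extraction regex are assumptions:

<?php
// when saving: write one URL per line instead of print_r($links) output,
// extracting the href="..." part from each scraped attribute string
$out = fopen('links.txt', 'w');
foreach ($links as $attr) {
    if (preg_match('~href="([^"]+)"~', $attr, $m)) {
        fwrite($out, $m[1] . "\n");
    }
}
fclose($out);

// when scraping: each line of the file becomes a target for the loop
foreach (file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $target_url) {
    // skip javascript pseudo-links like those in the dump from post 19
    if (strpos($target_url, 'javascript:') === 0) continue;
    // ... fetch $target_url with cURL and save it, as in the earlier sketches
}
?>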
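For posts 16 and 21, a glob()-based sketch that finds the highest-numbered saved file so the next file name (or the "next page" target) can be computed; it assumes the hypothetical page_N.html scheme from the savePages sketch above:

<?php
// collect all saved pages and sort them in natural (numeric) order
$files = glob('page_*.html');
natsort($files);

if ($files) {
    $last = end($files); // e.g. "page_14.html"
    preg_match('~page_(\d+)\.html~', $last, $m);
    $nextFile = 'page_' . ((int)$m[1] + 1) . '.html'; // next file to create or load
    echo "Last file: $last, next: $nextFile";
} else {
    $nextFile = 'page_0.html'; // nothing saved yet
}
?>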