hellonoko

Posts posted by hellonoko

  1. I have the simple query below that compares a URL to a list of URLs in a DB.

    If the input is found, $rows = 1 and the code evaluates correctly.

    However, if the input is not found, mysql_num_rows() returns nothing: not 0 rows, not NULL, and my 'else' statement fails.

    So I can only seem to evaluate to TRUE, not to FALSE, but I need both.

    How do I do this properly so I can evaluate both ways?

     

    $query = mysql_query("SELECT * FROM `secondarylinks` WHERE `link` = '$last_url' && `scraped` = '0' LIMIT 1") or die(mysql_error());
    $rows = mysql_num_rows($query) or die(mysql_error());
    
    if ( $rows === 1 )
    {
    	echo 'found row';
    }
    else
    {
    	echo 'not found';
    }
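
A likely explanation, sketched under assumptions: when nothing matches, mysql_num_rows() returns 0, and 0 is falsy, so the `or die(mysql_error())` after it fires. There is no MySQL error to report, so die() prints an empty string and the script stops before the if/else. The snippet below reproduces that with plain PHP; count_rows() and fail() are hypothetical stand-ins for mysql_num_rows() and die(), not part of the original code.

```php
<?php
// Stand-in for mysql_num_rows(): returns a row count, which may legitimately be 0.
function count_rows(array $rows): int
{
    return count($rows);
}

// Stand-in for die(): throws, so the effect can be observed without ending the script.
function fail(string $message)
{
    throw new RuntimeException($message);
}

// Pattern from the post: 0 is falsy, so the right-hand side of `or` runs.
try {
    $rows = count_rows([]) or fail('treated as an error');
    $outcome = 'reached the if/else';
} catch (RuntimeException $e) {
    $outcome = 'stopped early: ' . $e->getMessage();
}
echo $outcome, "\n";

// Without `or ...`, a count of 0 is just a value and the else branch is reachable.
$rows = count_rows([]);
$message = ($rows === 1) ? 'found row' : 'not found';
echo $message, "\n";
```

Keeping `or die(mysql_error())` on the mysql_query() call only, and dropping it from the mysql_num_rows() line, should let both branches run.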
    

  2. I see, but don't I want to be a bit more exact, since URLs could be very similar but different?

    I'm still not sure what is happening, as the code below prints neither ' link in DB!' nor ' no such link'.

     

    //echo $last_crawled_link = "this.bigstereo.net/wp-content/uploads/2009/03/tomorrow-wow-remix.mp3";
    echo $last_crawled_link = "thisisnotinthedb";
    
    $query = mysql_query("SELECT * FROM `primarylinks` WHERE `link` LIKE '%$last_crawled_link%' LIMIT 1") or die(mysql_error());
    
    $rows = mysql_num_rows($query) or die(mysql_error());
    //
    if ( $rows == 0)
    {
    	echo ' no such link';
    }
    else
    {
    	echo ' link in DB!';
    }
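
To illustrate the concern about similar URLs, here is a rough plain-PHP analogy of the two WHERE clauses. The stored links are made up for the example, and the comparison only approximates MySQL (which may, for instance, compare case-insensitively depending on collation).

```php
<?php
// Hypothetical stored links, standing in for rows of `primarylinks`.
$stored = [
    'example.com/a/track.mp3',
    'example.com/a/track.mp3.bak',
];
$needle = 'example.com/a/track.mp3';

// Rough equivalent of WHERE `link` = '...': only the identical string matches.
$exact = array_values(array_filter($stored, function ($link) use ($needle) {
    return $link === $needle;
}));

// Rough equivalent of WHERE `link` LIKE '%...%': any link containing the text matches.
$partial = array_values(array_filter($stored, function ($link) use ($needle) {
    return strpos($link, $needle) !== false;
}));

echo count($exact), ' exact, ', count($partial), " partial\n";
```

So for a duplicate check, `=` looks like the safer choice; LIKE with wildcards on both sides would also flag longer URLs that merely contain the one being tested.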

     

     

  3. Well, it's crawling music blogs. The links will eventually be used to copy mp3s.

     

    I worked through my code a little deeper and found that at this line:

     

    echo $last_crawled_link = mysql_real_escape_string($last_crawled_link) or die(mysql_error());

     

    the variable just goes away. That is, mysql_real_escape_string() turns it from a URL into an empty value.

     

    Any ideas?

  4. My code below compares a link to a DB full of links.

    If the link is already in the DB, it displays the appropriate response.

    As it stands, both links (first two lines) are in the DB.

    If "test" is used, my code works fine. However, when I use the real URL I get no output and no errors. Is there something I need to do when handling URLs in and outside of my DB?

     

    Any ideas?

     

    //$last_crawled_link = "this.bigstereo.net/wp-content/uploads/2009/03/tomorrow-wow-remix.mp3";
    $last_crawled_link = "test";
    
    $query = mysql_query("SELECT * FROM `primarylinks` WHERE `link` = '$last_crawled_link' LIMIT 1") or die(mysql_error());
    
    $rows = mysql_num_rows($query) or die(mysql_error());
    //
    if ( $rows == 0)
    {
    	echo ' no such link';
    }
    else
    {
    	echo ' link in DB!';
    }

  5. Errors:

    Notice: Undefined variable: list_links in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 21
    

    Once

     

    Notice: Undefined variable: list_links in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 58
    

    Many times.

     

    MySQL server has gone away

    At end.

     

    Full code:

    <?php
    
    ini_set ("display_errors", "1");
    error_reporting(E_ALL);
    
    mysql_connect("localhost","sharingi_ian","***")or die ("Could not connect to database");
    mysql_select_db("sharingi_scrape") or die ("Could not select database");
    
    //$target_url = "http://empreintes-digitales.fr";
    //$target_url = 'http://redthreat.wordpress.com/';
    //$target_url= 'http://www.kissatlanta.com/blog/';
    //$target_url= 'http://www.empreintes-digitales.fr/';
    
    //$target_url = 'http://electrorash.com/';
    
    $target_url = 'http://this.bigstereo.net/';
    
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    
    // crawl first page
    $clean_links = crawl_page( $target_url, $userAgent, $list_links);
    
    // seperates links into links that are direct mp3 links and other links.
    //
    
    foreach($clean_links as $key => $value) 
    { 
      		if( strpos( $value, ".mp3") !== FALSE) 
    	{ 
    		$mp3_links[] = $value;
      		}
    	else
    	{
    		$other_links[] = $value;
    	}
    } 
    
    $mp3_links = array_values($mp3_links); 
    $other_links = array_values($other_links); 
    
    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    }
    
    echo '<br>';
    
    foreach ($other_links as $link)       
    {
       		echo $link.'<br>';
    }
    
    /////// crawls second layer of links
    
    foreach ($other_links as $link)       
    {
       		
    	$clean_links = crawl_page( $link , $userAgent, $list_links);
    
    	foreach($clean_links as $key => $value) 
    	{ 
      			if( strpos( $value, ".mp3") !== FALSE) 
    		{ 
    			$mp3_links[] = $value;
      			}
    		else
    		{
    			$other_links[] = $value;
    		}
    
    	} 
    
    	$mp3_links = array_values($mp3_links); 
    	$other_links = array_values($other_links); 
    }    
    
    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    
    	if ($link != NULL)
    	{
    		$exists = mysql_query("SELECT * FROM `links` WHERE link = '".mysql_real_escape_string($link)."' LIMIT 1") or die(mysql_error());
    
    		$rows = mysql_num_rows($exists);
    
    		if ( $rows == 0)
    		{
    
    			$type = "mp3";
    
    			$query = "INSERT INTO links (`link`, `type`) VALUES ('".mysql_real_escape_string($link)."' ,'".mysql_real_escape_string($type)."' )";
        	
    			if ($result = mysql_query($query)) 
    			{
         	 			$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     				//echo "<br>";
        			} 
    		} 
    	}
    }
    echo '<br>';
    
    foreach ($other_links as $link)       
    {
    	$type = "link";
    
       		echo $link.'<br>';
    	if (mysql_num_rows(mysql_query("SELECT * FROM `links` WHERE link = '$link' LIMIT 1")) == 0)
    	{
    		$query = "INSERT INTO links ( `link` , `type` ) VALUES ('$link' , '$type' )";
        	
    		if ($result = mysql_query($query)) 
    		{
         	 		$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     			//echo "<br>";
        		}
    	} 
    
    }
    
    
    echo $links_count;
    
    
    function crawl_page( $target_url, $userAgent, $links)
    {
    	$ch = curl_init();
    
    	curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    	curl_setopt($ch, CURLOPT_URL,$target_url);
    	curl_setopt($ch, CURLOPT_FAILONERROR, false);
    	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    	curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    	curl_setopt($ch, CURLOPT_TIMEOUT, 100);
    
    	$html = curl_exec($ch);
    
    	if (!$html) 
    	{
    		echo "<br />cURL error number:" .curl_errno($ch);
    		echo "<br />cURL error:" . curl_error($ch);
    		exit;
    	}
    
    	//
    	// load scrapped data into the DOM
    	//
    
    	$dom = new DOMDocument();
    	@$dom->loadHTML($html);
    
    	//
    	// get only LINKS from the DOM with XPath
    	//
    
    	$xpath = new DOMXPath($dom);
    	$hrefs = $xpath->evaluate("/html/body//a");
    
    	//
    	// go through all the links and store to db or whatever
    	//
    
    
    	for ($i = 0; $i < $hrefs->length; $i++) 
    	{
    		$href = $hrefs->item($i);
    		$url = $href->getAttribute('href');
    
    		//if the $url does not contain the web site base address: http://www.thesite.com/ then add it onto the front
    
    		$clean_link = checkURL( $url, $target_url);
    		$clean_link = str_replace( "http://" , "" , $clean_link);
    		$clean_link = str_replace( "//" , "/" , $clean_link);
    
    		$links[] = $clean_link;
    
    		//removes empty array values
    
    		foreach($links as $key => $value) 
    		{ 
      				if($value == "") 
    			{ 
        				unset($links[$key]); 
      				} 
    		} 
    
    		$links = array_values($links); 
    	}	
    
    	return $links; 
    }
    
    
    function checkURL($url, $target_url)
    {
    
    	if ( strpos($url, ".mp3") !== FALSE )
    	{
    		if ( strpos($url , "http") === FALSE )
    		{
    			//echo 'FIXED: ';
    			$url = $target_url."/".$url;
    			//echo '<br><br>';
    
    			return $url;
    		}
    		return $url;
    	}
    
    	$pos = strpos($url , $target_url);
    
    	if ( $pos === FALSE )
    	{
    		if ( strpos($url , "http") === FALSE )
    		{
    			//echo 'FIXED: ';
    			$url = $target_url."/".$url;
    			//echo '<br><br>';
    
    			return $url;
    		}
    	}
    	else
    	{
    		//echo 'COMPLETE: '.$url;
    		//echo '<br><br>';
    
    		return $url;
    	}
    }	
    ?>
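
A note on the notices, as far as I can tell: $list_links is passed to crawl_page() on lines 21 and 58 without ever being defined, $link_count is incremented without a starting value, and the final echo uses a different name ($links_count). The sketch below shows one way to initialise everything up front; split_links() and the "about" URL are hypothetical and only there for illustration. The "MySQL server has gone away" message often appears when a connection sits idle past the server's timeout, so connecting after the crawl instead of before it may be worth trying, though that is a guess.

```php
<?php
// Split links into direct .mp3 links and everything else.
// Both result arrays exist even when nothing matches, which avoids
// "Undefined variable" notices later on.
function split_links(array $links): array
{
    $mp3 = [];
    $other = [];
    foreach ($links as $link) {
        if ($link === '' || $link === null) {
            continue; // skip empty entries in the same pass
        }
        if (strpos($link, '.mp3') !== false) {
            $mp3[] = $link;
        } else {
            $other[] = $link;
        }
    }
    return ['mp3' => $mp3, 'other' => $other];
}

$list_links = []; // defined before it is passed anywhere
$list_links[] = 'this.bigstereo.net/wp-content/uploads/2009/03/tomorrow-wow-remix.mp3';
$list_links[] = '';
$list_links[] = 'this.bigstereo.net/about'; // made-up non-mp3 link

$split = split_links($list_links);

$link_count = 0; // counters also need a starting value
$link_count += count($split['mp3']);
echo $link_count, "\n";
```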

     

  6. I am receiving errors when I try to put an array of scraped URLs into a DB.

     

    Error:

    this.bigstereo.net/wp-content/uploads/2009/03/tomorrow-wow-remix.mp3
    
    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 82
    this.bigstereo.net/wp-content/uploads/2009/03/01 Counterpoint 1.mp3
    
    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 82
    this.bigstereo.net/wp-content/uploads/2009/03/Lips (Spruce Lee Inner Jungle Mix).mp3
    
    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 82

     

    I was having similar problems with links that contained ' or ", but I cleaned up my query with mysql_real_escape_string() and it was working perfectly gathering links from another site.

    I can't see what the problem is with this. Any suggestions?

     

    Line 82 is:

    			$rows = mysql_num_rows($exists);
    

     

    Thanks

     

    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    
    	if ($link != NULL)
    	{
    		$exists = mysql_query("SELECT * FROM `links` WHERE link = '".mysql_real_escape_string($link)."' LIMIT 1");
    
    		$rows = mysql_num_rows($exists);
    
    		if ( $rows == 0)
    		{
    
    			$type = "mp3";
    
    			$query = "INSERT INTO links (`link`, `type`) VALUES ('".mysql_real_escape_string($link)."' ,'".mysql_real_escape_string($type)."' )";
        	
    			if ($result = mysql_query($query)) 
    			{
         	 			$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     				//echo "<br>";
        			} 
    		} 
    	}
    }

  7. Here:

    if ($link != NULL)
    	{
    		$exists = mysql_query("SELECT * FROM `links` WHERE link = '$link' LIMIT 1");
    
    		$rows = mysql_num_rows($exists);
    
    		if ( $rows == 0)
    		{
    
    			$type = "mp3";
    
    			$query = "INSERT INTO links (`link`, `type`) VALUES ('$link' ,'$type' )";
        	
    			if ($result = mysql_query($query)) 
    			{
         	 			$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     				//echo "<br>";
        			} 
    		} 
    	}

     

    It only errors on links that contain ' and possibly ".

     

     

     

  8. Well, there are two instances of it, but yes, it is mysql_num_rows() that is giving the error.

    I was able to make it mostly work by cleaning up my query using ` ` (backticks).

    But now I can see that it still errors on links that have ' or " in their names.

     

    Examples:

     

    rednicko.com/080923/Klaxons-Gravity'sRainbow(Guns'N'BombsFreakoutRemix).mp3

     

    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 76

    rednicko.com/080923/GhostfaceKiller-CharlieBrown(Guns'N'BombsRemix).mp3

     

    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 76
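
For links containing ' or ", one option is to stop building the SQL string by hand and bind the value as a parameter instead. This is only a sketch, not the code from the scraper: it uses PDO with an in-memory SQLite database as a stand-in for MySQL (which assumes the pdo_sqlite extension is installed), and add_link() is a hypothetical helper.

```php
<?php
// In-memory SQLite stands in for the MySQL database (assumes pdo_sqlite).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (link TEXT, type TEXT)');

$select = $pdo->prepare('SELECT COUNT(*) FROM links WHERE link = ?');
$insert = $pdo->prepare('INSERT INTO links (link, type) VALUES (?, ?)');

// Insert the link unless it is already stored; returns true when a row was added.
function add_link(PDOStatement $select, PDOStatement $insert, string $link, string $type): bool
{
    $select->execute([$link]);
    $exists = (int) $select->fetchColumn() > 0;
    $select->closeCursor();
    if ($exists) {
        return false;
    }
    $insert->execute([$link, $type]);
    return true;
}

// One of the links from the post, quotes and all.
$link = "rednicko.com/080923/Klaxons-Gravity'sRainbow(Guns'N'BombsFreakoutRemix).mp3";
$first = add_link($select, $insert, $link, 'mp3');
$second = add_link($select, $insert, $link, 'mp3');
var_dump($first, $second);
```

Because the link is passed separately from the SQL text, the quotes in it never become part of the query syntax. With MySQL the same pattern should apply with a `mysql:` DSN in place of `sqlite::memory:`.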

     

  9. My code below crawls through a blog and then inserts the found links into my database.

    However, I receive the following error each time I try to insert a link:

    Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home2/sharingi/public_html/scrape/url_scraperV2.php on line 74
    

     

    Line 74 compares the link to be inserted with existing rows to avoid duplicates.

     

    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    
    	$query = mysql_query("SELECT * FROM links WHERE link=$link LIMIT 1");
    
    	$rows = mysql_num_rows($query);
    
    	if ( $rows == 0)
    	{
    		$query = "INSERT INTO links (link) VALUES ('$link')";
        	
    		if ($result = mysql_query($query)) 
    		{
         	 		$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     			//echo "<br>";
        		} 
    	} 
    } 

     

    I am also noticing that, even with this error, about 1200 rows are inserted when it should be only about 600.

    This code worked fine in another version of the page. Any idea what I am doing wrong?

     

    Thanks

     

    <?php
    
    mysql_connect("localhost","sharingi_ian","*****")or die ("Could not connect to database");
    mysql_select_db("sharingi_scrape") or die ("Could not select database");
    
    //$target_url = "http://empreintes-digitales.fr";
    $target_url = 'http://redthreat.wordpress.com/';
    //$target_url= 'http://www.kissatlanta.com/blog/';
    //$target_url= 'http://www.empreintes-digitales.fr/';
    
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    
    // crawl first page
    $clean_links = crawl_page( $target_url, $userAgent, $list_links);
    
    // seperates links into links that are direct mp3 links and other links.
    //
    
    foreach($clean_links as $key => $value) 
    { 
      		if( strpos( $value, ".mp3") !== FALSE) 
    	{ 
    		$mp3_links[] = $value;
      		}
    	else
    	{
    		$other_links[] = $value;
    	}
    } 
    
    $mp3_links = array_values($mp3_links); 
    $other_links = array_values($other_links); 
    
    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    }
    
    echo '<br>';
    
    foreach ($other_links as $link)       
    {
       		echo $link.'<br>';
    }
    
    /////// crawls second layer of links
    
    foreach ($other_links as $link)       
    {
       		$clean_links = crawl_page( $link , $userAgent, $list_links);
    
    	foreach($clean_links as $key => $value) 
    	{ 
      			if( strpos( $value, ".mp3") !== FALSE) 
    		{ 
    			$mp3_links[] = $value;
      			}
    		else
    		{
    			$other_links[] = $value;
    		}
    	} 
    
    	$mp3_links = array_values($mp3_links); 
    	$other_links = array_values($other_links); 
    }    
    
    foreach ($mp3_links as $link)       
    {
       		echo $link.'<br>';
    
    	$query = mysql_query("SELECT * FROM links WHERE link=$link LIMIT 1");
    
    	$rows = mysql_num_rows($query);
    
    	if ( $rows == 0)
    	{
    		$query = "INSERT INTO links (link) VALUES ('$link')";
        	
    		if ($result = mysql_query($query)) 
    		{
         	 		$link_count = $link_count + 1; //echo "<b>link added to db</b>";
     			//echo "<br>";
        		} 
    	} 
    }
    
    echo '<br>';
    
    foreach ($other_links as $link)       
    {
       		echo $link.'<br>';
    	if (mysql_num_rows(mysql_query("SELECT * FROM links WHERE link=$link LIMIT 1")) == 0)
    	{
    		$query = "INSERT INTO links (link) VALUES ('$link')";
        	
    		if ($result = mysql_query($query)) 
    		{
         	 		$link_count = $link_count + 1; 
        		}
    } 
    
    }
    
    
    echo $links_count;
    
    
    function crawl_page( $target_url, $userAgent, $links)
    {
    	$ch = curl_init();
    
    	curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    	curl_setopt($ch, CURLOPT_URL,$target_url);
    	curl_setopt($ch, CURLOPT_FAILONERROR, false);
    	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    	curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    	curl_setopt($ch, CURLOPT_TIMEOUT, 100);
    
    	$html = curl_exec($ch);
    
    	if (!$html) 
    	{
    		echo "<br />cURL error number:" .curl_errno($ch);
    		echo "<br />cURL error:" . curl_error($ch);
    		exit;
    	}
    
    	//
    	// load scrapped data into the DOM
    	//
    
    	$dom = new DOMDocument();
    	@$dom->loadHTML($html);
    
    	//
    	// get only LINKS from the DOM with XPath
    	//
    
    	$xpath = new DOMXPath($dom);
    	$hrefs = $xpath->evaluate("/html/body//a");
    
    	//
    	// go through all the links and store to db or whatever
    	//
    
    
    	for ($i = 0; $i < $hrefs->length; $i++) 
    	{
    		$href = $hrefs->item($i);
    		$url = $href->getAttribute('href');
    
    		//if the $url does not contain the web site base address: http://www.thesite.com/ then add it onto the front
    
    		$clean_link = checkURL( $url, $target_url);
    		$clean_link = str_replace( "http://" , "" , $clean_link);
    		$clean_link = str_replace( "//" , "/" , $clean_link);
    
    		$links[] = $clean_link;
    
    		//removes empty array values
    
    		foreach($links as $key => $value) 
    		{ 
      				if($value == "") 
    			{ 
        				unset($links[$key]); 
      				} 
    		} 
    
    		$links = array_values($links); 
    	}	
    
    	return $links; 
    }
    
    
    function checkURL($url, $target_url)
    {
    
    	if ( strpos($url, ".mp3") !== FALSE )
    	{
    		if ( strpos($url , "http") === FALSE )
    		{
    			//echo 'FIXED: ';
    			$url = $target_url."/".$url;
    			//echo '<br><br>';
    
    			return $url;
    		}
    		return $url;
    	}
    
    	$pos = strpos($url , $target_url);
    
    	if ( $pos === FALSE )
    	{
    		if ( strpos($url , "http") === FALSE )
    		{
    			//echo 'FIXED: ';
    			$url = $target_url."/".$url;
    			//echo '<br><br>';
    
    			return $url;
    		}
    	}
    	else
    	{
    		//echo 'COMPLETE: '.$url;
    		//echo '<br><br>';
    
    		return $url;
    	}
    }	
    ?>
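
A possible reading of both symptoms, offered as a guess: in `link=$link` the value is not quoted, so the SELECT is a syntax error and mysql_query() returns false. The row count of a failed query then comes back as false (or null), which compares loosely equal to 0, so the duplicate check passes every time and every link is inserted again. The snippet below shows the comparison in plain PHP; num_rows_or_false() is a hypothetical stand-in and the example link is made up.

```php
<?php
// Stand-in for mysql_num_rows() being handed a failed query: the warning in the
// post means the argument was not a result resource, so no usable count comes back.
function num_rows_or_false($result)
{
    return is_array($result) ? count($result) : false;
}

$failed_query = false; // what a query with a syntax error yields
$rows = num_rows_or_false($failed_query);

$loose = ($rows == 0);   // true: false compares equal to 0
$strict = ($rows === 0); // false: the failure is not mistaken for "no rows"
var_dump($loose, $strict);

// The unquoted value is the likely syntax error: compare the two strings.
$link = 'redthreat.wordpress.com/about'; // made-up link
echo "SELECT * FROM links WHERE link=$link LIMIT 1", "\n";
echo "SELECT * FROM links WHERE link='$link' LIMIT 1", "\n";
```

Checking the result of mysql_query() for false before counting rows, and quoting (and escaping) the value, should address both the warning and the duplicate rows.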

  10. My code below loops over retrieved URLs, storing them in an array while displaying them.

    The URLs display correctly, so I know my function is working. However, I must be storing them in the array or trying to display them incorrectly, because that part of my code does not work.

    What am I doing wrong here?

     

    	for ($i = 0; $i < $hrefs->length; $i++) 
    	{
    		$href = $hrefs->item($i);
    		$url = $href->getAttribute('href');
    
    
    		echo '<b>'.$links = $clean_link = checkURL( $url, $target_url).'<b>';
    		echo '<br>';
    
    
    
    	}	
    
    	echo count($links);
    
    	foreach ($links as $link) 		
    	{
       			echo $link;
    		echo '<br>';
    	}	
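
What seems to be happening: `$links = ...` replaces the variable on every pass instead of adding to an array, and because `.` binds tighter than `=`, the trailing '<b>' ends up inside the stored string too. A minimal sketch of appending with `[]`, using made-up URLs in place of the values checkURL() returns:

```php
<?php
// Hypothetical cleaned URLs, standing in for the values checkURL() returns in the loop.
$cleaned = ['example.com/a', 'example.com/b', 'example.com/c'];

$links = []; // start with an empty array
foreach ($cleaned as $clean_link) {
    $links[] = $clean_link; // [] appends; a plain = would overwrite
    echo '<b>' . $clean_link . '</b><br>';
}

echo count($links), "\n";
```

Doing the assignment on its own line and the echo separately keeps the markup out of the stored values.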
    

     

     

  11. I finally got it to work, but not how it should.

    If I use strpos($url, "http"); it works.

    However, if I use strpos($url, $target_url); it's always false, like you said, because it's not comparing correctly.

    Any ideas on that one?

  12. To be more specific:

     

    function checkURL($url, $target_url)
    {
    	echo $url.'<br>';
    	echo $target_url.'<br>';
    
    	echo gettype($url).'<br>';
    	echo gettype($target_url).'<br>';
    
    	echo '<b>';
    	echo $pos = strpos($url , $target_url);
    	echo '</b>';
    
    
    }
    

     

    Returns:

    http://empreintes-digitales.fr/board/register.php
    http://www.empreintes-digitales.fr
    string
    string
    http://empreintes-digitales.fr/board/login.php?action=forget
    http://www.empreintes-digitales.fr
    string
    string
    #
    http://www.empreintes-digitales.fr
    string
    string
    http://66.102.9.104/translate_c?hl=fr&sl=fr&tl=en&u=www.empreintes-digitales.fr/index.php
    http://www.empreintes-digitales.fr
    string
    string

     

    And on and on. Nothing from

    $pos = strpos()
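
Looking at the output above, the target contains "www." while the scraped URLs do not, so strpos() returns false, and echo prints false as an empty string. One way around it, sketched with hypothetical helpers normalized_host() and same_site(), is to compare hosts instead of raw strings:

```php
<?php
// Host of a URL, lower-cased and without a leading "www."; null when there is none.
function normalized_host(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    return preg_replace('/^www\./', '', strtolower($host));
}

// True when both URLs point at the same site, ignoring "www.".
function same_site(string $url, string $target_url): bool
{
    $host = normalized_host($url);
    return $host !== null && $host === normalized_host($target_url);
}

$target_url = 'http://www.empreintes-digitales.fr';
var_dump(same_site('http://empreintes-digitales.fr/board/register.php', $target_url));
var_dump(same_site('#', $target_url));

// The original comparison, for reference: the needle contains "www." but the
// haystack does not, so strpos() finds nothing.
var_dump(strpos('http://empreintes-digitales.fr/board/register.php', $target_url));
```

This treats the "www" and non-"www" forms as the same site, which is an assumption about what the crawler should do.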

     

     

  13. If you look in the source, you can see it actually uses a JavaScript call to add the item to the cart when the Buy Now button is clicked.

     

    <table cellspacing="0" cellpadding="0" onclick="javascript: document.orderform_72_1220992491.submit();" class="ButtonTable">

    <tr><td><img src="/store/skin1/images/but1.gif" class="ButtonSide" alt="" /></td><td class="Button"><font class="Button">Buy Now</font></td><td><img src="/store/skin1/images/but2.gif" class="ButtonSide" alt="" /></td></tr>

    </table>

     

    I think.
