DOMXPath error while looping with While for site scrape


max_maggot
Solved by Ch0cu3r

Hi all,

I am writing a script to scrape a website. The client wants extra details captured from the site, so I've had to use XPath (DOMXPath) as well as the HTML DOM parser to retrieve these values. I'm getting an error when the while loop moves from 0 to 1, in the code shown below. I know it looks like a lot of code, but I believe the error is in the creation/destruction of the DOMXPath object; the rest of the code is for the scrape. It moves through a list of product pages for a category of products, scrapes the appropriate data, then moves on to the next category and the product pages associated with it.

The program dies on the line $xpath = new DOMXPath($html_x); when the $i variable reaches 1 (the second time through the loop). If you need the whole source code, I can PM you. This has been driving me crazy all day.

 

Thanks for any help you can provide. It is very much appreciated.

function get_product_details($product_link_list)
{
    //Open CSV file
    //a+ opens the file for writing and placing the pointer at the end of the file to append new data
    //If the file does not exist a+ will try to create the file products.csv
    //Appending the data to this file happens later in this function.

    //looping variable
    $i = 0;
    global $file_handle;
    $html = new DOMDocument();
    $html_x= new DOMXPath();
    //Load DOM of product page
    //@$html->loadHtmlFile($category_sub_page_list[$i]);

    while ($i <= count($product_link_list)) { //loop through each of the product details pages and scrape data {

        $html->loadHTMLFile($product_link_list[$i]);
        $xpath = new DOMXPath($html_x);


        $csv_details = "";
        $html = file_get_html($product_link_list[$i]);
        $items = array();

        //TODO: title is wrong, finish scraped values, Fix up headings at top of code.
        foreach ($html->find('div.main') as $article) {

            //capture content from website
            $item['title'] = $article->find('li.product', 0)->plaintext;
            $item['sku'] = $article->find('div.sku-no', 0)->plaintext;
            $item['price'] = $article->find('div.price-box', 0)->plaintext;
            //Capture HTML code and content for description
            $item['description'] = $article->find('div.std', 0)->outertext;
            $items['categories'] = $xpath->query("//a[@class='in-category']")->item(0)->textContent;
            $items['image'] = $xpath->query("//img/@src")->item(0)->textContent; //Get and Set Product Image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent;
            $items['media_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent;
            $items['meta_description'] = $xpath->query("//meta[@name='description']/@content")->item(0)->textContent;
            $items['meta_keyword'] = $xpath->query("//meta[@name='keywords']/@content")->item(0)->textContent; //Get and Set meta keywords
            $item['name'] = $article->find('li.product', 0)->plaintext;
            $items['short_description'] = $xpath->query("//div[@class='short-description']")->item(0)->textContent; //Get and Set short description
            $items['small_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent; //Get and Set small image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent; //Get and Set thumbnail label
            $items['weight'] = $xpath->query("//td[@class='et7']/text()")->item(0)->textContent;

            //Trim and fix values
            $item['title'] = str_replace(' ', '', $item['title']);
            $item['sku'] = str_replace('SKU:', '', $item['sku']);
            $item['price'] = str_replace(' ', '', $item['price']);
            $item['name'] = str_replace(' ', '', $item['price']);
            //Remove HTML code from the scraped data.
            $items['short_description'] = trim(preg_replace('/\s\s+/', ' ', $items['short_description']));
            $items['thumbnail_label'] = trim(preg_replace('/\s\s+/', ' ', $items['thumbnail_label']));

            //Assign values to temporary variables for writing to CSV
            $sku = $item['sku'];                                    //Scrape
            $title = $item['title'];                                //Scrape
            $_store = " ";                                          //Default value
            $_attribute_set = "Default";                            //Default value
            $_type = "Simple";                                      //Default value
            $_category = $items['categories'];                       //Scrape
            $_root_category = "Default Category";                   //Default value
            $_product_websites = "base";                            //Default value
            $color = " ";                                           //Default value
            $cost = $item['price'];                                 //Scrape
            $country_of_manufacture = " ";                          //Set Value - No tag to get this
            $created_at = " ";                                      //Default value
            $custom_design = "99";                                  //Default value
            $custom_design_from = "1";                              //Default value
            $custom_design_to = " ";                                //Default value
            $custom_layout_update = " ";                            //Default Value
            $description = $item['description'];                    //Scrape
            $gallery = " ";                                         //Default Value
            $gift_message_available = " ";                          //Default Value
            $has_options = "1";                                     //Default value
            $image = $items['image'];                               //Scrape
            $image_label = $items['thumbnail_label'];                 //Scrape
            $manufacturer = " ";                                    //Default - cannot scrape
            $media_gallery = $items['_media_image'];                  //Scrape
            $meta_description =  $items['meta_description'];          //Scrape
            $meta_keyword = $items['meta_keyword'];                   //Scrape
            $meta_title = $item['title'];                           //Scrape
            $minimal_price = " ";                                   //Default Value
            $msrp = "0";                                            //Default value
            $msrp_display_actual_price_type = "100";                //Default value
            $msrp_enabled = "0";                                    //Default value
            $name = $item['name'];                                   //Scrape
            $news_from_date = " ";                                  //Default value
            $news_to_date = " ";                                    //Default value
            $options_container = "1";                               //Default value
            $page_layout = " ";                                     //Default value
            $price = $item['price'];                                //Scrape
            $required_options = "0";                                //Default value
            $short_description = $items['short_description'];         //Scrape
            $small_image = $items['small_image'];                     //Scrape
            $small_image_label = $items['thumbnail_label'];           //Scrape
            $special_from_date = " ";                               //Default value
            $special_price = " ";                                   //Default value
            $special_to_date = " ";                                 //Default value
            $status = "1";                                          //Default value
            $tax_class_id = "1";                                    //Default value
            $thumbnail = "0";                                       //Default value
            $thumbnail_label = "1";                                 //Default value
            $updated_at = "0";                                      //Default value
            $url_key = "0";                                         //Default value
            $url_path = " ";                                        //Default value
            $visibility = " ";                                      //Default value
            $weight = $items['weight'];                               //Scrape and Ramon has Work to do
            $qty = "100";                                           //Default value
            $min_qty = " ";                                         //Default value
            $use_config_min_qty = " ";                              //Default value
            $is_qty_decimal = " ";                                  //Default value
            $backorders = " ";                                      //Default value
            $use_config_backorders = " ";                           //Default value
            $min_sale_qty = " ";                                    //Default value
            $use_config_min_sale_qty = " ";                         //Default value
            $max_sale_qty = " ";                                    //Default value
            $use_config_max_sale_qty = " ";                         //Default value
            $is_in_stock = "1";                                     //Default value
            $notify_stock_qty = " ";                                //Default value
            $use_config_notify_stock_qty = " ";                     //Default value
            $manage_stock = "88";                                   //Default value
            $use_config_manage_stock = "0";                         //Default value
            $stock_status_changed_auto = " ";                       //Default value
            $use_config_qty_increments = "1";                       //Default value
            $qty_increments = " ";                                  //Default value
            $use_config_enable_qty_inc = " ";                       //Default value
            $is_decimal_divided = " ";                              //Default value
            $_links_related_sku = " ";                              //Default value
            $_links_related_position = " ";                         //Default value
            $_links_crosssell_sku = " ";                            //Default value
            $_links_crosssell_position = " ";                       //Default value
            $_links_upsell_sku = " ";                               //Default value
            $_links_upsell_position = " ";                          //Default value
            $_associated_sku = "0";                                 //Default value
            $_associated_default_qty = " ";                         //Default value
            $_associated_position = "0";                            //Default value
            $_tier_price_website = " ";                             //Default value
            $_tier_price_customer_group = " ";                      //Default value
            $_tier_price_qty = " ";                                 //Default value
            $_tier_price_price = " ";                               //Default value
            $_group_price_website = " ";                            //Default value
            $_group_price_customer_group = " ";                     //Default value
            $_group_price_price = " ";                              //Default value
            $_media_attribute_id = " ";                             //Default value
            $_media_image = " ";                                    //Default value
            $_media_label = " ";                                    //Default value
            $_media_position = " ";                                 //Default value
            $_media_is_disabled = " ";                              //Default value
            $_custom_option_store = " ";                            //Default value
            $_custom_option_type = " ";                             //Default value
            $_custom_option_title = " ";                            //Default value
            $_custom_option_is_required = " ";                      //Default value
            $_custom_option_price = " ";                            //Default value
            $_custom_option_sku = " ";                              //Default value
            $_custom_option_max_characters = " ";                   //Default value
            $_custom_option_sort_order = " ";                       //Default value
            $_custom_option_row_title = " ";                        //Default value
            $_custom_option_row_price = " ";                        //Default value
            $_custom_option_row_sku = " ";                          //Default value
            $_custom_option_row_sort = " ";                         //Default value
            $enable_config_enable_qty_inc = " ";                    //Default value
            $enable_qty_inc = " ";                                  //Default value

            //Append data to CSV file
            $csv_details .= $sku . "," . $title . "," . $_store . "," . $_attribute_set . "," . $_type . "," . $_category . "," . $_root_category . "," . $_product_websites . ","
                . $color . "," . $cost . "," . $country_of_manufacture . "," . $created_at . "," . $custom_design . "," . $custom_design_from . "," . $custom_design_to . ","
                . "," . $custom_layout_update . "," . $description . "," . $gallery . "," . $gift_message_available . "," . $has_options . "," . $image . "," . $image_label . ","
                . $manufacturer . "," . $media_gallery . "," . $meta_description . "," . $meta_keyword . "," . $meta_title . "," . $minimal_price . "," . $msrp . "," .
                $msrp_display_actual_price_type . "," . $msrp_enabled . "," . $name . "," . $news_from_date . "," . $news_to_date . "," . $options_container . "," .
                $page_layout . "," . $price . "," . $required_options . "," . $short_description . "," . $small_image . "," . $small_image_label . "," . $special_from_date
                . "," . $special_price . "," . $special_to_date . "," . $status . "," . $tax_class_id . "," . $thumbnail . "," . $thumbnail_label . "," . $updated_at . "," .
                $url_key . "," . $url_path . "," . $visibility . "," . $weight . "," . $qty . "," . $min_qty . "," . $use_config_min_qty . "," . $is_qty_decimal . "," .
                $backorders . "," . $use_config_backorders . "," . $min_sale_qty . "," . $use_config_min_sale_qty . "," . $max_sale_qty . "," . $use_config_max_sale_qty
                . "," . $is_in_stock . "," . $notify_stock_qty . "," . $use_config_notify_stock_qty . "," . $manage_stock . "," . $use_config_manage_stock . "," .
                $stock_status_changed_auto . "," . $use_config_qty_increments . "," . $qty_increments . "," . $use_config_enable_qty_inc . "," . $qty_increments . "," .
                $use_config_qty_increments . "," . "," . $is_decimal_divided . "," . $_links_related_sku . "," . $_links_related_position . "," .
                $_links_crosssell_position . "," . $_links_crosssell_sku . "," . $_links_crosssell_position . "," . $_links_upsell_sku . "," . $_links_upsell_position
                . "," . $_associated_sku . "," . $_associated_default_qty . "," . $_associated_position . "," . $_tier_price_website . "," . $_tier_price_customer_group
                . "," . $_tier_price_qty . "," . $_tier_price_price . "," . $_group_price_website . "," . $_group_price_customer_group . "," . $_group_price_price . "," .
                $_media_attribute_id . "," . $_media_image . "," . $_media_label . "," . $_media_position . "," . $_media_is_disabled . "," . $_custom_option_store . "," .
                $_custom_option_type . "," . $_custom_option_title . "," . $_custom_option_is_required . "," . $_custom_option_price . "," . $_custom_option_sku . "," .
                $_custom_option_max_characters . "," . $_custom_option_sort_order . "," . $_custom_option_row_sort . "," . $_custom_option_row_title . "," .
                $_custom_option_row_price . "," . $_custom_option_row_sku . "," . $_custom_option_row_sort . "," . $enable_config_enable_qty_inc . "," . $enable_qty_inc . "\r\n";


            fwrite($file_handle, $csv_details);

            //move to next product category page
            $i++;
        }

    }
}

OK,

I fixed the initial problem; see the code snippet below. The script still crashes after it completes its task. Any help is much appreciated.

//This function takes all the product pages for a category and opens each page individually.
//It then scrapes all the information about that product and stores the details in a CSV file.

function get_product_details($product_link_list)
{
    //Open CSV file
    //a+ opens the file for writing and placing the pointer at the end of the file to append new data
    //If the file does not exist a+ will try to create the file products.csv
    //Appending the data to this file happens later in this function.

    //looping variable
    $i = 0;
    global $file_handle;

    //Load DOM of product page
    //@$html->loadHtmlFile($category_sub_page_list[$i]);

    while ($i <= count($product_link_list)) { //loop through each of the product details pages and scrape data {
        $html = new DOMDocument();
        $html->loadHTMLFile($product_link_list[$i]);
        $xpath = new DOMXPath($html);



        $csv_details = "";
        $html = file_get_html($product_link_list[$i]);
        $items = array();

        //TODO: title is wrong, finish scraped values, Fix up headings at top of code.
        foreach ($html->find('div.main') as $article) {

            //capture content from website
            $item['title'] = $article->find('li.product', 0)->plaintext;
            $item['sku'] = $article->find('div.sku-no', 0)->plaintext;
            $item['price'] = $article->find('div.price-box', 0)->plaintext;
            //Capture HTML code and content for description
            $item['description'] = $article->find('div.std', 0)->outertext;
            $items['categories'] = $xpath->query("//a[@class='in-category']")->item(0)->textContent;
            $items['image'] = $xpath->query("//img/@src")->item(0)->textContent; //Get and Set Product Image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent;
            $items['media_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent;
            $items['meta_description'] = $xpath->query("//meta[@name='description']/@content")->item(0)->textContent;
            $items['meta_keyword'] = $xpath->query("//meta[@name='keywords']/@content")->item(0)->textContent; //Get and Set meta keywords
            $item['name'] = $article->find('li.product', 0)->plaintext;
            $items['short_description'] = $xpath->query("//div[@class='short-description']")->item(0)->textContent; //Get and Set short description
            $items['small_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent; //Get and Set small image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent; //Get and Set thumbnail label
            $items['weight'] = $xpath->query("//td[@class='et7']/text()")->item(0)->textContent;

            //Trim and fix values
            $item['title'] = str_replace(' ', '', $item['title']);
            $item['sku'] = str_replace('SKU:', '', $item['sku']);
            $item['price'] = str_replace(' ', '', $item['price']);
            $item['name'] = str_replace(' ', '', $item['price']);
            //Remove HTML code from the scraped data.
            $items['short_description'] = trim(preg_replace('/\s\s+/', ' ', $items['short_description']));
            $items['thumbnail_label'] = trim(preg_replace('/\s\s+/', ' ', $items['thumbnail_label']));

            //Assign values to temporary variables for writing to CSV
            $sku = $item['sku'];                                    //Scrape
            $title = $item['title'];                                //Scrape
            $_store = " ";                                          //Default value
            $_attribute_set = "Default";                            //Default value
            $_type = "Simple";                                      //Default value
            $_category = $items['categories'];                       //Scrape
            $_root_category = "Default Category";                   //Default value
            $_product_websites = "base";                            //Default value
            $color = " ";                                           //Default value
            $cost = $item['price'];                                 //Scrape
            $country_of_manufacture = " ";                          //Set Value - No tag to get this
            $created_at = " ";                                      //Default value
            $custom_design = "99";                                  //Default value
            $custom_design_from = "1";                              //Default value
            $custom_design_to = " ";                                //Default value
            $custom_layout_update = " ";                            //Default Value
            $description = $item['description'];                    //Scrape
            $gallery = " ";                                         //Default Value
            $gift_message_available = " ";                          //Default Value
            $has_options = "1";                                     //Default value
            $image = $items['image'];                               //Scrape
            $image_label = $items['thumbnail_label'];                 //Scrape
            $manufacturer = " ";                                    //Default - cannot scrape
            $media_gallery = $items['_media_image'];                  //Scrape
            $meta_description =  $items['meta_description'];          //Scrape
            $meta_keyword = $items['meta_keyword'];                   //Scrape
            $meta_title = $item['title'];                           //Scrape
            $minimal_price = " ";                                   //Default Value
            $msrp = "0";                                            //Default value
            $msrp_display_actual_price_type = "100";                //Default value
            $msrp_enabled = "0";                                    //Default value
            $name = $item['name'];                                   //Scrape
            $news_from_date = " ";                                  //Default value
            $news_to_date = " ";                                    //Default value
            $options_container = "1";                               //Default value
            $page_layout = " ";                                     //Default value
            $price = $item['price'];                                //Scrape
            $required_options = "0";                                //Default value
            $short_description = $items['short_description'];         //Scrape
            $small_image = $items['small_image'];                     //Scrape
            $small_image_label = $items['thumbnail_label'];           //Scrape
            $special_from_date = " ";                               //Default value
            $special_price = " ";                                   //Default value
            $special_to_date = " ";                                 //Default value
            $status = "1";                                          //Default value
            $tax_class_id = "1";                                    //Default value
            $thumbnail = "0";                                       //Default value
            $thumbnail_label = "1";                                 //Default value
            $updated_at = "0";                                      //Default value
            $url_key = "0";                                         //Default value
            $url_path = " ";                                        //Default value
            $visibility = " ";                                      //Default value
            $weight = $items['weight'];                               //Scrape and Ramon has Work to do
            $qty = "100";                                           //Default value
            $min_qty = " ";                                         //Default value
            $use_config_min_qty = " ";                              //Default value
            $is_qty_decimal = " ";                                  //Default value
            $backorders = " ";                                      //Default value
            $use_config_backorders = " ";                           //Default value
            $min_sale_qty = " ";                                    //Default value
            $use_config_min_sale_qty = " ";                         //Default value
            $max_sale_qty = " ";                                    //Default value
            $use_config_max_sale_qty = " ";                         //Default value
            $is_in_stock = "1";                                     //Default value
            $notify_stock_qty = " ";                                //Default value
            $use_config_notify_stock_qty = " ";                     //Default value
            $manage_stock = "88";                                   //Default value
            $use_config_manage_stock = "0";                         //Default value
            $stock_status_changed_auto = " ";                       //Default value
            $use_config_qty_increments = "1";                       //Default value
            $qty_increments = " ";                                  //Default value
            $use_config_enable_qty_inc = " ";                       //Default value
            $is_decimal_divided = " ";                              //Default value
            $_links_related_sku = " ";                              //Default value
            $_links_related_position = " ";                         //Default value
            $_links_crosssell_sku = " ";                            //Default value
            $_links_crosssell_position = " ";                       //Default value
            $_links_upsell_sku = " ";                               //Default value
            $_links_upsell_position = " ";                          //Default value
            $_associated_sku = "0";                                 //Default value
            $_associated_default_qty = " ";                         //Default value
            $_associated_position = "0";                            //Default value
            $_tier_price_website = " ";                             //Default value
            $_tier_price_customer_group = " ";                      //Default value
            $_tier_price_qty = " ";                                 //Default value
            $_tier_price_price = " ";                               //Default value
            $_group_price_website = " ";                            //Default value
            $_group_price_customer_group = " ";                     //Default value
            $_group_price_price = " ";                              //Default value
            $_media_attribute_id = " ";                             //Default value
            $_media_image = " ";                                    //Default value
            $_media_label = " ";                                    //Default value
            $_media_position = " ";                                 //Default value
            $_media_is_disabled = " ";                              //Default value
            $_custom_option_store = " ";                            //Default value
            $_custom_option_type = " ";                             //Default value
            $_custom_option_title = " ";                            //Default value
            $_custom_option_is_required = " ";                      //Default value
            $_custom_option_price = " ";                            //Default value
            $_custom_option_sku = " ";                              //Default value
            $_custom_option_max_characters = " ";                   //Default value
            $_custom_option_sort_order = " ";                       //Default value
            $_custom_option_row_title = " ";                        //Default value
            $_custom_option_row_price = " ";                        //Default value
            $_custom_option_row_sku = " ";                          //Default value
            $_custom_option_row_sort = " ";                         //Default value
            $enable_config_enable_qty_inc = " ";                    //Default value
            $enable_qty_inc = " ";                                  //Default value

            //Append data to CSV file
            $csv_details .= $sku . "," . $title . "," . $_store . "," . $_attribute_set . "," . $_type . "," . $_category . "," . $_root_category . "," . $_product_websites . ","
                . $color . "," . $cost . "," . $country_of_manufacture . "," . $created_at . "," . $custom_design . "," . $custom_design_from . "," . $custom_design_to . ","
                . "," . $custom_layout_update . "," . $description . "," . $gallery . "," . $gift_message_available . "," . $has_options . "," . $image . "," . $image_label . ","
                . $manufacturer . "," . $media_gallery . "," . $meta_description . "," . $meta_keyword . "," . $meta_title . "," . $minimal_price . "," . $msrp . "," .
                $msrp_display_actual_price_type . "," . $msrp_enabled . "," . $name . "," . $news_from_date . "," . $news_to_date . "," . $options_container . "," .
                $page_layout . "," . $price . "," . $required_options . "," . $short_description . "," . $small_image . "," . $small_image_label . "," . $special_from_date
                . "," . $special_price . "," . $special_to_date . "," . $status . "," . $tax_class_id . "," . $thumbnail . "," . $thumbnail_label . "," . $updated_at . "," .
                $url_key . "," . $url_path . "," . $visibility . "," . $weight . "," . $qty . "," . $min_qty . "," . $use_config_min_qty . "," . $is_qty_decimal . "," .
                $backorders . "," . $use_config_backorders . "," . $min_sale_qty . "," . $use_config_min_sale_qty . "," . $max_sale_qty . "," . $use_config_max_sale_qty
                . "," . $is_in_stock . "," . $notify_stock_qty . "," . $use_config_notify_stock_qty . "," . $manage_stock . "," . $use_config_manage_stock . "," .
                $stock_status_changed_auto . "," . $use_config_qty_increments . "," . $qty_increments . "," . $use_config_enable_qty_inc . "," . $qty_increments . "," .
                $use_config_qty_increments . "," . "," . $is_decimal_divided . "," . $_links_related_sku . "," . $_links_related_position . "," .
                $_links_crosssell_position . "," . $_links_crosssell_sku . "," . $_links_crosssell_position . "," . $_links_upsell_sku . "," . $_links_upsell_position
                . "," . $_associated_sku . "," . $_associated_default_qty . "," . $_associated_position . "," . $_tier_price_website . "," . $_tier_price_customer_group
                . "," . $_tier_price_qty . "," . $_tier_price_price . "," . $_group_price_website . "," . $_group_price_customer_group . "," . $_group_price_price . "," .
                $_media_attribute_id . "," . $_media_image . "," . $_media_label . "," . $_media_position . "," . $_media_is_disabled . "," . $_custom_option_store . "," .
                $_custom_option_type . "," . $_custom_option_title . "," . $_custom_option_is_required . "," . $_custom_option_price . "," . $_custom_option_sku . "," .
                $_custom_option_max_characters . "," . $_custom_option_sort_order . "," . $_custom_option_row_sort . "," . $_custom_option_row_title . "," .
                $_custom_option_row_price . "," . $_custom_option_row_sku . "," . $_custom_option_row_sort . "," . $enable_config_enable_qty_inc . "," . $enable_qty_inc . "\r\n";


            fwrite($file_handle, $csv_details);

            //move to next product category page
            $i++;
        }

    }
}
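
A side note on the CSV row assembled above: the description field holds raw HTML (outertext), which can itself contain commas and quotes, so joining values with "," may shift columns in the output. A minimal sketch of letting fputcsv() handle the quoting instead; the row contents are made up for illustration, and php://memory stands in for the script's $file_handle.

```php
<?php
// Hypothetical row; these values are illustrative and not taken from the real site.
$row = array(
    'sku'         => 'ABC-123',
    'title'       => 'Widget, large',
    'description' => '<div class="std">He said "hello"</div>',
);

// php://memory keeps the example self-contained; the posted script would pass $file_handle.
$handle = fopen('php://memory', 'w+');

// fputcsv() wraps fields that contain the delimiter or quotes and doubles embedded quotes.
fputcsv($handle, array_values($row), ',', '"', '\\');

// Read the line back to inspect what was written.
rewind($handle);
$line = stream_get_contents($handle);
fclose($handle);

echo $line;
```

Reading the line back with str_getcsv() should return the original three values, which plain string concatenation would not guarantee for this row.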

  • Solution

If you are using Simple HTML DOM, there is no need to use XPath as well. Use one or the other.

Both return the same result; they just use different syntax for selecting the data you need.

 

The XPath //div[@class='highslide-gallery'] is the equivalent of div.highslide-gallery in Simple HTML DOM (it uses CSS selectors as the DOM path), at least for elements whose class attribute is exactly highslide-gallery.
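
To make the comparison concrete, here is a minimal sketch of the XPath side using only PHP's built-in DOM extension. The inline markup is a made-up stand-in for a fetched product page. Note the quotes around the class name inside the predicate, and that DOMXPath is constructed from the DOMDocument it should query.

```php
<?php
// Hypothetical markup standing in for a fetched product page.
$markup = '<html><body><div class="highslide-gallery">'
        . '<a href="/img/large.jpg">View</a></div></body></html>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// DOMXPath takes the DOMDocument to query as its constructor argument.
$xpath = new DOMXPath($doc);

// Select the href attribute of links inside the gallery div.
$nodes = $xpath->query("//div[@class='highslide-gallery']//a/@href");

// item(0) is null when nothing matched, so check the length before reading textContent.
$href = $nodes->length > 0 ? $nodes->item(0)->textContent : '';

echo $href, PHP_EOL;
```

With Simple HTML DOM the same lookup would be written with a CSS selector, along the lines of $html->find('div.highslide-gallery a', 0)->href, so only one of the two parsers needs to load each page.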

 

The script may be crashing because it is running out of memory. You should check your server's error logs, or enable error display either in php.ini or by adding the following two lines of code at the top of your script:

ini_set('display_errors', 1);
error_reporting(E_ALL);
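
Besides memory, the loop in the code posted earlier in the thread may be worth a look: while ($i <= count($product_link_list)) also runs for index count(), which does not exist, and $i++ sits inside the inner foreach, so it advances once per div.main rather than once per page. The sketch below is one possible restructuring, not the original script. The inline pages and the first_text() helper are hypothetical, and the memory logging is there only to show whether usage keeps growing.

```php
<?php
// Hypothetical stand-ins for fetched pages so the sketch runs without network access.
$pages = array(
    '<html><body><div class="short-description"> First   item </div></body></html>',
    '<html><body><p>No matching element here</p></body></html>',
);

// Return the whitespace-collapsed text of the first match, or '' when nothing matches.
function first_text(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

libxml_use_internal_errors(true);

$results = array();
// foreach visits each entry exactly once, so no index can run past the end of the list.
foreach ($pages as $index => $markup) {
    $doc = new DOMDocument();
    $doc->loadHTML($markup); // the posted script would call loadHTMLFile($url) here
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $results[$index] = first_text($xpath, "//div[@class='short-description']");

    // Log memory per page to see whether usage climbs from one page to the next.
    error_log(sprintf(
        'page %d: %d bytes in use, peak %d',
        $index,
        memory_get_usage(),
        memory_get_peak_usage()
    ));

    // Drop the per-page objects before loading the next page.
    unset($xpath, $doc);
}

print_r($results);
```

If Simple HTML DOM objects are kept instead, that library's clear() method followed by unset() is the usual way to release each page before loading the next one.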

I agree with Ch0cu3r.

Instead of trying to scrape that site into one CSV file, you should save the results to a database and do a systematic scrape.

 

In your case you already have a database with data, so I would clone that database and add extra columns.

Make the scraper refresh each time and start at the lowest ID, scraping the additional data and inserting it into the cloned database.

Add a column that records whether each row has been scraped, so that after a glitch you don't have to start scraping from the beginning again.
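
A rough sketch of that idea, assuming the pdo_sqlite extension is available. The table layout, column names, example URLs and the scrape_short_description() placeholder are all hypothetical; a real setup would point PDO at the cloned database instead of an in-memory one.

```php
<?php
// Assumes pdo_sqlite is installed; an in-memory database keeps the sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table: one row per product page plus a flag recording whether it was scraped.
$db->exec('CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    short_description TEXT,
    scraped INTEGER NOT NULL DEFAULT 0
)');

$insert = $db->prepare('INSERT INTO products (url) VALUES (?)');
foreach (array('http://example.com/p/1', 'http://example.com/p/2') as $url) {
    $insert->execute(array($url));
}

// Placeholder for the real scrape; it returns fixed text here.
function scrape_short_description($url)
{
    return 'description for ' . $url;
}

$next = $db->prepare('SELECT id, url FROM products WHERE scraped = 0 ORDER BY id LIMIT 1');
$save = $db->prepare('UPDATE products SET short_description = ?, scraped = 1 WHERE id = ?');

// Each pass takes the lowest unscraped id, so a restart resumes where the last run stopped.
while (true) {
    $next->execute();
    $row = $next->fetch(PDO::FETCH_ASSOC);
    $next->closeCursor();
    if ($row === false) {
        break; // nothing left to scrape
    }
    $save->execute(array(scrape_short_description($row['url']), $row['id']));
}

$remaining = (int) $db->query('SELECT COUNT(*) FROM products WHERE scraped = 0')->fetchColumn();
echo "unscraped rows left: $remaining", PHP_EOL;
```

Because the flag is set in the same UPDATE that stores the scraped value, a run that dies partway leaves the finished rows marked and the rest untouched.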

