max_maggot Posted July 22, 2015

Hi all,

I am writing a script to scrape a website. The client wants extra details captured from the website, so I've had to use XPath as well as the HTML DOM to retrieve these values. I'm getting an error when the while loop moves from 0 to 1; the code is shown below. I know it seems like a lot of code, but I believe the error is in the creation/destruction of the DOMXPath. The rest of the code is for the scrape. It moves through a list of product pages for a category of products, scrapes the appropriate data, then moves on to the next category and the product pages associated with it.

The program dies on the line $xpath = new DOMXPath($html_x); when the $i variable reaches 1 (second time running through the loop). If you need the whole source code, I can PM you. This has been driving me crazy all day.

Thanks for any help you can provide. It is very much appreciated.

function get_product_details($product_link_list) {
    //Open CSV file
    //a+ opens the file for writing and placing the pointer at the end of the file to append new data
    //If the file does not exist a+ will try to create the file products.csv
    //Appending the data to this file happens later in this function.

    //looping variable
    $i = 0;
    global $file_handle;
    $html = new DOMDocument();
    $html_x= new DOMXPath();

    //Load DOM of product page
    //@$html->loadHtmlFile($category_sub_page_list[$i]);
    while ($i <= count($product_link_list)) {
        //loop through each of the product details pages and scrape data {
        $html->loadHTMLFile($product_link_list[$i]);
        $xpath = new DOMXPath($html_x);
        $csv_details = "";
        $html = file_get_html($product_link_list[$i]);
        $items = array();
        //TODO: title is wrong, finish scraped values, Fix up headings at top of code.
        foreach ($html->find('div.main') as $article) {
            //capture content from website
            $item['title'] = $article->find('li.product', 0)->plaintext;
            $item['sku'] = $article->find('div.sku-no', 0)->plaintext;
            $item['price'] = $article->find('div.price-box', 0)->plaintext;
            //Capture HTML code and content for description
            $item['description'] = $article->find('div.std', 0)->outertext;
            $items['categories'] = $xpath->query("//a[@class='in-category']")->item(0)->textContent;
            $items['image'] = $xpath->query("//img/@src")->item(0)->textContent; //Get and Set Product Image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent;
            $items['media_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent;
            $items['meta_description'] = $xpath->query("//meta[@name='description']/@content")->item(0)->textContent;
            $items['meta_keyword'] = $xpath->query("//meta[@name='keywords']/@content")->item(0)->textContent; //Get and Set meta keywords
            $item['name'] = $article->find('li.product', 0)->plaintext;
            $items['short_description'] = $xpath->query("//div[@class='short-description']")->item(0)->textContent; //Get and Set short description
            $items['small_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent; //Get and Set small image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent; //Get and Set thumbnail label
            $items['weight'] = $xpath->query("//td[@class='et7']/text()")->item(0)->textContent;

            //Trim and fix values
            $item['title'] = str_replace(' ', '', $item['title']);
            $item['sku'] = str_replace('SKU:', '', $item['sku']);
            $item['price'] = str_replace(' ', '', $item['price']);
            $item['name'] = str_replace(' ', '', $item['price']);

            //Remove HTML code from the scraped data.
            $items['short_description'] = trim(preg_replace('/\s\s+/', ' ', $items['short_description']));
            $items['thumbnail_label'] = trim(preg_replace('/\s\s+/', ' ', $items['thumbnail_label']));

            //Assign values to temporary variables for writing to CSV
            $sku = $item['sku']; //Scrape
            $title = $item['title']; //Scrape
            $_store = " "; //Default value
            $_attribute_set = "Default"; //Default value
            $_type = "Simple"; //Default value
            $_category = $items['categories']; //Scrape
            $_root_category = "Default Category"; //Default value
            $_product_websites = "base"; //Default value
            $color = " "; //Default value
            $cost = $item['price']; //Scrape
            $country_of_manufacture = " "; //Set Value - No tag to get this
            $created_at = " "; //Default value
            $custom_design = "99"; //Default value
            $custom_design_from = "1"; //Default value
            $custom_design_to = " "; //Default value
            $custom_layout_update = " "; //Default Value
            $description = $item['description']; //Scrape
            $gallery = " "; //Default Value
            $gift_message_available = " "; //Default Value
            $has_options = "1"; //Default value
            $image = $items['image']; //Scrape
            $image_label = $items['thumbnail_label']; //Scrape
            $manufacturer = " "; //Default - cannot scrape
            $media_gallery = $items['_media_image']; //Scrape
            $meta_description = $items['meta_description']; //Scrape
            $meta_keyword = $items['meta_keyword']; //Scrape
            $meta_title = $item['title']; //Scrape
            $minimal_price = " "; //Default Value
            $msrp = "0"; //Default value
            $msrp_display_actual_price_type = "100"; //Default value
            $msrp_enabled = "0"; //Default value
            $name = $item['name']; //Scrape
            $news_from_date = " "; //Default value
            $news_to_date = " "; //Default value
            $options_container = "1"; //Default value
            $page_layout = " "; //Default value
            $price = $item['price']; //Scrape
            $required_options = "0"; //Default value
            $short_description = $items['short_description']; //Scrape
            $small_image = $items['small_image']; //Scrape
            $small_image_label = $items['thumbnail_label']; //Scrape
            $special_from_date = " "; //Default value
            $special_price = " "; //Default value
            $special_to_date = " "; //Default value
            $status = "1"; //Default value
            $tax_class_id = "1"; //Default value
            $thumbnail = "0"; //Default value
            $thumbnail_label = "1"; //Default value
            $updated_at = "0"; //Default value
            $url_key = "0"; //Default value
            $url_path = " "; //Default value
            $visibility = " "; //Default value
            $weight = $items['weight']; //Scrape and Ramon has Work to do
            $qty = "100"; //Default value
            $min_qty = " "; //Default value
            $use_config_min_qty = " "; //Default value
            $is_qty_decimal = " "; //Default value
            $backorders = " "; //Default value
            $use_config_backorders = " "; //Default value
            $min_sale_qty = " "; //Default value
            $use_config_min_sale_qty = " "; //Default value
            $max_sale_qty = " "; //Default value
            $use_config_max_sale_qty = " "; //Default value
            $is_in_stock = "1"; //Default value
            $notify_stock_qty = " "; //Default value
            $use_config_notify_stock_qty = " "; //Default value
            $manage_stock = "88"; //Default value
            $use_config_manage_stock = "0"; //Default value
            $stock_status_changed_auto = " "; //Default value
            $use_config_qty_increments = "1"; //Default value
            $qty_increments = " "; //Default value
            $use_config_enable_qty_inc = " "; //Default value
            $is_decimal_divided = " "; //Default value
            $_links_related_sku = " "; //Default value
            $_links_related_position = " "; //Default value
            $_links_crosssell_sku = " "; //Default value
            $_links_crosssell_position = " "; //Default value
            $_links_upsell_sku = " "; //Default value
            $_links_upsell_position = " "; //Default value
            $_associated_sku = "0"; //Default value
            $_associated_default_qty = " "; //Default value
            $_associated_position = "0"; //Default value
            $_tier_price_website = " "; //Default value
            $_tier_price_customer_group = " "; //Default value
            $_tier_price_qty = " "; //Default value
            $_tier_price_price = " "; //Default value
            $_group_price_website = " "; //Default value
            $_group_price_customer_group = " "; //Default value
            $_group_price_price = " "; //Default value
            $_media_attribute_id = " "; //Default value
            $_media_image = " "; //Default value
            $_media_label = " "; //Default value
            $_media_position = " "; //Default value
            $_media_is_disabled = " "; //Default value
            $_custom_option_store = " "; //Default value
            $_custom_option_type = " "; //Default value
            $_custom_option_title = " "; //Default value
            $_custom_option_is_required = " "; //Default value
            $_custom_option_price = " "; //Default value
            $_custom_option_sku = " "; //Default value
            $_custom_option_max_characters = " "; //Default value
            $_custom_option_sort_order = " "; //Default value
            $_custom_option_row_title = " "; //Default value
            $_custom_option_row_price = " "; //Default value
            $_custom_option_row_sku = " "; //Default value
            $_custom_option_row_sort = " "; //Default value
            $enable_config_enable_qty_inc = " "; //Default value
            $enable_qty_inc = " "; //Default value

            //Append data to CSV file
            $csv_details .= $sku . "," . $title . "," . $_store . "," . $_attribute_set . "," .
                $_type . "," . $_category . "," . $_root_category . "," . $_product_websites . "," .
                $color . "," . $cost . "," . $country_of_manufacture . "," . $created_at . "," .
                $custom_design . "," . $custom_design_from . "," . $custom_design_to . "," . "," .
                $custom_layout_update . "," . $description . "," . $gallery . "," .
                $gift_message_available . "," . $has_options . "," . $image . "," . $image_label . "," .
                $manufacturer . "," . $media_gallery . "," . $meta_description . "," . $meta_keyword . "," .
                $meta_title . "," . $minimal_price . "," . $msrp . "," .
                $msrp_display_actual_price_type . "," . $msrp_enabled . "," . $name . "," .
                $news_from_date . "," . $news_to_date . "," . $options_container . "," . $page_layout . "," .
                $price . "," . $required_options . "," . $short_description . "," . $small_image . "," .
                $small_image_label . "," . $special_from_date . "," . $special_price . "," .
                $special_to_date . "," . $status . "," . $tax_class_id . "," . $thumbnail . "," .
                $thumbnail_label . "," . $updated_at . "," . $url_key . "," . $url_path . "," .
                $visibility . "," . $weight . "," . $qty . "," . $min_qty . "," .
                $use_config_min_qty . "," . $is_qty_decimal . "," . $backorders . "," .
                $use_config_backorders . "," . $min_sale_qty . "," . $use_config_min_sale_qty . "," .
                $max_sale_qty . "," . $use_config_max_sale_qty . "," . $is_in_stock . "," .
                $notify_stock_qty . "," . $use_config_notify_stock_qty . "," . $manage_stock . "," .
                $use_config_manage_stock . "," . $stock_status_changed_auto . "," .
                $use_config_qty_increments . "," . $qty_increments . "," . $use_config_enable_qty_inc . "," .
                $qty_increments . "," . $use_config_qty_increments . "," . "," . $is_decimal_divided . "," .
                $_links_related_sku . "," . $_links_related_position . "," . $_links_crosssell_position . "," .
                $_links_crosssell_sku . "," . $_links_crosssell_position . "," . $_links_upsell_sku . "," .
                $_links_upsell_position . "," . $_associated_sku . "," . $_associated_default_qty . "," .
                $_associated_position . "," . $_tier_price_website . "," . $_tier_price_customer_group . "," .
                $_tier_price_qty . "," . $_tier_price_price . "," . $_group_price_website . "," .
                $_group_price_customer_group . "," . $_group_price_price . "," . $_media_attribute_id . "," .
                $_media_image . "," . $_media_label . "," . $_media_position . "," .
                $_media_is_disabled . "," . $_custom_option_store . "," . $_custom_option_type . "," .
                $_custom_option_title . "," . $_custom_option_is_required . "," . $_custom_option_price . "," .
                $_custom_option_sku . "," . $_custom_option_max_characters . "," .
                $_custom_option_sort_order . "," . $_custom_option_row_sort . "," .
                $_custom_option_row_title . "," . $_custom_option_row_price . "," .
                $_custom_option_row_sku . "," . $_custom_option_row_sort . "," .
                $enable_config_enable_qty_inc . "," . $enable_qty_inc . "\r\n";
            fwrite($file_handle, $csv_details);
            //move to next product category page
            $i++;
        }
    }
}
max_maggot (Author) Posted July 23, 2015

OK, fixed the initial problem. See the code snippet below. The code is still crashing out after it completes its task. Any help is much appreciated.

//This function takes all product pages for a category and opens each of the pages individually
//It then scrapes all the information about that product and stores the details in a CSV file
function get_product_details($product_link_list) {
    //Open CSV file
    //a+ opens the file for writing and placing the pointer at the end of the file to append new data
    //If the file does not exist a+ will try to create the file products.csv
    //Appending the data to this file happens later in this function.

    //looping variable
    $i = 0;
    global $file_handle;

    //Load DOM of product page
    //@$html->loadHtmlFile($category_sub_page_list[$i]);
    while ($i <= count($product_link_list)) {
        //loop through each of the product details pages and scrape data {
        $html = new DOMDocument();
        $html->loadHTMLFile($product_link_list[$i]);
        $xpath = new DOMXPath($html);
        $csv_details = "";
        $html = file_get_html($product_link_list[$i]);
        $items = array();
        //TODO: title is wrong, finish scraped values, Fix up headings at top of code.
        foreach ($html->find('div.main') as $article) {
            //capture content from website
            $item['title'] = $article->find('li.product', 0)->plaintext;
            $item['sku'] = $article->find('div.sku-no', 0)->plaintext;
            $item['price'] = $article->find('div.price-box', 0)->plaintext;
            //Capture HTML code and content for description
            $item['description'] = $article->find('div.std', 0)->outertext;
            $items['categories'] = $xpath->query("//a[@class='in-category']")->item(0)->textContent;
            $items['image'] = $xpath->query("//img/@src")->item(0)->textContent; //Get and Set Product Image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent;
            $items['media_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent;
            $items['meta_description'] = $xpath->query("//meta[@name='description']/@content")->item(0)->textContent;
            $items['meta_keyword'] = $xpath->query("//meta[@name='keywords']/@content")->item(0)->textContent; //Get and Set meta keywords
            $item['name'] = $article->find('li.product', 0)->plaintext;
            $items['short_description'] = $xpath->query("//div[@class='short-description']")->item(0)->textContent; //Get and Set short description
            $items['small_image'] = $xpath->query("//div[@class='highslide-gallery']//a/@href")->item(0)->textContent; //Get and Set small image
            $items['thumbnail_label'] = $xpath->query("//div[@class='highslide-caption']")->item(0)->textContent; //Get and Set thumbnail label
            $items['weight'] = $xpath->query("//td[@class='et7']/text()")->item(0)->textContent;

            //Trim and fix values
            $item['title'] = str_replace(' ', '', $item['title']);
            $item['sku'] = str_replace('SKU:', '', $item['sku']);
            $item['price'] = str_replace(' ', '', $item['price']);
            $item['name'] = str_replace(' ', '', $item['price']);

            //Remove HTML code from the scraped data.
            $items['short_description'] = trim(preg_replace('/\s\s+/', ' ', $items['short_description']));
            $items['thumbnail_label'] = trim(preg_replace('/\s\s+/', ' ', $items['thumbnail_label']));

            //Assign values to temporary variables for writing to CSV
            $sku = $item['sku']; //Scrape
            $title = $item['title']; //Scrape
            $_store = " "; //Default value
            $_attribute_set = "Default"; //Default value
            $_type = "Simple"; //Default value
            $_category = $items['categories']; //Scrape
            $_root_category = "Default Category"; //Default value
            $_product_websites = "base"; //Default value
            $color = " "; //Default value
            $cost = $item['price']; //Scrape
            $country_of_manufacture = " "; //Set Value - No tag to get this
            $created_at = " "; //Default value
            $custom_design = "99"; //Default value
            $custom_design_from = "1"; //Default value
            $custom_design_to = " "; //Default value
            $custom_layout_update = " "; //Default Value
            $description = $item['description']; //Scrape
            $gallery = " "; //Default Value
            $gift_message_available = " "; //Default Value
            $has_options = "1"; //Default value
            $image = $items['image']; //Scrape
            $image_label = $items['thumbnail_label']; //Scrape
            $manufacturer = " "; //Default - cannot scrape
            $media_gallery = $items['_media_image']; //Scrape
            $meta_description = $items['meta_description']; //Scrape
            $meta_keyword = $items['meta_keyword']; //Scrape
            $meta_title = $item['title']; //Scrape
            $minimal_price = " "; //Default Value
            $msrp = "0"; //Default value
            $msrp_display_actual_price_type = "100"; //Default value
            $msrp_enabled = "0"; //Default value
            $name = $item['name']; //Scrape
            $news_from_date = " "; //Default value
            $news_to_date = " "; //Default value
            $options_container = "1"; //Default value
            $page_layout = " "; //Default value
            $price = $item['price']; //Scrape
            $required_options = "0"; //Default value
            $short_description = $items['short_description']; //Scrape
            $small_image = $items['small_image']; //Scrape
            $small_image_label = $items['thumbnail_label']; //Scrape
            $special_from_date = " "; //Default value
            $special_price = " "; //Default value
            $special_to_date = " "; //Default value
            $status = "1"; //Default value
            $tax_class_id = "1"; //Default value
            $thumbnail = "0"; //Default value
            $thumbnail_label = "1"; //Default value
            $updated_at = "0"; //Default value
            $url_key = "0"; //Default value
            $url_path = " "; //Default value
            $visibility = " "; //Default value
            $weight = $items['weight']; //Scrape and Ramon has Work to do
            $qty = "100"; //Default value
            $min_qty = " "; //Default value
            $use_config_min_qty = " "; //Default value
            $is_qty_decimal = " "; //Default value
            $backorders = " "; //Default value
            $use_config_backorders = " "; //Default value
            $min_sale_qty = " "; //Default value
            $use_config_min_sale_qty = " "; //Default value
            $max_sale_qty = " "; //Default value
            $use_config_max_sale_qty = " "; //Default value
            $is_in_stock = "1"; //Default value
            $notify_stock_qty = " "; //Default value
            $use_config_notify_stock_qty = " "; //Default value
            $manage_stock = "88"; //Default value
            $use_config_manage_stock = "0"; //Default value
            $stock_status_changed_auto = " "; //Default value
            $use_config_qty_increments = "1"; //Default value
            $qty_increments = " "; //Default value
            $use_config_enable_qty_inc = " "; //Default value
            $is_decimal_divided = " "; //Default value
            $_links_related_sku = " "; //Default value
            $_links_related_position = " "; //Default value
            $_links_crosssell_sku = " "; //Default value
            $_links_crosssell_position = " "; //Default value
            $_links_upsell_sku = " "; //Default value
            $_links_upsell_position = " "; //Default value
            $_associated_sku = "0"; //Default value
            $_associated_default_qty = " "; //Default value
            $_associated_position = "0"; //Default value
            $_tier_price_website = " "; //Default value
            $_tier_price_customer_group = " "; //Default value
            $_tier_price_qty = " "; //Default value
            $_tier_price_price = " "; //Default value
            $_group_price_website = " "; //Default value
            $_group_price_customer_group = " "; //Default value
            $_group_price_price = " "; //Default value
            $_media_attribute_id = " "; //Default value
            $_media_image = " "; //Default value
            $_media_label = " "; //Default value
            $_media_position = " "; //Default value
            $_media_is_disabled = " "; //Default value
            $_custom_option_store = " "; //Default value
            $_custom_option_type = " "; //Default value
            $_custom_option_title = " "; //Default value
            $_custom_option_is_required = " "; //Default value
            $_custom_option_price = " "; //Default value
            $_custom_option_sku = " "; //Default value
            $_custom_option_max_characters = " "; //Default value
            $_custom_option_sort_order = " "; //Default value
            $_custom_option_row_title = " "; //Default value
            $_custom_option_row_price = " "; //Default value
            $_custom_option_row_sku = " "; //Default value
            $_custom_option_row_sort = " "; //Default value
            $enable_config_enable_qty_inc = " "; //Default value
            $enable_qty_inc = " "; //Default value

            //Append data to CSV file
            $csv_details .= $sku . "," . $title . "," . $_store . "," . $_attribute_set . "," .
                $_type . "," . $_category . "," . $_root_category . "," . $_product_websites . "," .
                $color . "," . $cost . "," . $country_of_manufacture . "," . $created_at . "," .
                $custom_design . "," . $custom_design_from . "," . $custom_design_to . "," . "," .
                $custom_layout_update . "," . $description . "," . $gallery . "," .
                $gift_message_available . "," . $has_options . "," . $image . "," . $image_label . "," .
                $manufacturer . "," . $media_gallery . "," . $meta_description . "," . $meta_keyword . "," .
                $meta_title . "," . $minimal_price . "," . $msrp . "," .
                $msrp_display_actual_price_type . "," . $msrp_enabled . "," . $name . "," .
                $news_from_date . "," . $news_to_date . "," . $options_container . "," . $page_layout . "," .
                $price . "," . $required_options . "," . $short_description . "," . $small_image . "," .
                $small_image_label . "," . $special_from_date . "," . $special_price . "," .
                $special_to_date . "," . $status . "," . $tax_class_id . "," . $thumbnail . "," .
                $thumbnail_label . "," . $updated_at . "," . $url_key . "," . $url_path . "," .
                $visibility . "," . $weight . "," . $qty . "," . $min_qty . "," .
                $use_config_min_qty . "," . $is_qty_decimal . "," . $backorders . "," .
                $use_config_backorders . "," . $min_sale_qty . "," . $use_config_min_sale_qty . "," .
                $max_sale_qty . "," . $use_config_max_sale_qty . "," . $is_in_stock . "," .
                $notify_stock_qty . "," . $use_config_notify_stock_qty . "," . $manage_stock . "," .
                $use_config_manage_stock . "," . $stock_status_changed_auto . "," .
                $use_config_qty_increments . "," . $qty_increments . "," . $use_config_enable_qty_inc . "," .
                $qty_increments . "," . $use_config_qty_increments . "," . "," . $is_decimal_divided . "," .
                $_links_related_sku . "," . $_links_related_position . "," . $_links_crosssell_position . "," .
                $_links_crosssell_sku . "," . $_links_crosssell_position . "," . $_links_upsell_sku . "," .
                $_links_upsell_position . "," . $_associated_sku . "," . $_associated_default_qty . "," .
                $_associated_position . "," . $_tier_price_website . "," . $_tier_price_customer_group . "," .
                $_tier_price_qty . "," . $_tier_price_price . "," . $_group_price_website . "," .
                $_group_price_customer_group . "," . $_group_price_price . "," . $_media_attribute_id . "," .
                $_media_image . "," . $_media_label . "," . $_media_position . "," .
                $_media_is_disabled . "," . $_custom_option_store . "," . $_custom_option_type . "," .
                $_custom_option_title . "," . $_custom_option_is_required . "," . $_custom_option_price . "," .
                $_custom_option_sku . "," . $_custom_option_max_characters . "," .
                $_custom_option_sort_order . "," . $_custom_option_row_sort . "," .
                $_custom_option_row_title . "," . $_custom_option_row_price . "," .
                $_custom_option_row_sku . "," . $_custom_option_row_sort . "," .
                $enable_config_enable_qty_inc . "," . $enable_qty_inc . "\r\n";
            fwrite($file_handle, $csv_details);
            //move to next product category page
            $i++;
        }
    }
}
Ch0cu3r Posted July 23, 2015 (Solution)

If you are using Simple HTML DOM, there is no need to use XPath as well. Use one or the other: both return the same result, they just use different syntax for grabbing the data you require. The XPath //div[@class='highslide-gallery'] is the equivalent of div.highslide-gallery in Simple HTML DOM (it uses CSS selectors as the DOM path).

The script may be crashing because it is running out of memory. You should check your server's error logs, or enable error display either in php.ini or by adding the following two lines of code at the top of your script:

ini_set('display_errors', 1);
error_reporting(E_ALL);
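To illustrate the "use one or the other" advice, here is a minimal sketch that relies on DOMXPath alone. It is not the code from the thread: the helper name first_text and the inline sample markup are assumptions made for illustration. The helper also returns an empty string when a query matches nothing, instead of calling ->textContent on a missing node.

```php
<?php
// Hypothetical helper (not from the thread): return the trimmed text of the
// first node matching $query, or '' when nothing matches.
function first_text(DOMXPath $xpath, string $query): string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

// Inline sample markup standing in for a product page (an assumption).
$markup = '<html><head><meta name="description" content="A sample widget"></head>'
        . '<body><div class="short-description"> Small widget </div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

// One DOMXPath per loaded document; no Simple HTML DOM involved.
$xpath = new DOMXPath($doc);
$meta  = first_text($xpath, "//meta[@name='description']/@content");
$short = first_text($xpath, "//div[@class='short-description']");
$none  = first_text($xpath, "//div[@class='highslide-caption']"); // absent in the sample
```

In a real scraper the document would come from loadHTMLFile() on each product URL, with a fresh DOMXPath built from that document on every iteration.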
QuickOldCar Posted July 23, 2015

I agree with Ch0cu3r.

Instead of trying to scrape that site into one CSV file all at once, you should save the results to a database and do a systematic scrape. In your case you already have a database with data; I would clone that database and add extra columns. Make the scraper refresh each time and start at the lowest id, scraping the additional data and inserting it into the cloned database. Have an additional column that records whether a row has been scraped, so that if there is a glitch you don't have to start scraping from the beginning again.
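The resumable approach described above might be sketched as follows. Everything here is an assumption for illustration: the products table, its url, weight and scraped columns, the helper names, and the in-memory SQLite database (which needs the pdo_sqlite extension) standing in for the cloned database. The actual scrape is replaced by a placeholder value.

```php
<?php
// In-memory SQLite stands in for the cloned database (assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    weight TEXT,
    scraped INTEGER NOT NULL DEFAULT 0
)');

$insert = $db->prepare('INSERT INTO products (url) VALUES (?)');
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    $insert->execute([$url]);
}

// Hypothetical helper: fetch the unscraped row with the lowest id, or null.
function next_unscraped(PDO $db): ?array
{
    $row = $db->query(
        'SELECT id, url FROM products WHERE scraped = 0 ORDER BY id LIMIT 1'
    )->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

// Hypothetical helper: store the extra data and flag the row as done.
function mark_scraped(PDO $db, int $id, string $weight): void
{
    $stmt = $db->prepare('UPDATE products SET weight = ?, scraped = 1 WHERE id = ?');
    $stmt->execute([$weight, $id]);
}

$processed = [];
while (($row = next_unscraped($db)) !== null) {
    // A real run would scrape $row['url'] here; 'unknown' is a placeholder.
    mark_scraped($db, (int) $row['id'], 'unknown');
    $processed[] = $row['url'];
}
```

Because each row is flagged as soon as it is stored, a run that dies halfway can be restarted and will pick up at the first row still marked as unscraped.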
max_maggot (Author) Posted July 24, 2015

Thank you both for these replies. I made the changes, and part of the issue was the mixing of XPath and HTML DOM. The other issue was that the loop condition was incorrect: I should have had < rather than <=.

Thanks again.
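The loop-bound fix can be shown with a short sketch. The sample list below is hypothetical. With <=, the last iteration reads index count($product_link_list), which does not exist; with <, the loop stops at the last valid index. A foreach avoids the index altogether.

```php
<?php
// Hypothetical list of product pages.
$product_link_list = ['page-1.html', 'page-2.html', 'page-3.html'];

// Index-based loop: valid indexes run from 0 to count - 1, so use <, not <=.
$visited = [];
$count = count($product_link_list);
for ($i = 0; $i < $count; $i++) {
    $visited[] = $product_link_list[$i];
}

// Equivalent foreach: no index to get wrong.
$viaForeach = [];
foreach ($product_link_list as $link) {
    $viaForeach[] = $link;
}
```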