ryanlitwiller Posted March 25, 2014

OK, I am a complete noob! I have done many searches and believe I have done fairly well piecing together my script.

I use a forum that has a gear swap section for buying used goods. The problem is that if you try to search for specific goods, it searches the entire forum. So I found the simple_html_dom class, which has the file_get_html method, and was able to select only the titles of the listings. I had no problem displaying these listings and then populating a db.

Now I want to use if and else-if statements along with regex to grab titles with specific keywords and put them in a corresponding column in my db, which I have been unsuccessful at. I'd also like to eventually make my db searchable on my site.

I'm sure my code could be cleaned up in just about every area, so if anybody wants to chime in on any part of it, please feel free. It will be much appreciated. I have put in many hours and I would like to know that I'm building something on the right footing, lol.

You can view my webpage at http://php-ryanlitwiller.rhcloud.com/. On my page the titles still have their hyperlinks, but they try to navigate my server. Any ideas on how to make them reach the original site?

<?php
// Open a MySQL connection
$link = mysql_connect('127.6.146.130:3306', 'xxxxxxxxxx', 'xxxxxxxxxx');
if(!$link) {
    die('Connection failed: ' . mysql_error());
}

// Select the database to work with
$db = mysql_select_db('test');
if(!$db) {
    die('Selected database unavailable: ' . mysql_error());
}

// import simple_html_dom.php to give me various methods for website selection and scraping
include('simple_html_dom.php');

// get DOM from BPL URL
$html = file_get_html('http://www.backpackinglight.com/cgi-bin/backpackinglight/forums/display_forum.html?forum=19');

// find all td tags with class=forum_listing
foreach($html->find('td.forum_listing') as $tdTagExt)
    //grab just a tags
    foreach($tdTagExt->find('a')as $aTagExt){
        //print selected outertext from previous selectors
        $refinedTitle = $tdTagExt->outertext;
        //display nobull listing of goods
        echo $refinedTitle . '<br>';

        //find tent goods using regex to check for the word tent
        if(preg_match_all('/tent/', $refinedTitle)){
            // add matches to corresponding sql column
            $sql = "insert into `bp` (`tent`) values ('$refinedTitle')";
            $result = mysql_query($sql);
        //find sleeping bags using regex to check for the word bag
        } else if(preg_match_all('/bag/', $refinedTitle)){
            $sql1 = "insert into `bp` (`bag`) values ('$refinedTitle')";
            $result1 = mysql_query($sql1);
        //find boots using regex to check for the word boot or shoes
        } else if(preg_match_all('/boot|shoes/', $aTagExt->innertext)){
            $sql2 = "insert into `bp` (`boot`) values ('$aTagExt->innertext')";
            $result2 = mysql_query($sql2);
        //find clothing goods using regex to check for any of the words shirt|pants|parka|shorts|jacket
        } else if(preg_match_all('/shirt|pants|parka|shorts|jacket/', $aTagExt->innertext)){
            $sql3 = "insert into `bp` (`clothing`) values ('$aTagExt->innertext')";
            $result3 = mysql_query($sql3);
        } else {
            // Create and execute a MySQL query
            $sql4 = "insert into `bp` (`ahref`) values ('$aTagExt->innertext')";
            $result4 = mysql_query($sql4);
        }
    }

// Close the connection
mysql_close($link);
?>
QuickOldCar Posted March 25, 2014

You have to fix their relative links; the links you are finding do not have the domain or subdomain in them. I don't know what your select query or variables look like. For a simple, fast fix, prepend their domain to the link when you display it:

$link = "http://www.backpackinglight.com" . $link;
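The prepend-the-domain fix above can be sketched as a small helper. This is only an illustration: the function name `absolutize_href` is hypothetical, it assumes PHP 7 or later for the type declarations, and it assumes the scraped hrefs are root-relative paths on www.backpackinglight.com.

```php
<?php
// Hypothetical helper (not from the thread): turn a scraped href into an absolute URL.
function absolutize_href(string $href, string $base = 'http://www.backpackinglight.com'): string
{
    // Links that already carry a scheme are returned untouched.
    if (preg_match('~^https?://~i', $href)) {
        return $href;
    }
    // Join base and path with exactly one slash between them.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

echo absolutize_href('/cgi-bin/backpackinglight/forums/display_forum.html?forum=19'), "\n";
```

Checking for an existing scheme first means the helper can be applied to every link without doubling the domain on links that are already absolute.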
ryanlitwiller Posted March 25, 2014

Great, thanks! One problem down. Does anybody else have any other suggestions?
QuickOldCar Posted March 25, 2014

When scraping data from sites, you should get the owner's permission. Not only do they frown on people mass-hitting their sites or using their data, but imagine spending lots of time on something and then they just block you one day. It would be a lot easier if they supplied individual feeds or an API for you to access.

I can understand why you would want your own search: they use a custom Google search, and it searches everything. I have no idea what your intentions are with the data you will get, whether it's to find stuff to purchase for yourself or to better the community. If it's the latter, why not talk to them and work on building a real site search using something like Sphinx, or a full-text search using categories/forum sections?

Regardless of what you do, still look into it for your own projects: save all the data and perform any sort of advanced search on it. Use link, title, url, category as the columns in a database table, and make the url unique so there aren't duplicates.

The reason I say this is that you can have a single select query displaying the data, with the values in the select statement coming dynamically from a dropdown or search keywords. To me it's a lot easier to just add a WHERE category='$category' if any categories were selected, and otherwise never add the WHERE clause and show all results.

Ask if you need more help or info on anything I said; there's no sense going into more detail if it's something you're not interested in doing.
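The optional WHERE clause described above could look roughly like the sketch below. The table name `listings` and the function `build_listing_query` are assumptions made for illustration (only the column names come from the post), and the nullable type declarations assume PHP 7.1 or later. It uses `?` placeholders rather than interpolating `$category` into the SQL, so titles containing apostrophes cannot break the query.

```php
<?php
// Sketch only: the table name `listings` and this function are hypothetical.
// Returns the SQL text and the values to bind to its ? placeholders.
function build_listing_query(?string $category = null, ?string $keyword = null): array
{
    $sql    = 'SELECT link, title, url, category FROM listings';
    $where  = [];
    $params = [];

    // Add a condition only when the visitor actually picked a category.
    if ($category !== null && $category !== '') {
        $where[]  = 'category = ?';
        $params[] = $category;
    }
    // Same idea for a free-text keyword typed into a search box.
    if ($keyword !== null && $keyword !== '') {
        $where[]  = 'title LIKE ?';
        $params[] = '%' . $keyword . '%';
    }
    // With no conditions, the WHERE clause is left out and all rows are returned.
    if ($where) {
        $sql .= ' WHERE ' . implode(' AND ', $where);
    }
    return [$sql, $params];
}

list($sql, $params) = build_listing_query('tent');
echo $sql, "\n";
```

The returned pair is meant to be handed to a prepared statement, for example `$pdo->prepare($sql)` followed by `execute($params)` on the resulting statement when using PDO.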
ryanlitwiller Posted March 25, 2014

Thanks, QuickOldCar, for all your insight! My reasoning for this project is primarily educational. I'm a senior at a local university, graduating with an IT degree that specialized in networking. As I scour the job market I can't help but notice all the software engineer jobs out there, and programming was a subject I always swore off in my younger days.

I have an HTML class, and our final project is a refined site. I talked to the instructor and he gladly allowed me to take on server-side scripting, so I tried to find a project where I could use server-side scripting and have it be something I could actually use or be interested in. This seemed like a good fit, as I've always gone on this site and thought, man, there are some great deals on here, but there has got to be a better way.

Really, I just want a project that will challenge me and let me implement and learn as many PHP concepts as possible. I thought that once I figure out how to properly populate my db and start running queries, maybe I could even set up email alerts for specific items I need. Also, if I can create something that would make this section more helpful to other users, I would gladly hand my project over.

As I read your section about setting up the db, it is becoming a little clearer to me how my concept of the db is wrong. I am having a bit of trouble manipulating all the data from file_get_html; when I print back the content it seems like a complex structure of arrays, and I'm still trying to understand arrays more clearly.

Also, QuickOldCar, I am interested in anything you want to inform me on! Thanks again!
QuickOldCar Posted March 25, 2014

Have a look at how I separated the href links from the titles and prepended the site's domain.

foreach($html->find('td.forum_listing') as $tdTagExt){
    foreach($tdTagExt->find('a')as $aTagExt){
        $title = $aTagExt->plaintext;
        $href = "http://www.backpackinglight.com".$aTagExt->href;
        echo "<a href='$href' target='_blank'>$title</a><br />";
    }
}
Solution QuickOldCar Posted March 25, 2014

Now let's look into making this a new array, and we will also add preg_match to it for categories.

// collect the results here so the second loop always has an array to read
$links = array();

// find all td tags with class=forum_listing
foreach($html->find('td.forum_listing') as $tdTagExt){
    foreach($tdTagExt->find('a')as $aTagExt){
        $title = $aTagExt->plaintext;
        $href = "http://www.backpackinglight.com".$aTagExt->href;
        //associating the link and title in a new array
        $links[] = array("title" => "$title","href" => "$href");
        //echo "<a href='$href' target='_blank'>$title</a><br />";
    }
}

//access the new array
foreach($links as $link){
    $category = "uncategorized";
    if(preg_match("~tent~i", $link['title'])){
        $category = "tent";
    }
    if(preg_match("~bag~i", $link['title'])){
        $category = "bag";
    }
    if(preg_match("~shoe|footwear|boot|sneaker~i", $link['title'])){
        $category = "shoe";
    }
    if(preg_match("~shirt|pants|parka|shorts|jacket~i", $link['title'])){
        $category = "clothing";
    }
    echo "$category : <a href='".$link['href']."' target='_blank'>".$link['title']."</a><br />";
}

I ran this, and here is what the results looked like:

uncategorized : FS or FT: Salomon XA 3D Ultra 2 Trail Runners (size 10.5)
uncategorized : FS: Exped Downmat 7 Short (47")
uncategorized : Wtb
uncategorized : WTB >8oz Pack
uncategorized : WTB Cuben Bivy
uncategorized : Six moon designs, terra nova, jacks are better, borah gear, and more
bag : FS: Montbell #3 & #5 bags
bag : WTB 15 Degree Sleeping Bag
bag : FS: Clearing house - bags, bivy, tarp, daypack, tent
uncategorized : 22oz. 20 degree therm-a-rest sleep system.
uncategorized : FS: Fox River Gripper Gloves Large
bag : FS: Mont Bell UL Spiral Down Hugger Sleeping Bag #1 (15 degree) 6 ft. length
uncategorized : FS: Zpacks Hexamid Long Tarp
tent : WTB: Tarptent Contrail
uncategorized : Rota Lacura Clarkii Rod (now w/ line and flies), Petzl Tikka XP 2 w/ Core
uncategorized : WTB: Katabatic Palisade or Zpacks quilt
uncategorized : FS Cold Cold World Ozone Pack $55
uncategorized : WTB: Deuter ACT Zero 45 + 15L SL or ACT Lite 40 + 10L SL
uncategorized : FS: La Sportiva Quantum Trail Size 11.5-12
uncategorized : FS: 2013 EE RevX 30F quilt w/ 1oz overstuff, excellent condition
clothing : WTB Golite Bitterroot or Selkirk Jacket; sub 6oz bivy; GG LT4 poles
uncategorized : FS: Nemo Obi 2P
uncategorized : FS: Patagonia Retro-X Windproof Fleece, Men's L, 2012
uncategorized : FS - Lightheart Gear Solo Cuben
tent : WTB: Tarptent Rainshadow 2

As you can see, it would be insane to try to categorize every possible word they can use in their titles. You are better off not trying to categorize them; save your data and let your search do the work.
ryanlitwiller Posted March 26, 2014

WOW! I can't thank you enough, QuickOldCar. So I see that I was using preg_match incorrectly? I see you used

if(preg_match("~tent~i", $link['title'])){

Also, why is it that you simply used multiple if statements? Is it because you are doing it within a foreach?
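On the multiple-if question: the choice is independent of the foreach. A hedged sketch of the difference follows, using the same patterns as the posted solution; the two function names are hypothetical and the type declarations assume PHP 7 or later. With separate if statements every pattern is tested, so a later match overwrites an earlier one, whereas an if / else if chain stops at the first match.

```php
<?php
// Patterns copied from the solution post; the array layout is an assumption for this sketch.
$rules = [
    'tent'     => '~tent~i',
    'bag'      => '~bag~i',
    'shoe'     => '~shoe|footwear|boot|sneaker~i',
    'clothing' => '~shirt|pants|parka|shorts|jacket~i',
];

// Equivalent to a series of separate if statements: the last matching rule wins.
function categorize_last_match(string $title, array $rules): string
{
    $category = 'uncategorized';
    foreach ($rules as $name => $pattern) {
        if (preg_match($pattern, $title)) {
            $category = $name;
        }
    }
    return $category;
}

// Equivalent to an if / else if chain: the first matching rule wins.
function categorize_first_match(string $title, array $rules): string
{
    foreach ($rules as $name => $pattern) {
        if (preg_match($pattern, $title)) {
            return $name;
        }
    }
    return 'uncategorized';
}

$title = 'FS: Clearing house - bags, bivy, tarp, daypack, tent';
echo categorize_last_match($title, $rules), "\n";  // bag
echo categorize_first_match($title, $rules), "\n"; // tent
```

This is why the "Clearing house" listing shows up under bag in the results above even though its title also contains "tent".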
ryanlitwiller Posted March 26, 2014

Also:

$links[] = array("title" => "$title","href" => "$href");

When you set up this array using the variables title and href, is that coming from simple_html_dom? Are there others?

Also, any recommendations for literature explaining more about arrays? The book I read, "PHP for Absolute Beginners", does not go into much detail.
ryanlitwiller Posted March 26, 2014

> So when you set up this array using variables title and href is that coming from simple_html_dom? Are there others?

Never mind, I see that you established these variables above:

$title = $aTagExt->plaintext;
$href = "http://www.backpackinglight.com".$aTagExt->href;
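For readers puzzling over the same array structure, here is a minimal sketch of what `$links` ends up holding. The titles are taken from the results earlier in the thread; the href values are made-up placeholders, not real listing URLs.

```php
<?php
$links = [];  // start with an empty list

// "$links[] = ..." appends a new element; each element is itself an associative array.
// The href values below are placeholders for illustration only.
$links[] = ['title' => 'WTB: Tarptent Contrail', 'href' => 'http://www.backpackinglight.com/example-1'];
$links[] = ['title' => 'FS: Nemo Obi 2P',        'href' => 'http://www.backpackinglight.com/example-2'];

// The outer index selects a listing, the inner key selects one of its fields.
echo $links[0]['title'], "\n";
echo count($links), " listings\n";

// foreach hands you one inner array at a time.
foreach ($links as $link) {
    echo $link['title'], ' -> ', $link['href'], "\n";
}
```

The keys "title" and "href" are just labels chosen when the array is built; they do not come from simple_html_dom.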
QuickOldCar Posted March 26, 2014

Sometimes the best place to learn something is from the source itself: http://www.php.net/manual/en/language.types.array.php

I used preg_match because I was looking at a single item in the loop. preg_match_all is useful when you want to find multiple instances of something and get all possible matches.
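A short sketch of that difference, using one of the titles from the results above (the combined pattern `~bag|tent~i` is made up for the example): preg_match stops after the first match, while preg_match_all keeps going and reports every match.

```php
<?php
$title = 'FS: Clearing house - bags, bivy, tarp, daypack, tent';

// preg_match: 1 if the pattern matches at least once, otherwise 0.
$found = preg_match('~bag|tent~i', $title);

// preg_match_all: the number of matches; $hits[0] holds the matched strings.
$count = preg_match_all('~bag|tent~i', $title, $hits);

echo $found, "\n";
echo $count, "\n";
print_r($hits[0]);
```

For a yes/no category test inside the loop, the return value of preg_match is all that is needed.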