silkfire, posted February 13, 2011
Hello everyone. I'm very new here, but I hope you can help me with a tricky problem. I no longer know how to approach it because the solution is difficult to visualize. I have a script that goes to the root of a site (with cURL) and picks up categories (links on the site) via regex. All the links are placed into one big array. I've managed to create the first layer (dimension), but the problem comes when I need my script to delve into deeper dimensions. For each link it finds, I want it to go to that page, find the subcategories there, and place them in the correct subarray. If the regex returns 0 matches, it should go up one step and move on to the next node's page, until the whole array has been exhausted. Is this possible? Please help out, guys and gals. I'll provide more info and code if requested.
Thread: https://forums.phpfreaks.com/topic/227573-recursively-fill-array-by-scraping/
stijnvb, posted February 13, 2011
Wouldn't it be easier to dump all this content into a MySQL table and have each row reference its parent's ID? That way you could regenerate a nice tree and store the scraped content for later use.
silkfire (Author), posted February 13, 2011
You're right, I was planning on putting everything into the database, but I wanted to create the multidimensional array first and then loop through it to create the database entries. But the problem persists: how do I do this? How do I go deeper into a category's tree, then return one step up when it can't find more categories, until the whole "array" is finished?
stijnvb, posted February 14, 2011
I honestly don't see the point of first putting all the content in an array and then inserting it into the DB. It's doing the same thing twice. Just make a simple table in MySQL (probably only 4 columns required: id, source/url, parent_id, content) and insert a new row for each page your bot visits. This gives you plenty of ways to visualise the data (right away or later on) as well as to process it. If you want to go with the array plan, I'm not sure how it's done, and that's why I'd personally go with the DB solution ;-)
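To illustrate the "regenerate a nice tree" idea, here is a minimal sketch of turning flat rows with a parent_id back into a nested array. The sample rows, the URLs, and the function name `buildTree` are assumptions made up for this example; in a real script the rows would come from the table described above, and MySQL drivers may return ids as strings, so the strict comparison would need a cast.

```php
<?php
// Hypothetical rows as they might come back from the table
// (id, url, parent_id); the URLs are invented for illustration.
$rows = [
    ['id' => 1, 'url' => '/',             'parent_id' => null],
    ['id' => 2, 'url' => '/animals',      'parent_id' => 1],
    ['id' => 3, 'url' => '/plants',       'parent_id' => 1],
    ['id' => 4, 'url' => '/animals/cats', 'parent_id' => 2],
];

/**
 * Turn flat rows into a nested tree by collecting, for each parent id,
 * the rows that point at it and recursing into each of them.
 */
function buildTree(array $rows, ?int $parentId = null): array
{
    $branch = [];
    foreach ($rows as $row) {
        if ($row['parent_id'] === $parentId) {
            $branch[] = [
                'url'      => $row['url'],
                'children' => buildTree($rows, $row['id']),
            ];
        }
    }
    return $branch;
}

$tree = buildTree($rows);
```

This rescans every row at each level, which is fine for a small category tree; for a large table, indexing the rows by parent_id first would avoid the repeated scans.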
silkfire (Author), posted February 14, 2011
I want a DB solution, but without data it will be useless... The database I can handle; it's the looping that confuses me. Let's say I'm at the root of this page. Here my crawler discovers 5 categories (links). Then I foreach over these links and can give each original link new "children". But how do I continue generating subcategories? It's the algorithm I need.
monkover, posted February 14, 2011
Just link your crawler to a DB and let it write all the links from one page into it. Once it has crawled one page, it reads the links from that DB and crawls those links...
stijnvb, posted February 14, 2011
Quote: "I want a DB solution, but without data it will be useless... The database I can handle; it's the looping that confuses me. Let's say I'm at the root of this page. Here my crawler discovers 5 categories (links). Then I foreach over these links and can give each original link new "children". But how do I continue generating subcategories? It's the algorithm I need."
In your DB you just add a parent_id field. It refers to the row id of the page that linked to this page. This way you can have as many levels of subcategories as you want. You could either check whether a page already exists in the DB (cross links) and ignore it, or store those cross links as well, depending on what you're planning to do with the information you're gathering.
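The parent_id approach can be sketched without a database at all. In the minimal example below, a plain PHP array stands in for the MySQL table, and a hypothetical `$fetchLinks` callable stands in for the cURL-plus-regex step; the site layout and the name `crawlToRows` are invented for illustration. Pages already stored are ignored, which is the first of the two cross-link options mentioned above.

```php
<?php
// Stand-in for the real fetch step: maps a URL to the links found on it.
// The site layout below is invented purely for illustration.
$site = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1', '/b'],   // '/b' here is a cross link
    '/b'   => [],
    '/a/1' => [],
];
$fetchLinks = function (string $url) use ($site): array {
    return $site[$url] ?? [];
};

/**
 * Crawl level by level, recording one row per page with its parent's id.
 * $rows plays the role of the table; ids are 1-based like AUTO_INCREMENT.
 */
function crawlToRows(string $rootUrl, callable $fetchLinks): array
{
    $rows = [['id' => 1, 'url' => $rootUrl, 'parent_id' => null]];
    $seen = [$rootUrl => true];
    // Walk the rows in insertion order; new rows are appended while we loop,
    // so the loop ends only when every stored page has been visited.
    for ($i = 0; $i < count($rows); $i++) {
        $current = $rows[$i];
        foreach ($fetchLinks($current['url']) as $link) {
            if (isset($seen[$link])) {
                continue; // already stored: ignore the cross link
            }
            $seen[$link] = true;
            $rows[] = [
                'id'        => count($rows) + 1,
                'url'       => $link,
                'parent_id' => $current['id'],
            ];
        }
    }
    return $rows;
}

$rows = crawlToRows('/', $fetchLinks);
```

Because each row carries the id of the page that linked to it, the parent/child relationship survives even though the crawl itself never "goes back up": the tree can be rebuilt from parent_id afterwards.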
silkfire (Author), posted February 14, 2011
Could you produce some code I could work with?
monkover, posted February 14, 2011
$sql = "CREATE TABLE table_name (
    ID int NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (ID)
    -- add the other columns here
)";
silkfire (Author), posted February 14, 2011
No, no, no... I'm not a noob or something. I need some code that would let me recursively add new subcategories until there are no more left. Is there anyone who really understands my problem here?
monkover, posted February 14, 2011
What do you mean by adding them until there are no more left? The way I understand it is as follows: you list all the links of page1 in a database (db1) which has a running ID. Then you create another sub-DB for each of the links you listed, and then another one for each of those.
stijnvb, posted February 14, 2011
Quote: "No, no, no... I'm not a noob or something. I need some code that would let me recursively add new subcategories until there are no more left. Is there anyone who really understands my problem here?"
I really don't see your problem... It's as easy as 1..2..3, but I guess I'm missing something.
silkfire (Author), posted February 14, 2011
Okay, let me explain. It's a page similar to Wikipedia's category pages with images. If the regex can't find any links (images will always be found), it should go back one step and continue with the next child. Imagine a tree: when a branch ends, go up one step, take the next child until that branch has ended, go up, take the next, and so on until the last element is reached. I tried a while (preg_match_all(...)), but when preg_match_all returns false it should go up one step, not stop crawling. Am I crazy or something? =) This crawler will create a structure for me, like a sitemap. When I click a node in this sitemap, the images for that category will show. I wanted a solution that indexes both images and categories, but separated them to just get a tree to start with. I guess that's impossible =/
monkover, posted February 14, 2011
Why do you want the crawler to go back? Just let it crawl every link, like every crawler does...
silkfire (Author), posted February 14, 2011
Because then it would only walk 1 branch! See it as a family tree: if a family has 3 kids and they each get some kids, it would only walk the branch of 1 ancestor kid until it reached the end. I want it to get the "grandchildren" of the 2 other "kids" too. Do you kinda get me?
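The family-tree walk described here is a depth-first traversal, and plain recursion gives the "go up one step" behaviour for free: when a page yields no matches, the function returns and the caller's foreach continues with the next sibling. The following is a minimal sketch under stated assumptions, not the poster's actual script: `crawlCategories`, the `$fetch` callable (which would wrap cURL in a real crawler), the sample pages, and the regex pattern are all hypothetical.

```php
<?php
/**
 * Depth-first crawl: returns a nested array of category URL => subcategories.
 * $fetch is a hypothetical callable returning the HTML for a URL; $pattern is
 * an assumed regex whose first capture group is the category link.
 */
function crawlCategories(string $url, callable $fetch, string $pattern, array &$visited = []): array
{
    $visited[$url] = true; // guard against pages that link back to each other
    $tree = [];
    // Zero matches means a leaf: the loop is skipped, the function returns,
    // and the caller's foreach moves on to the next sibling.
    if (preg_match_all($pattern, $fetch($url), $matches)) {
        foreach ($matches[1] as $link) {
            if (isset($visited[$link])) {
                continue;
            }
            $tree[$link] = crawlCategories($link, $fetch, $pattern, $visited);
        }
    }
    return $tree;
}

// Invented pages standing in for the real site.
$pages = [
    '/'    => '<a class="cat" href="/a">A</a><a class="cat" href="/b">B</a>',
    '/a'   => '<a class="cat" href="/a/1">A1</a>',
    '/b'   => '<a class="cat" href="/b/1">B1</a>',
    '/a/1' => '<img src="x.png">',
    '/b/1' => '<img src="y.png">',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};

$tree = crawlCategories('/', $fetch, '~<a class="cat" href="([^"]+)"~');
```

The difference from a `while (preg_match_all(...))` loop is that each level of the tree gets its own function call, so running out of links on one page ends only that call, not the whole crawl. On a very deep site the recursion depth could become a concern, in which case an explicit stack would be the usual alternative.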
monkover, posted February 14, 2011
Uhm, no... Just make a database that contains all the links. Let the program mark every link it has read, so it will crawl all the links...
silkfire (Author), posted February 14, 2011
How will it know who's the child of whom? You don't really think, man...
monkover, posted February 14, 2011
Lol, yes I do... but maybe you want to think a bit. Obviously it works like that, because I've made a crawler that way.
silkfire (Author), posted February 14, 2011
Care to share some code and logic, please?
This topic is now archived and is closed to further replies.