guymclarenza Posted February 14, 2021

In order to solve my problem I have been told to use multithreading. In my research I found the parallel extension, since pthreads is not recommended for web server environments. What is CLI? php.net has the info, but I may just be a little stupid. While looking for more information I found something called Composer, which I may need to install, but I am using shared hosting so that may not be possible. I have found that many hosts are unwilling to do anything beyond the basics.

1. Can parallel be used without installing these added dependencies or packages?
2. Is there a simple-to-understand explanation of this somewhere? I have Googled but am just getting more and more confused.
requinix Posted February 14, 2021

2 hours ago, guymclarenza said:
"In order to solve my problem I have been told to use multithreading."

Hold on. What?
guymclarenza (Author) Posted February 14, 2021

13 hours ago, requinix said:
"Hold on. What?"

Running processes in parallel so as to speed up the script.
requinix Posted February 14, 2021

44 minutes ago, guymclarenza said:
"Running processes in parallel so as to speed up the script."

That may or may not be multithreading, depending on what you mean. But anyway, I was more interested in why someone said that you needed "multithreading". What is being slow, and why is "multithreading" supposed to help?
guymclarenza (Author) Posted February 14, 2021

My crawler runs for up to 30 minutes to return 40-50 results on my dev machine, and the moment I try to run it on the web server it times out within a few minutes. I am looking for a way to reduce the time to less than 3 minutes: instead of doing everything in one queue, break up the queue and run concurrent queues. Instead of crawling one page at a time, crawl multiple pages simultaneously.

A little knowledge is dangerous. I could see that the script I created from the tutorial was not much good. I have been trying to solve some of its problems, broke the script a few times, and got it to do what I wanted, but it seems it's not very efficient. My goal now is to learn how to make it faster. I have even been looking at Python to see whether that might be a better way forward.

All this confusion was caused by a "build a search engine like Google" tutorial on Udemy: I found its flaws and went looking for solutions to them. The deeper I dig the more confused I get, which is why I am looking for advice on finding good tutorials, so that I can skip the shitty ones. The bloke who runs said Udemy course said I should look at multithreading; I suspect it's a case of the blind leading the blind.
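To show what I mean by crawling multiple pages at once, here is a rough, untested sketch using curl_multi; the URL list is made up, and the real one would come from my queue:

```php
<?php
// Untested sketch: fetch several pages at once with curl_multi instead of
// one curl_exec() per page in a loop. The URL list here is made up.
$urls = [
    'https://example.com/',
    'https://example.com/about',
    'https://example.com/contact',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until every one has finished.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of spinning
    }
} while ($active && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch); // page body, ready to be parsed for links
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    // ...extract links from $html here...
}

curl_multi_close($mh);
```

That way the total wait is roughly the slowest page rather than the sum of all of them. Is this the right direction?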
requinix Posted February 14, 2021

Yeah, no, multithreading isn't the answer. Concurrency is. Meaning you have this script that can do the crawling, and you run the script multiple times in parallel.

But first, 30 minutes to get 40-50 results is absurd. Have you looked into why it's taking that long? It's ridiculous.
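To illustrate the "run it multiple times" idea, a bare-bones sketch; "crawler.php" and the --chunk argument are placeholders for your actual script and however you split the queue between workers:

```php
<?php
// Bare-bones sketch: launch several copies of the crawler at once, then
// wait for them all. "crawler.php" and "--chunk" are placeholders for
// your actual script and however you divide the queue between workers.
$workers = 4;
$procs = [];

for ($i = 0; $i < $workers; $i++) {
    $pipes = [];
    $procs[] = proc_open('php crawler.php --chunk=' . $i, [], $pipes); // returns immediately
}

// proc_close() blocks until that worker exits, so this waits for all of them.
foreach ($procs as $proc) {
    if (is_resource($proc)) {
        proc_close($proc);
    }
}
```

Each worker is an ordinary PHP process, so nothing about your hosting has to support threads at all.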
guymclarenza (Author) Posted February 15, 2021

The problems as I see them are as follows. It crawls a page, gets links, then has to discard duplicates; I think the hold-up is there. I am removing duplicates after fixing the URL. Maybe it would be better to strip out all duplicates before "fixing" the URL. To get 50 results, it is crawling and doing the whole process on 50 pages. Does this make sense?

Proposed logic:
follow links
add links to array
remove duplicates
fix links
echo links
repeat

At present the logic is:
follow links
fix links
remove duplicates
echo links
repeat
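Something like this is what I have in mind for the duplicate check, if that helps; fixLink() is a stand-in for my real URL-fixing code and $rawLinks is sample data:

```php
<?php
// What I mean by removing duplicates first: use the raw link as an array
// key so each one is handled once. fixLink() is a stand-in for my real
// URL-fixing code; $rawLinks is sample data.
function fixLink(string $link): string
{
    return $link; // placeholder: the real version resolves relative URLs etc.
}

$rawLinks = ['/about', '/about', 'contact.html', '/about'];

$seen  = [];
$queue = [];

foreach ($rawLinks as $link) {
    if (isset($seen[$link])) {
        continue; // duplicate: skip before doing any work on it
    }
    $seen[$link] = true;
    $queue[] = fixLink($link);
}

print_r($queue); // /about and contact.html, each once
```

(Using array keys means each duplicate check is a single hash lookup rather than a scan of the whole list, so it stays fast as the list grows.)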
requinix Posted February 15, 2021

Scraping a page should take maybe one second. Dealing with the database, a fraction of a second. All in all, 40-50 pages should take, like, a minute. I can't believe that dealing with duplicates takes up the other 29. What's your code?
guymclarenza (Author) Posted February 15, 2021

Busy looking at Scrapy in Python.
kicken Posted February 15, 2021

For reference, the example code I posted in your other thread was able to crawl all 127 pages of my site in about 20 seconds.