
TheStudent2023

Members
  • Posts

    136
  • Joined

  • Last visited

Everything posted by TheStudent2023

  1. @kicken I don't mind. But can you see whether I understood the tutorial or not? Description strlen ( string $string ) : int Q1. The type to the right of the colon is the type of data the function returns. Yes? Finally, I learnt what the colon means in function syntax explanations! The manual adds: "We could rewrite the above function definition in a generic way: function name ( parameter type parameter name ) : returned type" Yes. The manual should have shown this generic form on all of its function pages, as it is easier to understand! Q2. in_array ( mixed $needle, array $haystack , bool $strict = false ) : bool The tutorial does not explain what the $ means here. For these 6 yrs, whenever I looked at a function syntax and saw the $, I thought they were $vars. But now I understand that they are not. The $ indicates that the word following it is the function's parameter name (and not a variable name that can be passed in as the parameter). Correct? Silly sods! They should have CAPITALISED the parameter names instead. Like so, AT LEAST: in_array ( mixed NEEDLE, array HAYSTACK , bool STRICT = false ) : bool Better still, had they written the function syntax like this: in_array ( mixed NEEDLE, array HAYSTACK , bool STRICT = false ) : bool Here, italic represents "data type", CAPITALS represent "param name" and underline represents "default value". Now, is this not a better way to get the message through to others? Yes. I always knew I was never an inventor but always an improver of inventions. That is my strength. I can figure out how to simplify things. I am not a good learner. But whatever I manage to learn with hiccups (by struggling), I can easily teach others in a simplified way that helps them learn the thing, remember it and not forget it. A student once remarked, 21 yrs ago, that I teach better than the teacher himself. I agree. Thanks!
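To make the signature notation above concrete, here is a minimal sketch of a user-defined function written in the same shape the manual uses. The function name `count_vowels` and its parameters are hypothetical, invented only for illustration. One thing it may help to see: the `$` names in a signature are real variables, but they exist inside the function; the caller can pass any value or any variable of its own.

```php
<?php
// Hypothetical function, shaped like a manual entry:
// count_vowels ( string $text , bool $countY = false ) : int
function count_vowels(string $text, bool $countY = false): int
{
    // Inside the function, $text and $countY are ordinary variables.
    $vowels = $countY ? 'aeiouy' : 'aeiou';
    $total = 0;
    for ($i = 0; $i < strlen($text); $i++) {
        if (stripos($vowels, $text[$i]) !== false) {
            $total++;
        }
    }
    return $total; // must be an int, as promised after the colon
}

$word = 'Sky high';                     // the caller's variable can have any name
echo count_vowels($word), "\n";         // $countY falls back to its default, false
echo count_vowels('Sky high', true), "\n";
```

The type before each parameter says what may be passed in, the `= false` marks an optional parameter with its default, and the type after the colon is what comes back.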
  2. Thanks. I do understand you, though. It's just that, back in late 2015, I started learning php from php.net. Then I found it too complicated. I tried tutorial sites instead and learnt the very basics. I enrolled in a php class, but before long the teacher quit his job and got his brother to take over his position. The brother was able to teach all the other subjects, like css. But he did not know php, and I found myself in a position where I would have to teach him instead. I quit the school. And found myself dropped in the middle of the sea, not knowing which direction to swim in (where the shore is). I put php on hold till early 2017 and have stuck to it since. So, by myself at home, I started learning php, but not in proper order. That is why I do not know some fundamentals. I got no proper guidance. I just learn by reading tutorials here and there online, and you know very well that most tutorials are outdated. I found that out the hard way. Had I known pdo was new, I never would have bothered with mysqli. That is one example. Thanks for the links. Your link begins with: "Each function in the manual is documented for quick reference. Knowing how to read and understand the text will make learning PHP much easier. Rather than relying on examples or cut/paste, everyone should know how to read function definitions (prototypes)." They took the words right out of my mouth! Because that is what I have been doing for 6 yrs now. I do not know how to read the function syntax explanations in the manual and always rely on code examples to figure out how the functions work.
  3. Sorry, mate. Sometimes your sentences do not make sense due to grammatical errors. Also, you do not put commas in the right places. And so, I find it hard to understand your sentences. I always understand your replies 50-75%. Never 100%. Gonna re-read your sentences a few times to see what you are trying to tell me.
  4. @requinix Ok. One question, though. Had I simply quoted your previous post and not tagged you in this post, would you have been notified of the reply or not? Or would you only have seen my reply (to you) if you had made your way to this thread again? I am asking because I want to know how this forum functions technically. And which vps hosts have you found to be good, and which ones nightmares?
  5. @requinix Can you delete your post, because it is stopping me from editing my previous post? I had typos in my previous post. I edited it and added some more sentences, and now it says I cannot edit the post. The post is not getting edited/updated. Or, better, delete my previous post and I will post the edited version anew.
  6. @ginerjm Yeah, it was me who you probably blocked about 2yrs ago. Told me not to tag you too. On many other forums you were helpful at first but at the end were downright rude, short tempered. going ballistic now and then. Most of the replies you gave on my threads were RTFM. At the end I started hating you for your tantrums but never told you so because I am middle aged and no kiddo. No other user behaved like a short circuit, like you did. And you probably got Requinix to ban me here. Talking about 2yrs back when I was using another Username. 3-4yrs back, Requinix also banned me from here. That time, was also using another Username. He banned me because Benanamen went around forums asking mods to ban me simply because I asked the same questions on many forums (10-20) to see what programmers from different walks of life answered. i liked getting different flavoured answers. At the end, most programmers did not care that I asked the same questions on many forums and were still willing to help and answering my questions. I guess they are matured people. And, Benanamen was getting jealous & frustrated to that. That programmers were still answering my threads. At the end, he skulkingly resorted to pester mods to ban me. I got banned from most forums due to so-called "cross posting". No thanks to HIM. Kicken was always on my good books. Always polite, even though I always pestered or harassed him tagging him now and then to put attention to my posts. Still do. Barand answered me now and then too in the past. I had forgotten about him too. Occasinally he answers me this time round. Mac_gyver, I could never get his attentions 2yrs back and failing this time too. He always answers my threads where I ask for feed-back. But never responds to threads where I show code & errors. he acts mostly like a code feed-back giver. Requinix was helpful 2yrs back. Responded to my threads most of the time. And 3-4yrs back, he was very helpful. 
So originally, I signed up to this forum about 4yrs back. A yr later or so, Requinix banned me. Then, 2yrs ago, I returned back with another Username. And Requinix banned me again. Then, I only concentrated on webdevelopers.com forum for the past 2yrs since I was banned from the other forums, no thanks to Benanamen and maybe also you. Since webdevelopers.com is no longer in operation for a month now and I need answers. I returned back to this forum with another username and also returned back to sitepoints.com with another username. Missed this forum and sitepoints.com for 2yrs. So, I returned back to both a month ago. Mod Gandalf over there always recognises me, no matter how many different usernames I use when returning back to them. In the past 2yrs, returned back to them 2-4 times with different Usernames and each time got caught. Banned me each time. Probably traces my mac address. He had a hard time banning me this time (when I returned there a month ago or so. Returned the same time I returned back here). Because when he banned my username, I then logged into the forum using gmail and opened threads and got all my answers. He had a hard time to ban me when I do not login using a username but login using gmail. For some reason, they fixed this loophole and for a wk now I am unable to open threads there. But, while I was away from this forum and all others, including sitepoint.com, for 2yrs and was active at webdevelopers.com forum, I forgot about good old Kicken. When I returned back to this forum after 2yrs and he started responding to my threads, I remembered him again. I am no longer at webdevelopers.com because I think that forum is either sold or down. But you would know better because you were there too. Over there, you was a bit haughty with me but not downright rude like you were 2yrs back over here or at some other forum. 
But, I get the feeling, if Benanamen catches up with me again then he will or maybe you will try putting Kicken off from answering my posts and probably would get Requinix or mac_gyver ban me again. I know I am risking a ban by confessing I am a banned person here. Actually, I am currently at another forum too but I am not naming them here. I used to get good responses there for about 6 mnths approx now but now that forum has gone quiet. Not sure, if you were over there or not. But, I do remember you were at webdevelopers.com. Good old Mod NogDog, I always used to tag and pester, more than I pester Kicken here. And he always was helpful. At the end, he gave less responses. Probably got tired of me tagging him nearly everyday. But he was not rude, though. I guess he is a matured person like Kicken here. Anyway, this time, after I returned about 2 yrs later. I find you very calm and patient and responsive, even though not that helpful as some of the others and was beginning to like you. Whether I still continue to like you or no, will be based on your attitude, behaviour. If you start giving no proper response other than writing RTFM, then you will get on my bad books. If this Username stops operating from tonight, then it means Requinix or Mac_gyver banned me again. But, I can always return back with a different username, email and mach address. And I am good at disguising my writing style. But, I hope I won't get banned. Mac_gyver might fee frustrated that, he spent time & effort giving feed-back to someone he or Requinix banned 2yrs back or 4yrs back. I got banned twice before. Not sure who banned me the first time. Mac_Gyver or Requinix. Not sure which of them banned me the 2nd time too. But it was either of them on both occassions. Now, you have a serious question. Why am I getting you fine folks to review some old code. Well, you see, I learn many different ways of coding to achieve the same purpose. And then experiment which gives better results. 
Like faster responses. I write a piece of code, then derive more out of it by writing many versions in different ways. Then I experiment, play, fiddle, test, etc. with them. Procrastinate. At the end, I harass the pros to choose which version to stick to and give their reasons why they chose that version over the others. Like I am doing now in my op. It is my habit to do so. Oh btw, this is my habit. I open threads. See who responds. Start liking them. And then, whenever I open more threads on other days, I expect the same pros to respond. When I get no response, I start tagging them. Sometimes I open threads and do not wait but start tagging them, if I see they are not complaining. You started complaining 2 yrs back, and so bit by bit I stopped tagging you to stay in your good books. As for RTFM, I do not know how to read the manual's syntaxes. That is why the manual is no use to me unless there are code examples that I manage to understand. Now, if you teach me how to read & understand the manual's function syntaxes, then I should be able to stick to the manual from now on. Your choice. PS - Do you remember the uniqueideaman username? That was my original username on many forums back in 2017. I think it got banned on all forums (approx 10, no thanks to that Benanamen), probably around 2020. That uniqueideaman was a famous username. All users in all forums liked me back then. No one really complained but Benanamen, resulting in my ban. Frankly, I can't remember which usernames I used on this forum apart from the current one and uniqueideaman. Try googling for the uniqueideaman username and see what you get. So yes, I started learning php in Feb 2017 and am still at procedural programming, mysqli and prepared statements. Even though I promised programmers on many forums over the yrs that I would migrate to pdo, I still have not. I just hate the syntax. All that ":::" stuff. It does not sink into my head. As for pdodelusions.com: I read it a few yrs back, probably 2-5 yrs. Read it again about 3 mnths ago. 
But I easily forget what I learnt on it. Hence, stick to what I can remember and that is mysqli_ and prepared statements and procedural programming. Attempted learning oop twice. A yr ago. And probably 3yrs ago. But I just do not understand what is an "object". Confuses me. Someone explained here what it is and I grasped little bit, this time. But, I am not gonna get into oop in php. Once my current projects are complete, I will jump to Python which I hear is easier to learn as they teach it to 12-13yr olds in UK & USA. So, should seem easy to a nearly late 40's guy. Guessing you in your late 20's or early 30's, if not late teens. So what have I been doing with php for the past 6yrs ? Google for uniqueideaman username and findout. Basically, been busy to learn how to build html forms and submit data to db and query db and present results in pagination format. Build membership/account pages like reg, login, logout, search page, etc. That is all. Anyway, Ginerjm, thanks for all your time, effort & helps.
  7. "a human ...". You talk as if you are not human. Are you ChatGPT? Lol! @mac_gyver told me not to include this: ini_set('display_startup_errors', 1); But he did not reply when I asked him to elaborate on why I should not add it. https://forums.phpfreaks.com/topic/316225-pagination-1-mysqli_stmtm_store_result-query/#comment-1607916
  8. @mac_gyver Frankly, I do not understand the function syntaxes. array_filter(array $array, ?callable $callback = null, int $mode = 0): array What is the above explaining? If I can get the hang of understanding things like this, then there are a lot of questions I won't need to ask. I mean, should the syntax not have been like this: array_filter(array, ?callable, int): array And the example like this: array_filter($array, ?$callback = null, $mode = 0): array But mixing the above 2 up and making the following is confusing syntax: array_filter(array $array, ?callable $callback = null, int $mode = 0): array
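As a rough illustration of how each piece of that `array_filter` signature maps onto a real call, here is a small sketch. The sample arrays are made up; the function and the `ARRAY_FILTER_USE_KEY` constant are standard PHP.

```php
<?php
$numbers = [1, 2, 3, 4, 5, 6];

// "array $array" and "callable $callback": keep the values for which
// the callback returns true. Original keys are preserved.
$even = array_filter($numbers, function (int $n): bool {
    return $n % 2 === 0;
});

// "?callable ... = null": the callback is optional. Without one,
// values that count as empty (0, '', null, false, ...) are dropped.
$truthy = array_filter([0, 'a', '', null, 'b']);

// "int $mode = 0": a flag that changes what the callback receives.
// ARRAY_FILTER_USE_KEY passes each key instead of each value.
$withoutB = array_filter(
    ['a' => 1, 'b' => 2, 'c' => 3],
    function (string $key): bool {
        return $key !== 'b';
    },
    ARRAY_FILTER_USE_KEY
);

// ": array" is the return type, so every result above is an array.
print_r($even);
print_r($truthy);
print_r($withoutB);
```

So the signature is not an example call: the types describe what each argument must be, and the `$names` only label the arguments so the manual can refer to them in the description.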
  9. @kicken I did read the manual. Did not understand what is callback. Reading again. So, what is callback ?
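In case a concrete picture helps: a callback is simply a function that you hand to another function, so that the other function can call it when it needs to. A minimal sketch with `array_map`, where the helper name `shout` and the sample words are made up for illustration:

```php
<?php
// A normal named function (hypothetical name).
function shout(string $word): string
{
    return strtoupper($word) . '!';
}

$words = ['php', 'dom'];

// 1. Pass the callback by name, as a string.
$shouted = array_map('shout', $words);

// 2. Pass an anonymous function (closure) written on the spot.
$reversed = array_map(function (string $word): string {
    return strrev($word);
}, $words);

// 3. Pass an arrow function (needs PHP 7.4 or newer).
$lengths = array_map(fn (string $word): int => strlen($word), $words);

print_r($shouted);
print_r($reversed);
print_r($lengths);
```

In all three cases `array_map` does the looping and calls the callback once per element; the callback only decides what happens to each element.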
  10. Ginerjm, I did check. But where is the $var defined? That was my original question. What is the value of $var to begin with?
  11. @barand Thanks a lot. I appreciate it. But unfortunately it is nearly sunrise here, so I have got to check your link out tomorrow night. In the meanwhile, care to show me 2 snippets (one using DomDoc, the other using simple_html_dom) for scraping inputs from a text input field (<input type="text">) in an html form or from a search box (like google's)? Once I have learnt that from you, I will try scraping text inputs from text blocks (<textarea>). I am getting warmed up to learn scraping, as it would make it easier for me to finish building the web crawler. Spider. Thanks!
  12. @requinix I believe you have done web scraping before, but with what? DomDocument or simple_html_dom? Do you know how to scrape an html form's dropdown options? Say you want to scrape all the options from the dropdown you see here: https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_select How would you write the code with DomDocument, and how would you with simple_html_dom? If you can be kind enough to show me these 2, then I will try learning them and then try writing code myself for a scraper that scrapes options from a radio button and from a checkbox. Then I can go deeper and scrape options from multi dropdowns and so on. Go deeper down the rabbit hole. As of now, I have no clue where to start. So, care to guide me?
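For the DomDocument half of that request, a minimal sketch might look like the following. The HTML string is invented sample markup, and note one assumption: a scraper only sees what is in the page source, so it can read a field's `value` attribute (or a textarea's initial text), not what a visitor later types into it.

```php
<?php
// Invented sample markup standing in for a fetched page.
$html = '<form>'
      . '<input type="text" name="q" value="php dom">'
      . '<input type="hidden" name="token" value="x">'
      . '<textarea name="msg">Hello</textarea>'
      . '</form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$doc->loadHTML($html);
libxml_clear_errors();

$fields = [];

// Text inputs: keep only <input type="text">, keyed by the name attribute.
foreach ($doc->getElementsByTagName('input') as $input) {
    if (strtolower($input->getAttribute('type')) === 'text') {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
}

// Textareas keep their initial content between the tags.
foreach ($doc->getElementsByTagName('textarea') as $area) {
    $fields[$area->getAttribute('name')] = $area->textContent;
}

print_r($fields);
```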
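A possible starting point for the DomDocument version, as a sketch only. The markup below is a made-up sample modelled on a simple cars dropdown; on a live page you would first fetch the HTML, and a "try it" editor page may wrap its example in other markup, so the XPath would need adjusting.

```php
<?php
// Made-up sample markup; a real page would be fetched first.
$html = '<select name="cars" id="cars">'
      . '<option value="volvo">Volvo</option>'
      . '<option value="saab">Saab</option>'
      . '<option value="opel">Opel</option>'
      . '<option value="audi">Audi</option>'
      . '</select>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// XPath: every <option> directly inside the <select> named "cars".
$xpath = new DOMXPath($doc);
$options = [];
foreach ($xpath->query('//select[@name="cars"]/option') as $option) {
    // value attribute => visible label
    $options[$option->getAttribute('value')] = trim($option->textContent);
}

print_r($options);
```

Radio buttons and checkboxes follow the same idea, with a query such as `//input[@type="radio"]` and the `name` and `value` attributes read from each node.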
  13. @mac_gyver If you do not mind, can you show me 2 things ? How to extract link anchors using: A). DomDocument; B). simple_html_dom. And show me where you learnt the snippets from so I can learn more tag extractions from the documents rather than needing to pester you like this.
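For part A, one way it could be done with DomDocument is sketched below, using an invented HTML string instead of a live page. The methods used (`getElementsByTagName`, `hasAttribute`, `getAttribute`, `textContent`) are documented in the PHP manual under the DOMDocument, DOMElement and DOMNode classes.

```php
<?php
// Invented sample markup.
$html = '<p><a href="/about">About us</a> '
      . '<a href="https://example.com/">Example</a> '
      . '<a name="top">No href here</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if (!$anchor->hasAttribute('href')) {
        continue; // skip named anchors that link nowhere
    }
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        'text' => trim($anchor->textContent), // the anchor text
    ];
}

print_r($links);
```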
  14. Can someone show me how to extract meta tags & the page title using simple_html_dom()? I want to compare its code with DomDocument's code. That is all.
  15. Fellow Programmers, General Php to extract Meta Tags <?php $meta_tags = get_meta_tags('http://www.example.com/'); print_r($meta_tags); ?> // Output: Array ( [keywords] => this is the keywords [description] => this is the description ) DomDocument to extract Meta Tags <?php function file_get_contents_curl($url) { $ch = curl_init(); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); $data = curl_exec($ch); curl_close($ch); return $data; } $url = "http://www.example.com/code"; $html = file_get_contents_curl($url); // Load HTML to DOM object $doc = new DOMDocument(); @$doc->loadHTML($html); // Parse DOM to get Title data $nodes = $doc->getElementsByTagName('title'); $title = $nodes->item(0)->nodeValue; // Parse DOM to get metadata $metas = $doc->getElementsByTagName('meta'); for ($i = 0; $i < $metas->length; $i++) { $meta = $metas->item($i); if($meta->getAttribute('name') == 'description') $description = $meta->getAttribute('content'); if($meta->getAttribute('name') == 'keywords') $keywords = $meta->getAttribute('content'); } echo "Title: $title". '<br/><br/>'; echo "Description: $description". '<br/><br/>'; echo "Keywords: $keywords"; ?> Can anyone shorten the above without sacrificing quality?
  16. Programmers, Which one of the following 2 is newer, and why do some prefer one over the other? What are their strengths & weaknesses when compared with each other? DomDocument vs. simple_html_dom.
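One possible shorter take on the DomDocument version, offered as a sketch rather than the definitive answer: it swaps the manual loop for DOMXPath queries. The helper name `extract_page_meta` is hypothetical, and the HTML string stands in for whatever the fetch step returns.

```php
<?php
// Hypothetical helper: pulls title, description and keywords from an HTML string.
function extract_page_meta(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of using @
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // string(...) yields '' when nothing matches, so no undefined variables.
    return [
        'title'       => $xpath->evaluate('string(//title)'),
        'description' => $xpath->evaluate('string(//meta[@name="description"]/@content)'),
        'keywords'    => $xpath->evaluate('string(//meta[@name="keywords"]/@content)'),
    ];
}

// Sample markup; in practice $html would come from file_get_contents() or cURL.
$html = '<html><head><title>Demo page</title>'
      . '<meta name="description" content="A short demo">'
      . '<meta name="keywords" content="php, dom">'
      . '</head><body></body></html>';

$meta = extract_page_meta($html);
print_r($meta);
```

A side effect of this shape is that a page with no description or keywords tag gives empty strings instead of "undefined variable" notices.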
  17. @kicken Check it out: <?php $xml = file_get_contents($sitemapUrl); //Should I stick to this line or below line ? // parse the sitemap content to object $xml = simplexml_load_string($sitemapUrl); //Should I stick to this line or above line ? $dom = new DOMDocument(); $dom->loadXML($xml); if ($dom->nodeName === 'sitemapindex') { //parse the index // retrieve properties from the sitemap object foreach ($xml->urlset as $urlElement) //Extracts html file urls. { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } } else if ($dom->nodeName === 'urlset') { //parse url set // retrieve properties from the sitemap object foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls. { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } } else { //some other I am stuck on the last ELSE, as I do not know of any other format for php to check. No other formats are listed here: https://www.sitemaps.org/protocol.html Obviously, you had something in mind. What did you have in mind for that ELSE? Can you finish that ELSE of yours? Then I can move on to writing code to extract the meta tags.
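A hedged sketch of how that three-way check could be written with DomDocument alone. It rests on a few assumptions: the helper name `read_sitemap` is hypothetical, the XML is passed in as a string that has already been fetched, and the final else is read here as "anything that is not a sitemap is an error", which is only one plausible intention. Note that the document object itself reports `#document` as its nodeName, so the check goes through `documentElement`.

```php
<?php
// Hypothetical helper: takes sitemap XML as a string, returns its type and <loc> values.
function read_sitemap(string $xmlString): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $loaded = $dom->loadXML($xmlString);
    libxml_clear_errors();
    if (!$loaded) {
        throw new RuntimeException('The content is not well-formed XML.');
    }

    // The root element, not the document, carries the name we want.
    $root = $dom->documentElement->nodeName;

    if ($root === 'urlset') {
        $type = 'pages';    // <loc> values are web pages
    } elseif ($root === 'sitemapindex') {
        $type = 'sitemaps'; // <loc> values are further sitemaps
    } else {
        // Some other XML (an RSS feed, an error page, ...): nothing to extract.
        throw new UnexpectedValueException("Unexpected root element: $root");
    }

    $locs = [];
    foreach ($dom->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->textContent);
    }

    return ['type' => $type, 'locs' => $locs];
}

$index = '<?xml version="1.0" encoding="UTF-8"?>'
       . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       . '<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>'
       . '</sitemapindex>';

print_r(read_sitemap($index));
```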
  18. @kicken Correction: I meant these: Extract Urls foreach ($xml->urlset as $urlElement) //Extracts html file urls. { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } 2. Extract SiteMaps foreach ($xml->sitemapindex as $urlElement) //Extracts Sitemap Urls. { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } So far, so good ?
  19. Thanks. Bear with me. All this extractors (simple_html_dom, DomDocument) never got through to my head. Let me see if I understanding you or not. In the html language, we call these tag names: <a = href tag <title> = title tag. And so on. Q1. In php or PARSER lang, you do not say these are 'tag' but 'node'. Right ? In html lang, I understand about parent tags and child tags. Do not worry. Currently, I am over here: https://www.sitemaps.org/protocol.html I can see 2 Xml link listing formats: 1. <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://www.example.com/</loc> <lastmod>2005-01-01</lastmod> <changefreq>monthly</changefreq> <priority>0.8</priority> </url> <url> <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc> <changefreq>weekly</changefreq> </url> <url> <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc> <lastmod>2004-12-23</lastmod> <changefreq>weekly</changefreq> </url> <url> <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc> <lastmod>2004-12-23T18:00:15+00:00</lastmod> <priority>0.3</priority> </url> <url> <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc> <lastmod>2004-11-23</lastmod> </url> </urlset> 2. <?xml version="1.0" encoding="UTF-8"?> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <sitemap> <loc>http://www.example.com/sitemap1.xml.gz</loc> <lastmod>2004-10-01T18:23:17+00:00</lastmod> </sitemap> <sitemap> <loc>http://www.example.com/sitemap2.xml.gz</loc> <lastmod>2005-01-01</lastmod> </sitemap> </sitemapindex> First format lists a tag links. Second format lists xml links to further xml sitemaps. I get that part. Now, to get the php to determine which format is on the page, you said I must write code for it to check the parent node. Right ? 
So, if it finds the tag/node name is "urlset": <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> Then, I need to write code for php to jump to the <loc tag. And dump the extracted url to hrefs files array (for example) by identifying the url as an html, php file etc (but not another xml file). And, if it finds the tag/node name is "<sitemapindex": <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> Then, I need to write code for php to jump to the <loc tag. And dump the extracted url to xml files array (for example) by identifying the extracted url as another xml file and not an html, php, etc webpage file. Q2. Did I understand you so far ? (Remember, I am a beginner level programmer with no other programming background and so my questions will sound stupid to you. Now, I got this particular code from a programmer few weeks back: $sitemap = 'https://www.***/home-sitemap.xml'; // get sitemap content $content = file_get_contents($sitemap); // parse the sitemap content to object $xml = simplexml_load_string($content); // retrieve properties from the sitemap object foreach ($xml->url as $urlElement) //Extracts html file urls. //foreach ($xml->sitemap as $urlElement) //Extracts Sitemap Urls. { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } I am guessing, the above extracts links from the 1st format. Q3 And to get it to extract links (xml files) from the 2nd format, I must change this: foreach ($xml->url as $urlElement) //Extracts html file urls. to either this: foreach ($xml->sitemap as $urlElement) //Extracts html file urls. Or this: foreach ($xml->sitemapindex as $urlElement) //Extracts html file urls.
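On Q3, a small sketch that may help, assuming the SimpleXML approach from the snippet above. With SimpleXML, the object returned by `simplexml_load_string()` already is the root element, so `$xml->sitemapindex` looks for a child of that name and finds nothing; the children of an index are `<sitemap>` elements. The XML string is a shortened copy of the second format quoted above.

```php
<?php
$indexXml = '<?xml version="1.0" encoding="UTF-8"?>'
          . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          . '<sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>'
          . '<sitemap><loc>http://www.example.com/sitemap2.xml.gz</loc></sitemap>'
          . '</sitemapindex>';

$xml = simplexml_load_string($indexXml);

// The root element's own name tells the two formats apart.
$rootName = $xml->getName();

// An index holds <sitemap> children, a urlset holds <url> children.
$children = ($rootName === 'sitemapindex') ? $xml->sitemap : $xml->url;

$locs = [];
foreach ($children as $child) {
    $locs[] = (string) $child->loc; // cast: properties are objects, not strings
}

echo $rootName, "\n";
print_r($locs);
echo count($xml->sitemapindex), "\n"; // the root is not a child of itself
```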
  20. @ginerjm I am failing to find a list of NodeNames on DomDocument: https://www.php.net/manual/en/class.domnode.php#domnode.props.nodename Do you know the correct link? Asking as you know your DomDocument stuff.
  21. @kicken I do not like copying others' code, as I will never learn to walk by myself. I might as well nag you a little longer and learn how to use the DOMDOCUMENT. I see you memorised this part: [code] $dom = new DOMDocument(); $dom->loadXML($xml); [/code] found here: https://www.php.net/manual/en/domdocument.loadxml.php But where in the manual did you learn this part: [code] if ($dom->nodeName [/code] This code looks at an Xml link's element, I am guessing. Where in the DomDocument manual is the particular page that shows this particular line of code? I cannot find it. Look, nowhere is there a link that teaches how to write DomDocument code that checks a link's element or NodeName: https://www.php.net/domdocument Can you kindly point the right link out to me? EDIT: I tried many times, but the ChatGpt link you gave fails to load: https://imgur.com/fPm8mjD Same on your end or not? Thanks
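As far as I understand it, there is no fixed list to find: for an element, `nodeName` is simply whatever tag name the loaded document uses. A tiny sketch with a made-up XML string shows what the property holds at different levels:

```php
<?php
$dom = new DOMDocument();
// No whitespace between tags, so firstChild is the <url> element itself.
$dom->loadXML('<urlset><url><loc>http://www.example.com/</loc></url></urlset>');

$documentName = $dom->nodeName;                              // the document node
$rootName     = $dom->documentElement->nodeName;             // the outermost tag
$childName    = $dom->documentElement->firstChild->nodeName; // first tag inside it

echo $documentName, "\n";
echo $rootName, "\n";
echo $childName, "\n";
```

The only names that do not come from the document are the special ones for non-element nodes, such as `#document` for the document and `#text` for text nodes.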
  22. @kicken, I made the first move. Here you go: <?php $url = "https://techalltype.com/"; $html = file_get_contents($url); $doc = new \DOMDocument('1.0', 'UTF-8'); /* instance of DOMDocument */ @$doc->loadHTML($html); /*The function parses the HTML contained in the string source */ $xpath = new \DOMXpath($doc); /*to retrieve selected html data */ $nodes = $xpath->query('//a'); foreach($nodes as $key => $node) { echo $key++.".) ".$node->getAttribute('href')."<br/>"; } Now, can you do the part I am stuck in ?
  23. @kicken Phew! It seems you understood my code intention. >>but with DOMDocument you'd load the XML then check the nodeName to determine if it's a urlset or a sitemapindex.<< Can you be kind enough to show me how to code it to do this ? Talking about this particular part ... >>check the nodeName to determine if it's a urlset or a sitemapindex<< Then, I should be able to move forward from there on. Thanks!
  24. Hiya, This has been doing my head-in for days now! Some crawler codes on the internet exist where you get it to crawl to a webpage to extracts all html links. hrefs. Code such as this one which I found: A. hrefs Extractor - Extracts from html files <?php //1. //General Page Crawler. Not Xml Sitemap Crawler. //--- include_once('simplehtmldom_1_9_1/simple_html_dom.php'); //--- //FAILS //$url = "https://www.rocktherankings.com/post-sitemap.xml"; //$url = "https://bytenota.com/sitemap.xml"; //$url = "https://www.rocktherankings.com/sitemap_index.xml"; //WORKS $url = "https://www.rocktherankings.com/footer-links-seo/"; //WORKS $url = ""; $html = new simple_html_dom(); $html->load_file($url); //-- foreach($html->find("a") as $link) { echo $link->href."<br>"; } ?> And there are those that extract links from xml files. Like these two: 1. Extracts from Xml files //Sitemap Protocol: https://www.sitemaps.org/protocol.html include_once('simplehtmldom_1_9_1/simple_html_dom.php'); //WORKS. //$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml'; //$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files. //FAILS. Shows blank page. $sitemap = "https://bytenota.com/sitemap.xml"; $html = new simple_html_dom(); $html->load_file($sitemap); foreach($html->find("loc") as $link) { echo $link->innertext."<br>"; } 2 Extracts from Xml files //Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them. //Sitemap Protocol: https://www.sitemaps.org/protocol.html // sitemap url or sitemap file //FAILS. //$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files. 
//WORKS //$sitemap = "https://bytenota.com/sitemap.xml"; //$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml'; // get sitemap content $content = file_get_contents($sitemap); // parse the sitemap content to object $xml = simplexml_load_string($content); // retrieve properties from the sitemap object foreach ($xml->url as $urlElement) { // get properties $url = $urlElement->loc; $lastmod = $urlElement->lastmod; $changefreq = $urlElement->changefreq; $priority = $urlElement->priority; // print out the properties echo 'url: '. $url . '<br>'; echo 'lastmod: '. $lastmod . '<br>'; echo 'changefreq: '. $changefreq . '<br>'; echo 'priority: '. $priority . '<br>'; echo '<br>---<br>'; } But can you figure-out the issues I am having with these last 2 crawlers above ? If you try getting them to headover to an xml file (sitemap) that lists further xml links (sitemaps), one chokes. Do try it out yourself without taking my word for it. So, got no choice but to build my own crawler, where when I set it to navigate to an xml sitemap then initially it would check if the listed links on the navigated page are href links or further xml links to more xml files (more sitemaps). Good idea ? So what I did was, I first got my crawler to navigate to an xml file. Starting point page. And now I want to make it to extract all found links and check whether the found links are hrefs or further xml links. If the links are hrefs, then add them to the $extracted_urls array. Else add them to the $crawl_xml_files array. Now later on, the crawler can crawl those extracted href & xml links dumped on both arrays. Now, I am stuck on the part where, the code fails to echo the link extensions of the found links on the initially navigated page. It fails to extract any links to the respective arrays. Here is the code. Test it and see for yourself where I am going wrong. I am scratching my head. 
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them. //Sitemap Protocol: https://www.sitemaps.org/protocol.html //$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml'; //$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files. $sitemap = 'https://bytenota.com/sitemap.xml'; // get sitemap content //$sitemap = 'sitemap.xml'; // get sitemap content $content = file_get_contents($sitemap); // parse the sitemap content to object $xml = simplexml_load_string($content); //var_dump($xml); // Init arrays $crawl_xml_files = []; $extracted_urls = []; $extracted_last_mods = []; $extracted_changefreqs = []; $extracted_priorities = []; // retrieve properties from the sitemap object foreach ($xml->url as $urlElement) { // provide path of curren xml/html file $path = (string)$urlElement->loc; // get pathinfo $ext = pathinfo($path, PATHINFO_EXTENSION); echo 'The extension is: ' . $ext; echo '<br>'; //DELETE IN DEV MODE echo $urlElement; //DELETE IN DEV MODE if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages. { echo __LINE__; echo '<br>'; //DELETE IN DEV MODE //Add Xml Links to array. $crawl_xml_files[] = $path; } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps. { echo __LINE__; echo '<br>'; //DELETE IN DEV MODE //Add hrefs to array. //$extracted_urls[] = $path; // get properties $extracted_urls[] = $extracted_url = $urlElement->loc; //Add hrefs to array. $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod; //Add lastmod to array. 
$extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq; //Add changefreq to array. $extracted_priorities[] = $extracted_priority = $urlElement->priority; //Add priority to array. } } var_dump($crawl_xml_files); //Print all extracted Xml Links. var_dump($extracted_urls); //Print all extracted hrefs. var_dump($extracted_last_mods); //Print all extracted last mods. var_dump($extracted_changefreqs); //Print all extracted changefreqs. var_dump($extracted_priorities); //Print all extracted priorities. foreach($crawl_xml_files as $crawl_xml_file) { echo 'Xml File to crawl: ' .$crawl_xml_file; //Print all extracted Xml Links. } echo __LINE__; echo '<br>'; //DELETE IN DEV MODE foreach($extracted_urls as $extracted_url) { echo 'Extracted Url: ' .$extracted_url; //Print all extracted hrefs. } echo __LINE__; echo '<br>'; //DELETE IN DEV MODE foreach($extracted_last_mods as $extracted_last_mod) { echo 'Extracted last Mod: ' .$extracted_last_mod; //Print all extracted last mods. } echo __LINE__; echo '<br>'; //DELETE IN DEV MODE foreach($extracted_changefreqs as $extracted_changefreq) { echo 'Extracted Change Frequency: ' .$extracted_changefreq; //Print all extracted changefreqs. } echo __LINE__; echo '<br>'; //DELETE IN DEV MODE foreach($extracted_priorities as $extracted_priority) { echo 'Extracted Priority: ' .$extracted_priority; //Print all extracted priorities. } echo __LINE__; echo '<br>'; //DELETE IN DEV MODE Can someone be kind enough to fix this by shortening it as much as possible using procedural style programming and show me how you fixed it ? Thanks!
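One possible procedural rework of the classifier, kept short. Two guesses about the original underlie it: first, that a sitemap index produced no output because its entries are `<sitemap>` elements, so the `$xml->url` loop never ran; second, that page URLs without a file extension (for example ones ending in `/`) matched neither branch. The helper name `classify_sitemap` is hypothetical, and the XML is passed in as a string so the sketch runs without network access.

```php
<?php
// Hypothetical helper: sorts the <loc> values of one sitemap into two lists.
function classify_sitemap(string $content): array
{
    $result = ['xml' => [], 'pages' => []];

    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    if ($xml === false) {
        return $result; // not XML at all
    }

    // An index lists <sitemap> entries; a urlset lists <url> entries.
    $isIndex = $xml->getName() === 'sitemapindex';
    $entries = $isIndex ? $xml->sitemap : $xml->url;

    foreach ($entries as $entry) {
        $loc  = trim((string) $entry->loc);
        $path = (string) parse_url($loc, PHP_URL_PATH);

        // Anything in an index, or anything ending in .xml / .xml.gz, is a sitemap.
        $isXml = $isIndex || preg_match('/\.xml(\.gz)?$/i', $path) === 1;

        // Everything else counts as a page, with or without an extension.
        $result[$isXml ? 'xml' : 'pages'][] = $loc;
    }

    return $result;
}

$urlset = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>http://www.example.com/</loc></url>'
        . '<url><loc>http://www.example.com/catalog.php</loc></url>'
        . '</urlset>';

print_r(classify_sitemap($urlset));
```

To crawl deeper, each entry in the `xml` list could be fetched with `file_get_contents()` and passed back through the same function.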
  25. Php Devs, What is wrong with this code ? I see blank page. Where did I go wrong on these two ? Note where the $message variable is on both. I get undefined $message if I leave it at the bottom. So, I tried rearranging to the 2 codes you see below. 1. // Initiate ability to manipulate the DOM and load that baby up $doc = new DOMDocument(); libxml_use_internal_errors(true); // Because we are actually manipulating the DOM, DOMDocument will add complete <html><body> tags we need to strip out //$message = str_replace(array('<body>', '</body>'), '', $doc->saveHTML($doc->getElementsByTagName('body')->item(0))); $message = file_get_contents('https://www.rocktherankings.com/post-sitemap.xml'); $doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT); libxml_clear_errors(); // Fetch all <a> tags $links = $doc->getElementsByTagName('a'); // If <a> tags exist ... if ($links->length > 0) { // For each <a> tag ... foreach ($links AS $link) { $link->setAttribute('class', 'link-style'); } } 2. // Initiate ability to manipulate the DOM and load that baby up $doc = new DOMDocument(); // Because we are actually manipulating the DOM, DOMDocument will add complete <html><body> tags we need to strip out //$message = str_replace(array('<body>', '</body>'), '', $doc->saveHTML($doc->getElementsByTagName('body')->item(0))); $message = file_get_contents('https://www.rocktherankings.com/post-sitemap.xml'); //$message = str_replace(array('<body>', '</body>'), '', $doc->saveHTML($doc->getElementsByTagName('body')->item(0))); libxml_use_internal_errors(true); $doc->loadHTML($message, LIBXML_NOENT|LIBXML_COMPACT); libxml_clear_errors(); // Fetch all <a> tags $links = $doc->getElementsByTagName('a'); // If <a> tags exist ... if ($links->length > 0) { // For each <a> tag ... foreach ($links AS $link) { $link->setAttribute('class', 'link-style'); } }
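Two things seem likely to explain the blank page, though this is a guess from reading the code alone: neither version ever echoes anything, and the file being loaded is an XML sitemap, which has `<loc>` elements rather than `<a>` tags, so the loop has nothing to visit. A minimal sketch of the XML route, using an inline string in place of the fetched sitemap:

```php
<?php
// Inline stand-in for the result of file_get_contents() on a sitemap URL.
$message = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
         . '<url><loc>http://www.example.com/</loc></url>'
         . '<url><loc>http://www.example.com/about</loc></url>'
         . '</urlset>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$loaded = $doc->loadXML($message); // loadXML, since a sitemap is XML, not HTML
libxml_clear_errors();

$locs = [];
if ($loaded) {
    // Sitemaps list their URLs in <loc> elements.
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->textContent);
    }
}

// Without an echo (or print_r / var_dump) the page stays blank.
echo implode("<br>\n", $locs), "\n";
```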