Metal Wing
Posted May 13, 2011

Hello,

I am a PHP newbie and have mostly been playing around with PHP/MySQL in the CMS area (i.e. building my own CMS). Now I am trying to expand my knowledge beyond the basic PHP commands, and one subject in particular caught my attention: getting a page's source code, parsing bits out of it, and displaying them. This is more of a personal goal, to learn more about some of PHP's abilities, and especially the search parameter flag thingies D:

http://nexrem.com/scripts/get_source/

That is a webpage I made real quick; as you can see, it shows basic stuff. The code for it is:

<html>
<head>
<title>Content Site</title>
</head>
<body>
<p>This is intro text</p>
<a href="http://google.com" title="Search Engine">Google Link</a><br />
<a href="http://yahoo.com" title="">Yahoo Page</a><br />
<a href="http://www.phpfreaks.com" title="Awesome Site">PHP Freaks Help</a><br />
Bottom of file
</body>
</html>

Now, I have a PHP file called get_link.php >> http://nexrem.com/scripts/get_source/get_links.php

<?php
$url = 'http://nexrem.com/scripts/get_source/';
$needle = 'google';
$contents = file_get_contents($url);

if (strpos($contents, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}

// The \\2 is an example of backreferencing. This tells pcre that
// it must match the second set of parentheses in the regular expression
// itself, which would be the ([\w]+) in this case. The extra backslash is
// required because the string is in double quotes.
$html = $contents;
echo $contents;

preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    echo "matched: " . $val[0] . "\n";
    echo "part 1: " . $val[1] . "\n";
    echo "part 2: " . $val[2] . "\n";
    echo "part 3: " . $val[3] . "\n";
    echo "part 4: " . $val[4] . "\n\n";
}
?>

From what I understand, file_get_contents gets what the user sees? Or does it get the source code, and I just can't output it as such because my browser renders it?

Question 1: Is it possible to get the HTML code of the page, rather than what the HTML renders to? How?

Question 2: I can just right click > view source and paste that result into a text file. I think I know how to search for a specific string, but how would I do it repeatedly, along these lines:
- Search for the text between <a href=" and " so I get the raw link.
- Add the results to an array.
- Use foreach to output all the links from the array.

Any help or hints are appreciated! Thank you!

P.S. Quite frankly, I have no idea what "/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/" is... Can someone refer me to a page where I can learn about those expressions?
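On Question 1, a minimal sketch of one way to show the markup as text instead of letting the browser render it. It assumes a hardcoded sample string in place of the live file_get_contents($url) call so that it runs without network access; the variable name $escaped is only illustrative.

```php
<?php
// Sample markup standing in for the result of file_get_contents($url).
// Hardcoded here (an assumption) so the sketch runs without a network call.
$contents = '<p>This is intro text</p><a href="http://google.com" title="Search Engine">Google Link</a><br />';

// htmlspecialchars() converts <, >, & and quotes into HTML entities,
// so the browser prints the tags as text instead of rendering them.
$escaped = htmlspecialchars($contents, ENT_QUOTES, 'UTF-8');

// <pre> keeps line breaks and spacing of the fetched source visible.
echo '<pre>' . $escaped . '</pre>';
```

The string returned by file_get_contents() is already the HTML source; only the way it is echoed into the page decides whether the browser shows tags or renders them.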
btherl
Posted May 13, 2011

1. file_get_contents() gets you the HTML code of the page. You can call that "HTML source code". You won't get PHP source code this way, though.

2. preg_match_all() can probably do it. I would start with a tutorial on regexp, as the one you have there is quite complex, though it's built up of simple parts. E.g.:
- [^>]* means "0 or more characters which are not >".
- .* means "0 or more of any character", often used to tell preg to ignore some characters you don't want.
- [\w]+ means "1 or more word characters", where a word character is any letter, digit, or the underscore character.

The manual is here: http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
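The pattern pieces described above can be combined into a much smaller expression for Question 2. This is a rough sketch, not the script the original poster ended up with: it assumes a hardcoded copy of the sample page instead of a live fetch, and it assumes every href is written with double quotes. The names $pattern and $links are illustrative.

```php
<?php
// Sample markup based on the page in the first post; in the real script
// this would come from file_get_contents($url).
$html = <<<'HTML'
<p>This is intro text</p>
<a href="http://google.com" title="Search Engine">Google Link</a><br />
<a href="http://yahoo.com" title="">Yahoo Page</a><br />
<a href="http://www.phpfreaks.com" title="Awesome Site">PHP Freaks Help</a><br />
HTML;

// <a\s      literal "<a" followed by one whitespace character
// [^>]*     0 or more characters that are not ">", skipping other attributes
// href="    literal start of the href attribute
// ([^"]*)   capture group 1: everything up to the next double quote
// The trailing "i" flag makes the match case-insensitive.
$pattern = '/<a\s[^>]*href="([^"]*)"/i';

$links = [];
if (preg_match_all($pattern, $html, $matches)) {
    // With the default PREG_PATTERN_ORDER flag, $matches[1] holds
    // every capture of group 1, i.e. the raw link targets.
    $links = $matches[1];
}

// Output each link on its own line.
foreach ($links as $link) {
    echo $link . "\n";
}
```

A regular expression like this is fine for a small page you control. For arbitrary pages, PHP's built-in DOMDocument class is usually a more robust way to collect the href attributes, since it copes with single quotes, attribute order, and messy markup.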
Metal Wing (Author)
Posted May 13, 2011

Splendid! Thanks btherl! I got it now, and actually got my script working!