N-Bomb(Nerd) Posted April 30, 2009

Hello, I'm trying to index links on a website using a PHP script of mine, but I'm completely lost as to how to do this. I've never really used regex before, so I don't even know where to start.

The whole source of the website is already in a string, and I know that all of the links inside the source look like this:

    <a href="lyrics.php?id=4809">Artist Name</a>

The number after ?id= is always going to be random, as well as the "Artist Name".

    <a href="lyrics.php?id=*">*</a>

Above: * = Random/Unknown

The page has hundreds of links like this, and I'm trying to get them into an array to process into a database. Even if we're able to figure out how to extract all of these from the string, how can I keep the id and artist name together so I can add them to a database?
Adam Posted April 30, 2009

Untested...

    $source = __something__;

    preg_match_all('/<a href="lyrics\.php\?id=([\d]+)">([\w\s"\'-]+)<\/a>/i', $source, $link_matches);

    foreach ($link_matches as $key => $match)
    {
        $links[] = array(
            'id' => $match[1],
            'title' => $match[2],
        );
    }

By the way, that will find links with obscure names like "Artist's"_-_Name, though I doubt there are any!
nrg_alpha Posted April 30, 2009

In the first capture, you don't really need to surround the \d in a character class, as \d is itself a shorthand character class. The second capture could be simplified by using [^<]+ (as names don't have < in them, this will capture pretty much anything between > and <).

Also untested...

    preg_match_all('/<a href="lyrics\.php\?id=(\d+)">([^<]+)<\/a>/i', $source, $link_matches);
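To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the same string. The sample markup is an assumption made up for illustration; the "AC/DC" link is included deliberately because "/" is not in the class [\w\s"'-], so the stricter pattern should skip it while [^<]+ should still capture it.

```php
<?php
// Hypothetical sample markup, not taken from the real site.
$source = '<a href="lyrics.php?id=4809">Creed</a>'
        . '<a href="lyrics.php?id=77">AC/DC</a>';

// Pattern with the explicit whitelist of title characters.
$strict = '/<a href="lyrics\.php\?id=([\d]+)">([\w\s"\'-]+)<\/a>/i';
// Simplified pattern: \d without brackets, title is "anything except <".
$loose  = '/<a href="lyrics\.php\?id=(\d+)">([^<]+)<\/a>/i';

// preg_match_all returns the number of full-pattern matches.
$strictCount = preg_match_all($strict, $source, $strictMatches);
$looseCount  = preg_match_all($loose, $source, $looseMatches);

echo "strict matched $strictCount link(s), loose matched $looseCount link(s)\n";
print_r($looseMatches[2]); // the captured titles
```

Whether the looser class is the better choice depends on the site: it accepts any title text, which is convenient here but would also pick up stray entities or whitespace if the markup is untidy.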
N-Bomb(Nerd) Posted April 30, 2009

I know this may seem a bit noob, but I've never really dealt with arrays, and certainly not an array like this. How would I access the "$links" part of the array? I'm guessing you would have to do something like:

    foreach ($links as $xxx => $xxxx)
    {
        xxxxx;
        xxxxx;
        checkIfExists($id, $title);
    }

checkIfExists() is going to see if the id and title already exist in the database, and if they don't, it will add them. I've already finished checkIfExists(); I'm stuck at getting the appropriate id and title to the function. Any help?
nrg_alpha Posted April 30, 2009

Quote:

    How would I access the "$links" part of the array? I'm guessing you would have to do something like:

    foreach ($links as $xxx => $xxxx)
    {
        xxxxx;
        xxxxx;
        checkIfExists($id, $title);
    }

Inside the foreach loop, simply echo $xxxx? Example:

    $arr = array(1,2,3);
    foreach($arr as $val){
        echo $val . "<br />\n"; // this will echo out the values (1 2 and 3) respectively...
    }

If you want to display the keys as well as the values, you can do this:

    foreach($arr as $key => $val){
        echo $key . ' ' . $val . "<br />\n"; // this will echo the keys and their values....(0 1, 1 2, 2 3)
    }
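The examples above loop over a flat array, whereas each element of $links is itself an array with 'id' and 'title' keys. As a rough sketch of how the loop might look in that case, assuming $links has the shape the earlier snippet is meant to produce (the sample data here is made up, and checkIfExists() is the original poster's own function, so it is only shown as a commented-out call):

```php
<?php
// Hypothetical sample data in the intended shape: one small array per link.
$links = array(
    array('id' => '4809', 'title' => 'Creed'),
    array('id' => '2511', 'title' => 'Tupac'),
);

$lines = array();
foreach ($links as $link) {
    // $link is an array, so read its fields by key rather than echoing it directly.
    $id    = $link['id'];
    $title = $link['title'];

    $lines[] = $id . ' - ' . $title;

    // checkIfExists($id, $title); // the poster's function, not defined in this sketch
}

echo implode("\n", $lines) . "\n";
```

Echoing $link itself would print the word "Array", because PHP converts an array to that string when it is used in string context.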
N-Bomb(Nerd) Posted May 1, 2009

I'm trying to use the following to test things out before I add it to my script, and I'm getting this output:

    0 Array
    1 Array
    2 Array

Code:

    <?php
    $source = '<a href="lyrics.php?id=4809">Creed</a><a href="lyrics.php?id=2511">Tupac</a>';

    //preg_match_all('/<a href="lyrics\.php\?id=([\d]+)">([\w\s"\'-]+)<\/a>/i', $source, $link_matches);
    preg_match_all('/<a href="lyrics\.php\?id=(\d+)">([^<]+)<\/a>/i', $source, $link_matches);

    foreach ($link_matches as $key => $match)
    {
        $links[] = array(
            'id' => $match[1],
            'title' => $match[2],
        );
    }

    foreach($links as $key => $val){
        echo $key . ' ' . $val . "<br />\n";
    }
    ?>

I'm completely lost. I'm just trying to echo out the array in the format "ID - Title".
nrg_alpha Posted May 1, 2009

Element [0] of the array that preg_match_all fills holds the matches for the complete pattern; each capture (the parts of the pattern that are in parentheses) is stored in [1], [2], etc. So if you want to pair the captures together, this is one way you can do it:

    $source = '<a href="lyrics.php?id=4809">Creed</a><a href="lyrics.php?id=2511">Tupac</a>';
    preg_match_all('/<a href="lyrics\.php\?id=([\d]+)">([^<]+)<\/a>/i', $source, $link_matches);
    $link_matches = array_combine($link_matches[1], $link_matches[2]);
    foreach($link_matches as $key => $val){
        $links[] = array('id'=>$key, 'title'=>$val);
    }

Basically, I take the array $link_matches and merge it with itself (only merging elements 1 and 2). This does two things: it removes element 0 completely (which contains the complete pattern matches), and in essence makes it a one-dimensional array by using the values of element 1 as keys and the values of element 2 as those new keys' values (hope I was clear in explaining that). You can see for yourself what this new version of the $link_matches array looks like by doing:

    echo "<pre>".print_r($link_matches, true);

The snippet above then goes through this array and puts each key and value into id and title respectively, so if you wanted to output the first id, you would simply do:

    echo $links[0]['id'];

Does this make things clearer?
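As an alternative to array_combine, preg_match_all also accepts the PREG_SET_ORDER flag, which groups the results per match instead of per capture. The sketch below is one possible way to use it with the same test string. One thing worth knowing about the array_combine route: array keys must be unique, so if the same id appeared twice on the page, the later title would overwrite the earlier one. Grouping per match avoids that.

```php
<?php
$source = '<a href="lyrics.php?id=4809">Creed</a><a href="lyrics.php?id=2511">Tupac</a>';

// With PREG_SET_ORDER each element of $matches is one link:
// [0] is the full match, [1] the id capture, [2] the title capture.
preg_match_all(
    '/<a href="lyrics\.php\?id=(\d+)">([^<]+)<\/a>/i',
    $source,
    $matches,
    PREG_SET_ORDER
);

$links = array();
foreach ($matches as $m) {
    $links[] = array('id' => $m[1], 'title' => $m[2]);
}

// Print each link in the "ID - Title" format.
foreach ($links as $link) {
    echo $link['id'] . ' - ' . $link['title'] . "\n";
}
```

For the two-link test string this prints "4809 - Creed" and "2511 - Tupac", one per line. The same inner loop is where a call such as checkIfExists($link['id'], $link['title']) could go.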