davieboy Posted May 11, 2015

Hi there,

I'm trying to find out how to extract a URL from a tweet when someone tweets me. I can't see anything obvious online. Is anyone able to help or advise?

Thanks,
Dave

QuickOldCar Posted May 11, 2015

Do you have any code at all? Are you using their API, or trying to get it directly from your tweets page?

If the page is public, you can use cURL or file_get_contents() to obtain the raw HTML. If you need to log in securely, you need to use cURL and also send your name and password.

Once you have text holding the URL, you can use preg_match() or preg_match_all() with some regex patterns for URLs.

To scrape the page without using the API, simplehtmldom can work.

Here is an example using DOM and DOMXPath. It captures all links on the page and uses preg_match() to find the shortened ones used in tweets. You can expand on this or change it to your specific needs. It's possible to scrape just specific sections such as a div, span, class and so on, but I kept this simple, only looking for a tags within the body tag.

<?php
$target = "https://twitter.com/hashtag/GameofThrones";

// Fetch the raw HTML; @ suppresses the warning so the failure is handled below
$html = @file_get_contents($target);
if (!$html) {
    die("Unable to connect to that url");
}

$url_array   = array();
$short_links = array();

// Parse the page; @ hides warnings about malformed markup
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every a tag inside the body tag
$hrefs = $xpath->query('/html/body//a');

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');

    // Prefix relative links with the site address
    if (!preg_match("~(\/\/|twitter\.com|\/\/t\.co)~i", $link)) {
        $link = "https://twitter.com" . $link;
    }

    // Use the title attribute, then the link text, then the link itself
    $title = trim($href->getAttribute('title'));
    if ($title == '') {
        $title = trim($href->nodeValue);
    }
    if ($title == '') {
        $title = $link;
    }

    if ($link != '' && $title != '') {
        // all links
        $url_array[] = array(
            "href"  => $link,
            "title" => $title
        );

        // only shortened links in posts
        if (preg_match("~(\/\/t\.co)~i", $link)) {
            $short_links[] = array(
                "href"  => $link,
                "title" => $title
            );
        }
    }
}

// all links, with duplicates removed
if (!empty($url_array)) {
    $url_array = array_map("unserialize", array_unique(array_map("serialize", $url_array)));
    echo "<pre>";
    print_r($url_array);
    echo "</pre>";
}

// only shortened links, with duplicates removed
if (!empty($short_links)) {
    $short_links = array_map("unserialize", array_unique(array_map("serialize", $short_links)));
    echo "<pre>";
    print_r($short_links);
    echo "</pre>";
}
?>

Results from all links are too long to post here.

Results from shortened links:

Array
(
    [0] => Array ( [href] => http://t.co/DYNv3xCKZ2 [title] => http://ow.ly/MMeAJ )
    [1] => Array ( [href] => http://t.co/E5o56STr1S [title] => pic.twitter.com/E5o56STr1S )
    [2] => Array ( [href] => http://t.co/Ozfx1MUjkw [title] => http://huff.to/1E1mKzN )
    [3] => Array ( [href] => http://t.co/uNKOLtlvn3 [title] => pic.twitter.com/uNKOLtlvn3 )
    [4] => Array ( [href] => http://t.co/6J1fjdAEWO [title] => pic.twitter.com/6J1fjdAEWO )
    [5] => Array ( [href] => http://t.co/PvYozVNzlW [title] => http://hypb.st/yMVvB )
    [6] => Array ( [href] => http://t.co/Qob84GCHkK [title] => pic.twitter.com/Qob84GCHkK )
    [7] => Array ( [href] => http://t.co/oujTU8fi0G [title] => http://bit.ly/1HbKTv6 )
    [8] => Array ( [href] => http://t.co/1666f9Isdr [title] => pic.twitter.com/1666f9Isdr )
    [9] => Array ( [href] => http://t.co/z16DX5wpwH [title] => http://read.bi/1K0CM5w )
    [10] => Array ( [href] => http://t.co/nEdiO9RET9 [title] => pic.twitter.com/nEdiO9RET9 )
    [11] => Array ( [href] => http://t.co/4lab9UEohK [title] => http://tmto.es/LsEAu )
    [12] => Array ( [href] => http://t.co/tZoKeesROW [title] => http://go.ign.com/VpUdD4c )
    [13] => Array ( [href] => http://t.co/uhDQPVCyZa [title] => pic.twitter.com/uhDQPVCyZa )
    [14] => Array ( [href] => http://t.co/4UCXMAFU9z [title] => http://rol.st/1IsC6oZ )
    [15] => Array ( [href] => http://t.co/1w00G18Fl4 [title] => pic.twitter.com/1w00G18Fl4 )
    [16] => Array ( [href] => http://t.co/8H672kaNPY [title] => http://ow.ly/MMeP9 )
    [17] => Array ( [href] => http://t.co/hp9UwlY9VZ [title] => pic.twitter.com/hp9UwlY9VZ )
    [18] => Array ( [href] => http://t.co/iTotXTT5IV [title] => pic.twitter.com/iTotXTT5IV )
    [19] => Array ( [href] => http://t.co/FPaiw4b3TG [title] => pic.twitter.com/FPaiw4b3TG )
    [20] => Array ( [href] => http://t.co/my1VXGW2C7 [title] => http://thebea.st/1PB5oka )
    [21] => Array ( [href] => http://t.co/w5fQVOGj14 [title] => pic.twitter.com/w5fQVOGj14 )
    [22] => Array ( [href] => http://t.co/tCD3AWXR3N [title] => http://ANGRYGOTFAN.COM )
    [24] => Array ( [href] => http://t.co/xC5Lz63b9y [title] => pic.twitter.com/xC5Lz63b9y )
    [25] => Array ( [href] => http://t.co/I8p5uFjgF7 [title] => pic.twitter.com/I8p5uFjgF7 )
)

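For the simpler case mentioned in the post above, where you already have the tweet text in a string, a minimal sketch of the preg_match_all() approach might look like this. The function name extract_urls() and the sample tweet text are made up for illustration, and the pattern is deliberately loose (anything starting with http:// or https:// up to the next whitespace), so treat it as a starting point, not a complete URL parser.

```php
<?php
// Hypothetical helper (not from the thread): return every http(s) URL in a string.
function extract_urls(string $text): array
{
    // Match http:// or https:// followed by a run of non-whitespace characters.
    preg_match_all('~https?://\S+~i', $text, $matches);

    // Strip trailing punctuation that usually belongs to the sentence, not the URL.
    return array_map(
        function (string $url): string {
            return rtrim($url, '.,;:!?)"\'');
        },
        $matches[0]
    );
}

// Made-up tweet text for illustration only.
$tweet = 'Loved this write-up: http://t.co/DYNv3xCKZ2, also see https://example.com/page.';
print_r(extract_urls($tweet));
```

preg_match_all() returns every match, so the same helper works whether the text holds one link or several; if you only ever want the first, preg_match() with the same pattern is enough.
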
davieboy (Author) Posted May 12, 2015

Hi there,

I appreciate the code; it's something to look at for sure. I haven't got anything yet, I'm just trying to get an idea of how to do it.

Is it possible to get just the last URL tweeted? For example, if someone tweets my web address, could a script check that Twitter account (or a hashtag search), pick the latest one tweeted, and do something with it?

Dave

QuickOldCar Posted May 12, 2015

Use their API and rip the URL out of the response using preg_match().
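
As a rough sketch of that suggestion: assuming the API response has already been fetched and decoded into an array of tweets, each with a created_at timestamp and a text field, picking the most recently tweeted URL could look something like the code below. The function name latest_tweeted_url() and the sample JSON are made up for illustration, and the authenticated API request itself is left out entirely.

```php
<?php
// Hypothetical helper (not from the thread): given decoded tweets, each with
// 'created_at' and 'text' keys, return the first URL found in the most recent
// tweet that contains one, or null if no tweet has a URL.
function latest_tweeted_url(array $tweets): ?string
{
    // Sort newest first by parsing each timestamp.
    usort($tweets, function (array $a, array $b): int {
        return strtotime($b['created_at']) <=> strtotime($a['created_at']);
    });

    // Walk from newest to oldest and stop at the first tweet holding a URL.
    foreach ($tweets as $tweet) {
        if (preg_match('~https?://\S+~i', $tweet['text'], $m)) {
            return $m[0];
        }
    }
    return null;
}

// Made-up sample standing in for a decoded API response; a real script would
// obtain this from json_decode($responseBody, true) after calling the API.
$json = '[
  {"created_at": "Mon May 11 09:15:00 +0000 2015", "text": "Older link http://t.co/DYNv3xCKZ2"},
  {"created_at": "Tue May 12 18:30:00 +0000 2015", "text": "No link in this one"},
  {"created_at": "Tue May 12 08:00:00 +0000 2015", "text": "Check http://t.co/Ozfx1MUjkw out"}
]';
$tweets = json_decode($json, true);
echo latest_tweeted_url($tweets), "\n";
```

Sorting before searching means the newest tweet without a link is skipped and the next newest one with a link is used, which matches the "pick the latest one tweeted" requirement. Run it from cron (or any scheduler) and compare the result with the last URL you handled to decide whether there is anything new to act on.
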