
Extract URL from a tweet


davieboy


Do you have any code at all?

Are you using their API, or trying to get it directly from your tweets page?

 

If the page is public, you can use curl or file_get_contents() to obtain the raw HTML.
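
For the curl route, a minimal sketch might look like the following. The helper name fetch_html(), the timeout, and the user-agent string are my own assumptions for illustration, not anything the site requires. It returns null on any failure instead of dying.

```php
<?php
// Hypothetical helper: fetch a public page with curl.
// Returns the response body, or null if the request failed or was empty.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-fetcher)', // assumed value
    ]);

    $body = curl_exec($ch); // false on failure, string on success

    return (is_string($body) && $body !== '') ? $body : null;
}
```

Usage would be something like $html = fetch_html($target); followed by a null check before handing the result to a parser.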

If you need to log in securely, you need to use curl and also send your username and password.

 

Once you have the text holding the URL, you can use preg_match() or preg_match_all() with a regex pattern for URLs.
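
As a rough sketch of that regex step: the helper name extract_urls() and the sample tweet text below are made up for illustration, and the pattern is deliberately loose rather than a full URL validator.

```php
<?php
// Hypothetical helper: return every http(s) URL found in a piece of text.
function extract_urls(string $text): array
{
    // Loose pattern: the scheme, then everything up to whitespace, a quote or an angle bracket.
    preg_match_all('~https?://[^\s<>"\']+~i', $text, $matches);

    // Trim punctuation that often trails a URL at the end of a sentence.
    return array_map(function ($url) {
        return rtrim($url, '.,;:!?)');
    }, $matches[0]);
}

// Made-up sample text.
$tweet = 'New episode recap: http://t.co/DYNv3xCKZ2 and photos at http://t.co/E5o56STr1S.';
print_r(extract_urls($tweet));
```

preg_match() would return only the first match, so preg_match_all() is the one to use when a tweet may hold more than one link.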

 

Methods to scrape the page without using the API:

simplehtmldom can work

 

Here is an example using DOM and DOMXPath

 

This captures all links on the page and uses preg_match() to find the shortened ones used in tweets.

You can expand on this or change to your specific needs.

It's possible to scrape just specific sections, such as a particular div, span, or class, but I kept this simple and only look for a tags within the body tag.

<?php
$target = "https://twitter.com/hashtag/GameofThrones";
$html   = @file_get_contents($target);

if (!$html) {
    die("Unable to connect to that url");
}

$hrefs     = array();
$url_array = array();
$short_links = array();
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('/html/body//a');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
   
    if (!preg_match("~(\/\/|twitter\.com|\/\/t\.co)~i", $link)) {
        $link = "https://twitter.com" . $link;
    }
    $title = trim($href->getAttribute('title'));
   
    // Fall back to the link text when there is no title attribute.
    // (DOMElement has no "plaintext" property; that one belongs to simplehtmldom.)
    if ($title == '') {
        $title = trim($href->nodeValue);
    }
   
    if ($title == '') {
        $title = $link;
    }
   
    if ($link != '' && $title != '') {
        // all links
        $url_array[] = array(
            "href" => $link,
            "title" => $title
        );

        // only shortened links in posts
        if (preg_match("~(\/\/t\.co)~i", $link)) {
            $short_links[] = array(
                "href" => $link,
                "title" => $title
            );
        }
    }
}
$hrefs = array();


//all links
if (!empty($url_array)) {
    // de-duplicate the nested arrays by comparing their serialized form
    $url_array = array_map("unserialize", array_unique(array_map("serialize", $url_array)));
    echo "<pre>";
    print_r($url_array);
    echo "</pre>";
}

//only shortened links
if (!empty($short_links)) {
    $short_links = array_map("unserialize", array_unique(array_map("serialize", $short_links)));

    echo "<pre>";
    print_r($short_links);
    echo "</pre>";
}
?>
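
To scrape only a specific section instead of every a tag in the body, the XPath query can be narrowed. This is only a sketch: the markup is invented for the example, and the class name tweet-text is an assumption, so check the real page source for whatever class wraps the tweet text.

```php
<?php
// Made-up markup standing in for a fetched page; "tweet-text" is an assumed class name.
$html = '<html><body>
  <a href="/about">About</a>
  <p class="tweet-text first">Recap <a href="http://t.co/DYNv3xCKZ2" title="http://ow.ly/MMeAJ">ow.ly/MMeAJ</a></p>
  <p class="footer"><a href="http://t.co/ignored">elsewhere</a></p>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Only a tags inside a p whose class list contains "tweet-text",
// and only those whose href points at t.co.
$query = "//p[contains(concat(' ', normalize-space(@class), ' '), ' tweet-text ')]"
       . "//a[contains(@href, '//t.co/')]";

$links = array();
foreach ($xpath->query($query) as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

The concat/normalize-space part is the usual XPath 1.0 way to match one class among several, since a plain contains(@class, 'tweet-text') would also match a class like tweet-text-extra.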

The results for all links are too long to post here.

 

Results from shortened links:

Array
(
    [0] => Array
        (
            [href] => http://t.co/DYNv3xCKZ2
            [title] => http://ow.ly/MMeAJ
        )

    [1] => Array
        (
            [href] => http://t.co/E5o56STr1S
            [title] => pic.twitter.com/E5o56STr1S
        )

    [2] => Array
        (
            [href] => http://t.co/Ozfx1MUjkw
            [title] => http://huff.to/1E1mKzN
        )

    [3] => Array
        (
            [href] => http://t.co/uNKOLtlvn3
            [title] => pic.twitter.com/uNKOLtlvn3
        )

    [4] => Array
        (
            [href] => http://t.co/6J1fjdAEWO
            [title] => pic.twitter.com/6J1fjdAEWO
        )

    [5] => Array
        (
            [href] => http://t.co/PvYozVNzlW
            [title] => http://hypb.st/yMVvB
        )

    [6] => Array
        (
            [href] => http://t.co/Qob84GCHkK
            [title] => pic.twitter.com/Qob84GCHkK
        )

    [7] => Array
        (
            [href] => http://t.co/oujTU8fi0G
            [title] => http://bit.ly/1HbKTv6
        )

    [8] => Array
        (
            [href] => http://t.co/1666f9Isdr
            [title] => pic.twitter.com/1666f9Isdr
        )

    [9] => Array
        (
            [href] => http://t.co/z16DX5wpwH
            [title] => http://read.bi/1K0CM5w
        )

    [10] => Array
        (
            [href] => http://t.co/nEdiO9RET9
            [title] => pic.twitter.com/nEdiO9RET9
        )

    [11] => Array
        (
            [href] => http://t.co/4lab9UEohK
            [title] => http://tmto.es/LsEAu
        )

    [12] => Array
        (
            [href] => http://t.co/tZoKeesROW
            [title] => http://go.ign.com/VpUdD4c
        )

    [13] => Array
        (
            [href] => http://t.co/uhDQPVCyZa
            [title] => pic.twitter.com/uhDQPVCyZa
        )

    [14] => Array
        (
            [href] => http://t.co/4UCXMAFU9z
            [title] => http://rol.st/1IsC6oZ
        )

    [15] => Array
        (
            [href] => http://t.co/1w00G18Fl4
            [title] => pic.twitter.com/1w00G18Fl4
        )

    [16] => Array
        (
            [href] => http://t.co/8H672kaNPY
            [title] => http://ow.ly/MMeP9
        )

    [17] => Array
        (
            [href] => http://t.co/hp9UwlY9VZ
            [title] => pic.twitter.com/hp9UwlY9VZ
        )

    [18] => Array
        (
            [href] => http://t.co/iTotXTT5IV
            [title] => pic.twitter.com/iTotXTT5IV
        )

    [19] => Array
        (
            [href] => http://t.co/FPaiw4b3TG
            [title] => pic.twitter.com/FPaiw4b3TG
        )

    [20] => Array
        (
            [href] => http://t.co/my1VXGW2C7
            [title] => http://thebea.st/1PB5oka
        )

    [21] => Array
        (
            [href] => http://t.co/w5fQVOGj14
            [title] => pic.twitter.com/w5fQVOGj14
        )

    [22] => Array
        (
            [href] => http://t.co/tCD3AWXR3N
            [title] => http://ANGRYGOTFAN.COM
        )

    [24] => Array
        (
            [href] => http://t.co/xC5Lz63b9y
            [title] => pic.twitter.com/xC5Lz63b9y
        )

    [25] => Array
        (
            [href] => http://t.co/I8p5uFjgF7
            [title] => pic.twitter.com/I8p5uFjgF7
        )

)


Hi there,

I appreciate the code; it's something to look at for sure.

I haven't got anything yet, I'm just trying to get an idea of how to do it.

 

Is it possible to get just the last URL tweeted? For example, if someone tweets my web address, could a script check that Twitter account (or a hashtag search), pick the latest URL tweeted, and do something with it?

Dave
