
Grabbing multiple pieces of information from some html markup


Wuhtzu


Hey

 

I'm trying to grab several pieces of information from my router's system status page and I can't figure out the smartest way to match and store the information. The page is displayed below along with its markup.

 

router_system_status.png

 

<html><head>
  <meta http-equiv='content-type' content='text/html;charset=iso-8859-1'>
<title>Web Configurator</title>
<SCRIPT src="General.js"></SCRIPT>
</head>
<body marginwidth="0" marginheight="0">
<table border="0">
  <tr> 
    <td width="500" colspan="3"> </td></tr><tr> 
    <td colspan="3" class="header2"> System up Time:<b>     0:01:19</b> </td></tr><tr> 
    <td colspan="3">CPU Load:<b>  0.95%</b></td></tr><tr> 
    <td colspan="3"> </td></tr><tr> 
    <td colspan="3" class="header2"> WAN Port Statistics:</td></tr><tr> 
    <td colspan="3"> Link Status:<b> Up			 </b> </td></tr><tr> 
    <td colspan="3">Upstream Speed:<b>   764 kbps</b></td></tr><tr> 
    <td colspan="3">Downstream Speed:<b>  8059 kbps</b> </td></tr><tr> 
    <td colspan="3"> 
      <table border="1" cellspacing="0" cellpadding="1" align=left>

        <tr> 
          <td class="TableTilte"> 
            <div align=center> Node-Link</div></td><td class="TableTilte"> 
            <div align=center> Status</div></td><td class="TableTilte"> 
            <div align=center> TxPkts</div></td><td class="TableTilte"> 
            <div align=center> RxPkts</div></td><td class="TableTilte"> 
            <div align=center> Errors</div></td><td class="TableTilte"> 
            <div align=center> Tx B/s</div></td><td class="TableTilte"> 
            <div align=center> Rx B/s</div></td><td class="TableTilte"> 
            <div align=center> Up Time</div></td></tr><tr> 
          <td class="TableItem"> 
            <div align=center>  1-PPPoA</div></td><td><div align=center> Up    </div></td><td><div align=center> 148</div></td><td><div align=center> 156</div></td><td><div align=center> 0</div></td><td><div align=center> 342</div></td><td><div align=center> 122854</div></td><td><div align=center>     0:00:19</div></td></tr></table></td></tr><tr> 
    <td colspan="3"> </td></tr>  <tr> 
    <td colspan="3">LAN Port Statistics:</td></tr><tr> 
    <td colspan="3"> 
      <table border="1" cellspacing="0" cellpadding="1" align=left>

        <tr> 
          <td class="TableTilte"> 
            <div align=center> Interface:</div></td><td class="TableTilte"> 
            <div align=center> Status</div></td><td class="TableTilte"> 
            <div align=center> TxPkts</div></td><td class="TableTilte"> 
            <div align=center> RxPkts</div></td><td class="TableTilte"> 
            <div align=center> Collisions</div></td></tr><tr> 
          <td class="TableItem"> 
            <div align=center> Ethernet</div></td><td><div align=center>100M/Full Duplex</div></td><td><div align=center>429</div></td><td><div align=center> 451</div></td><td><div align=center> 0</div></td></tr>                </table></td></tr><tr> 
    <td colspan="3">  </td></tr></table>

</body></html>

 

I need to get the "value" of the following entries:

 

System up Time

Link Status

Upstream Speed

Downstream Speed

Status

Up Time

 

I have all the above markup stored in a variable (obtained through curl) and now I need to extract the information from it and in the end store it in an array like this:

 

Array
(
    [system_up_time] => 24:02:30
    [link_status] => up
    [upstream_speed] => 764
    [etc] => etc
)

 

I have no problem writing a regular expression which matches each piece of information separately, but that way I end up with a huge number of preg_match() calls. Is that the way to do it, matching each "piece of information" with its own regex and a "dedicated" preg_match() call? Or is there a smarter way?

 

I'm not looking for anyone to write the script, just ideas on how to structure the script / the information extraction. Any input will be much appreciated.

 

Wuhtzu


If your markup stays static in its structure, certain items will always follow others.

So an easy way would be to chain them in a single pattern: .*?firstitem.*?seconditem

To make sure the . does not eat past the item you want, you will have to make each item's pattern very specific and capture only the part you need as a subgroup.

 

For example, with the haystack:

hellolobye

you want to match the last 'lo', the one right before 'bye'.

Instead of .*?lo (which stops at the first 'lo') you could say:

.*?(lo)bye        //because you know 'bye' will always come right after the last 'lo' in the static structure
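
To see the difference in practice, here is a small sketch using the 'hellolobye' haystack from above. The variable names are only for illustration, and PREG_OFFSET_CAPTURE is passed so the position of each captured 'lo' is visible.

```php
<?php
$haystack = 'hellolobye';

// Lazy match with nothing after it: stops at the first 'lo'.
preg_match('/.*?(lo)/', $haystack, $first, PREG_OFFSET_CAPTURE);

// Anchored on 'bye': the only 'lo' directly followed by 'bye' is the last one.
preg_match('/.*?(lo)bye/', $haystack, $last, PREG_OFFSET_CAPTURE);

echo "unanchored 'lo' at offset ", $first[1][1], "\n"; // 3
echo "anchored 'lo' at offset ", $last[1][1], "\n";    // 5
```

The same idea applies to the router page: whatever fixed text surrounds a value can be used to pin the lazy wildcard to the right occurrence.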


Thank you for your excellent solution, dsaba - your method works like a charm and is neater than 10 calls to preg_match().

 

<?php
// Sub-patterns keyed by field name, in the order the values appear on the page
$regex_array = array(
    'system_up_time'   => '([0-9]+:[0-9]{2}:[0-9]{2})',
    'cpu_usage'        => '([0-9]{1,3}\.[0-9]+)',
    'link_status'      => '(Down|Initializing|Up)',
    'upstream_speed'   => '([0-9]+) kbps',
    'downstream_speed' => '([0-9]+) kbps',
    'status'           => '<td><div align=center> (N\/A|Idle|LCP Up|Up)',
    'txpkts'           => '([0-9]+)',
    'rxpkts'           => '([0-9]+)',
    'errors'           => '([0-9]+)',
    'txbs'             => '([0-9]+)',
    'rxbs'             => '([0-9]+)',
    'up_time'          => '([0-9]+:[0-9]{2,}:[0-9]{2,})',
    'lan_txpkts'       => '<div align=center>([0-9]+)<',
    'lan_rxpkts'       => '([0-9]+)',
    'lan_collisions'   => '([0-9]+)'
);

// Array containing all the field names (one per piece of data)
$fields_array = array_keys($regex_array);

// Construct the regular expression without delimiters.
// $regex is initialised first so the .= below does not raise an undefined-variable notice.
$regex = '';
foreach ($regex_array as $key => $subpattern) {
    $regex .= '.*?' . $subpattern;
}

// Extract information from $sysstatistics_adsl (the markup obtained through curl)
preg_match('/' . $regex . '/s', $sysstatistics_adsl, $tmp_matches);
?>
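
The script stops at preg_match(), so the captured values still sit under numeric keys. One possible way to pair them with the field names is array_combine(). The sketch below assumes a shortened, three-field sample of the page; $html, $patterns and $status are illustrative names, not part of the script above.

```php
<?php
// Shortened sample of the status page (only three of the fields).
$html = '<td> System up Time:<b>     0:01:19</b> </td></tr><tr>'
      . '<td>CPU Load:<b>  0.95%</b></td></tr><tr>'
      . '<td> Link Status:<b> Up </b> </td>';

$patterns = array(
    'system_up_time' => '([0-9]+:[0-9]{2}:[0-9]{2})',
    'cpu_usage'      => '([0-9]{1,3}\.[0-9]+)',
    'link_status'    => '(Down|Initializing|Up)',
);

// Chain the sub-patterns with lazy wildcards, as in the script above.
$regex = '';
foreach ($patterns as $subpattern) {
    $regex .= '.*?' . $subpattern;
}

$status = array();
if (preg_match('/' . $regex . '/s', $html, $matches)) {
    array_shift($matches); // drop the full match, keep only the captured groups
    $status = array_combine(array_keys($patterns), $matches);
}

print_r($status);
```

This only lines up if every sub-pattern has exactly one capturing group, which is the case for the patterns in this thread.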

 

 


Instead of getting numeric keys back from preg_match_all(), you can give each matched subgroup a custom key and incorporate this into your script, e.g.:

(?P<customKey>subgroup regex)

 

See this post:

http://www.phpfreaks.com/forums/index.php/topic,185238.msg829648.html#msg829648

 

 

This way you can get your

Array
(
    [system_up_time] => 24:02:30
    [link_status] => up
    [upstream_speed] => 764
    [etc] => etc
)

directly from the matches array in preg_match_all() (each named group also shows up under its numeric key).
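
A minimal sketch of the named-group idea, assuming a shortened two-field sample of the page. It uses preg_match() rather than preg_match_all() because the page yields a single set of values; the (?P<name>...) syntax works the same in both. $html and $speeds are illustrative names.

```php
<?php
// Shortened sample of the status page markup.
$html = '<td>Upstream Speed:<b>   764 kbps</b></td></tr><tr>'
      . '<td>Downstream Speed:<b>  8059 kbps</b> </td>';

// Each capture group carries its own key via (?P<name>...).
$regex = '/.*?(?P<upstream_speed>[0-9]+) kbps'
       . '.*?(?P<downstream_speed>[0-9]+) kbps/s';

$speeds = array();
if (preg_match($regex, $html, $matches)) {
    // Every named group is returned under its name and its number,
    // so keep only the string keys.
    foreach ($matches as $key => $value) {
        if (is_string($key)) {
            $speeds[$key] = $value;
        }
    }
}

print_r($speeds);
```

Group names cannot contain spaces, which is why the key is written upstream_speed rather than "upstream speed".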
