Extracting text from website


Grodo

I am looking for the easiest way to fetch a webpage and extract the text from it. I realize that I could use file_get_contents() and then run preg_replace() on all the HTML tags. Is there an easier function that just returns the plain text, or is preg_replace() the only way of doing it? What I am trying to do is open the FedEx website, look up a tracking number, and read the text of the resulting page into a string. Once I have the text, I will search through it and extract the status, tracking number, and delivery date.

The problem with this code is that it returns the raw HTML:

<?php
if($getcon = file_get_contents("http://www.fedex.com/Tracking?ascend_header=1&clienttype=dotcom&cntry_code=us&language=english&tracknumbers=222222222222222"))
   {
      echo $getcon;
   } 
else {
   echo "Error: Could not connect to page...";
}
?> 


Thanks for the quick reply, but that is not the solution... You were on the right track for what I want to do, though. strip_tags() had no effect on the variable. I also tried assigning the result to a variable with $page = strip_tags($getcon); and echoing $page, but the output was the same as with the bare strip_tags($getcon) call.

My current code is:

<?php
if($getcon = file_get_contents("http://www.fedex.com/Tracking?ascend_header=1&clienttype=dotcom&cntry_code=us&language=english&tracknumbers=222222222222222"))
   {
      strip_tags($getcon);
   } 
else {
   echo "Error: Could not connect to page...";
}
?> 
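
One thing worth checking in the snippet above: strip_tags() returns the stripped string and does not modify its argument, so a bare strip_tags($getcon); call has no visible effect on $getcon. A minimal sketch of the difference, using a made-up sample string rather than the actual FedEx markup:

```php
<?php
// Hypothetical sample markup standing in for a fetched page
$html = "<p>Status: <b>Delivered</b></p>";

// Calling strip_tags() without using its return value changes nothing
strip_tags($html);
echo $html, "\n";   // still contains the tags

// Capture the return value to get the tag-free text
$text = strip_tags($html);
echo $text, "\n";   // Status: Delivered
```

This may not fully explain the result described in the post, since assigning the return value to $page was also tried there.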


OK, I just had a quick look at the source of the page you are trying to extract data from. I assume you are trying to get the table that lists the status? In the source it is conveniently surrounded by '<!-- BEGIN Scan Activity -->' and '<!-- END Scan Activity -->'. You can use strpos() to find the string index of each marker, and then take just the data in between the two.


and here we go...

 

<?php
// http://www.phpfreaks.com/forums/index.php/topic,186985.0.html

if($getcon = file_get_contents("http://www.fedex.com/Tracking?ascend_header=1&clienttype=dotcom&cntry_code=us&language=english&tracknumbers=222222222222222")) {
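  // Find the start marker and drop everything before it
  $one = strpos($getcon, "<!-- BEGIN Scan Activity -->");
  $final = substr($getcon, $one);
  // Find the end marker in what remains and keep only the part before it
  $two = strpos($final, "<!-- END Scan Activity -->");
  $final = substr($final, 0, $two);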
  
  echo $final;
} 
else {
  echo "Error: Could not connect to page...";
}

?>
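
The same idea can be wrapped in a small function that also handles the case where a marker is missing, since strpos() returns false when nothing is found. This is only a sketch: extract_between() is a hypothetical helper name, the sample page is made up, and the nullable return type assumes PHP 7.1 or later. Unlike the code above, it leaves the start marker out of the result.

```php
<?php
// extract_between() is a hypothetical helper, not a built-in PHP function.
// Returns the text between $start and $end, or null if either marker is missing.
function extract_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);              // skip past the start marker itself
    $to = strpos($haystack, $end, $from); // search only after the start marker
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

// Made-up sample page, not the real FedEx source
$page = "header<!-- BEGIN Scan Activity -->Delivered<!-- END Scan Activity -->footer";
$activity = extract_between($page, "<!-- BEGIN Scan Activity -->", "<!-- END Scan Activity -->");
echo $activity === null ? "Markers not found" : $activity;
```

The strict === false comparison matters because a marker at the very start of the string gives position 0, which a loose comparison would treat as not found.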


THANKS for the awesome reply Alecks! I didn't even know that those functions existed. Now correct me if I am wrong...

 

strpos sets the position in the string to read from.

substr removes everything up to the BEGIN Scan Activity line.
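
For reference, a small sketch of what each call returns, using a made-up string in place of the page source. strpos() only reports the index where a match begins, and substr() returns a new piece of the string from a given offset, optionally limited to a given length; neither function changes the original string.

```php
<?php
$text = "aaa START bbb END ccc";   // made-up sample string

$pos = strpos($text, "START");     // 4: index where "START" begins
$tail = substr($text, $pos);       // "START bbb END ccc"
$end = strpos($tail, "END");       // 10: index of "END" within $tail
$middle = substr($tail, 0, $end);  // "START bbb "

echo $middle, "\n";
```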

 

The example you coded for me works perfectly! Now it's time for phase two, which I don't think will be too difficult: searching through the string... I'll attempt this in a few hours and post my results :)
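
One possible way to approach the searching step is preg_match() with a capture group for each field. The text and the patterns below are invented for illustration; the labels and layout on the real page will differ, so the patterns would need to be adapted to whatever the extracted text actually looks like.

```php
<?php
// Made-up text; the wording on the real page will differ
$text = "Tracking number: 222222222222222 Status: Delivered";

$tracking = null;
$status = null;

// Capture the run of digits that follows the (assumed) label
if (preg_match('/Tracking number:\s*(\d+)/', $text, $m)) {
    $tracking = $m[1];
}
// Capture the single word that follows the (assumed) label
if (preg_match('/Status:\s*(\w+)/', $text, $m)) {
    $status = $m[1];
}

echo "Tracking: $tracking, status: $status\n";
```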
