Extract text data from an HTML table, between the <td></td> tags.


schapel

Here is the situation I'm in, and where I'm stuck. Maybe someone has some suggestions for me.

I have a script that visits a particular website's search engine and punches in a search term. The script already extracts the valid block of HTML data containing the raw results, which sits in a specific table on the page.

So, the data the scraper script displays is everything between these <table> tags. What I need to do is grab the information from a specific column of the table, between the <td></td> tags. The trick is that I'm guessing most of you will suggest running a regex on the results to capture everything between <td></td>; however, I typically need ONLY the first or second column of data.

Is there some way to have regex or some other function loop through my results and find the </tr> or </th> tag, which then triggers it to extract data only from the NEXT <td></td> cell? This is the only way I could think of to identify the location of a 'first column' cell, because there is no special style attribute or any other identifier in the <td> tag.

Any suggestions?

You've got to use the PHP DOM. You can get a pre-built class with usage examples from here:

http://simplehtmldom.sourceforge.net/

It's a very useful class.
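
If you would rather not download anything, here is a rough sketch of the same idea using PHP's built-in DOMDocument and DOMXPath instead of the class linked above. The sample table and the helper name extractColumn() are made up for illustration; swap in the table block your scraper already extracts. It assumes each data row holds plain <td> cells with no colspan, and that "first column" means the first <td> in each row.

```php
<?php
// Hypothetical sample standing in for the table block the scraper already extracted.
$tableHtml = <<<HTML
<table>
  <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
  <tr><td>Widget</td><td>4.99</td><td>12</td></tr>
  <tr><td>Gadget</td><td>9.50</td><td>3</td></tr>
</table>
HTML;

/**
 * Return the trimmed text of the Nth <td> (1-based) of every row.
 * extractColumn() is an illustrative helper name, not part of any library.
 */
function extractColumn(string $html, int $column): array
{
    $doc = new DOMDocument();
    // Scraped markup is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = [];
    // td[N] picks the Nth <td> child of each <tr>; header rows made only of <th> match nothing.
    foreach ($xpath->query("//tr/td[$column]") as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

print_r(extractColumn($tableHtml, 1)); // first column
print_r(extractColumn($tableHtml, 2)); // second column
```

Because the column is picked by position inside each row, the cells do not need a class, style, or any other identifier, and there is no need to hunt for the closing </tr> with a regex.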

Great link, by the way. Once you download them, the functions on that page are very easy to use. That is exactly what I needed, although I was worried about moving away from regex.

Thanks much.

Nice to know that it helped you, mate. Regex is the one thing programmers would like to stay away from. :)

Archived

This topic is now archived and is closed to further replies.
