Extract text data from an HTML table, between the <td></td> tags.

schapel · August 2, 2009

Here is the situation I'm in, and where I'm stuck. Maybe someone has some suggestions for me.

I have a script that visits a particular website's search engine and punches in a search term. The script already extracts the valid block of HTML that holds the raw results, which sits in a specific table on the page.

So the data the scraper script displays is everything between these <table> tags. What I need to do is grab the information from a specific column of the table, between the <td></td> tags. The catch is that I'm guessing most of you will suggest running a regex on the results to capture everything between <td></td>; however, I typically need ONLY the first or second column of data.

Is there some way to have a regex or some other function loop through my results and find each </tr> or </th> tag, which then triggers it to extract data only from the NEXT <td></td> cell? This is the only way I could think of to identify the location of a 'first column' cell, because there is no special style attribute or any other identifier in the <td> tag.

Any suggestions?

watsmyname · August 2, 2009

You should use a PHP DOM parser for this rather than regex. You can get a pre-built class with usage examples here:

http://simplehtmldom.sourceforge.net/

It's a very useful class.
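To illustrate the DOM idea, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of the class linked above. The sample table and the helper name extractColumn are made up for the example; the real markup from the scraper may need extra handling.

```php
<?php
// Hypothetical sample standing in for the already-extracted results table.
$tableHtml = <<<HTML
<table>
  <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
  <tr><td>Widget</td><td>9.99</td><td>12</td></tr>
  <tr><td>Gadget</td><td>19.50</td><td>3</td></tr>
</table>
HTML;

/**
 * Return the trimmed text of the Nth <td> (1-based) from every table row.
 * extractColumn is an illustrative helper, not part of any library.
 */
function extractColumn(string $html, int $column): array
{
    $doc = new DOMDocument();
    // Scraped markup is often invalid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = [];
    // td[N] picks the Nth <td> child of each <tr>; rows holding only <th> cells match nothing.
    foreach ($xpath->query("//tr/td[$column]") as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

$firstColumn  = extractColumn($tableHtml, 1);
$secondColumn = extractColumn($tableHtml, 2);
print_r($firstColumn);
```

Because the column is chosen by position inside each row, the <td> tags do not need a style attribute or any other identifier. With the Simple HTML DOM class, the rough equivalent should be a loop over find('tr') that reads find('td', 0)->plaintext from each row.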

schapel · August 2, 2009

Great link, by the way. Once you download them, the functions on that page are very easy to use. That is exactly what I needed, although I was worried about moving away from regex.

Thanks much.

watsmyname · August 2, 2009

Nice to know that it helped you, mate. Regex is the one thing programmers would like to stay away from.
