
extracting data from html table


willyeb70


Dear all,
I have a problem for which I have not been able to find a solution. I will explain it briefly.

I would like to populate a database with the data contained in a table inside an HTML file and, if possible,
repeat this for every HTML file. I have about 2000 files to process.

I searched extensively online and found some solutions based on regular expressions and others based on a
DOM parser extension, but neither worked properly.

My situation is a little complex because each HTML file contains, besides the table, other
information and other HTML tags that I do not need and have to strip out. On top of that,
the table structure is not the same in all files. I have at least 7-8 kinds of tables,
and none of them uses header tags (<TH>). A sample structure is this:

<table>
<tr>
<td>TABLE 1</td>
</tr>
<tr>
<td>Column1</td>
<td>Column2</td>
<td>Column3</td>
<td>Column4</td>
<td>Column5</td>
<td>Column6</td>
<td>Column7</td>
</tr>
<tr>
<td>1</td>
<td>USER 1</td>
<td>M</td>
<td>ROME</td>
<td>RM</td>
<td>11111111</td>
<td>22222222</td>
</tr>
........
</table>

That's just an example: in some files there are not 7 columns but a different number,
with different names.

Do you think this is feasible with PHP, or with other tools that can extract the data
and load it into an SQL table?

My little project is not commercial; it is non-profit and for study purposes only.

Thank you all for your attention.
Greetings
Willy

Well, it depends. If you can come up with specific rules for how the tables should be processed, then yes. These types of problems should first be analyzed without any thought to how they would be coded. Start by writing the instructions you would give a person to process the data by hand. If you can do that, THEN proceed to writing code that follows those instructions.

 

Looking at the example above, I can *guess* at some possible rules. For example, the first table row (TR) contains the name of the table. Or does that only apply when there is only one TD in the row? The second row contains the headers for the table. Rows three to the end contain the data associated with those headers. If those are accurate rules, then it is a simple task to read the data and correlate it with the header names. I could write some sample code, but I'm not going to do that based on a guess at what the rules should be.
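
For illustration only: if the guessed rules above did turn out to hold, the extraction could look roughly like the Python sketch below. Python is used here simply as one of the "other tools" the question allows for. The names `TableRows` and `parse_table` are made up for this example, and the sketch assumes well-formed closing tags, a single table per file, and no nested tables. None of that is confirmed by the original post.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _close_cell(self):
        # Store the open cell (if any) with its whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <Td> and <TD> arrive as "td".
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag == "td":
            self._close_cell()
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        # Text outside a cell (other page content) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def parse_table(html):
    """Apply the guessed rules: optional one-cell title row, then headers, then data."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    title = None
    if rows and len(rows[0]) == 1:  # assumed rule: a lone cell is the table name
        title = rows.pop(0)[0]
    if not rows:
        return title, [], []
    headers, data = rows[0], rows[1:]
    records = [dict(zip(headers, row)) for row in data]
    return title, headers, records
```

Calling `parse_table(text)` on the contents of one file returns the table name, the header list, and one dictionary per data row keyed by header. Because the headers are read from each file rather than hard-coded, the same function copes with a varying number of columns, as long as the row layout follows the assumed rules.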

 

Now, assuming you can define the rules for getting the data, storing it in the database is another matter. Since the HTML tables have different lengths and different fields, I have no way of knowing how the data should be stored. I would need some idea of how the data is to be used in order to make an intelligent decision. Do the tables of data have any relationship to one another?
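
Until those questions are answered, one schema-agnostic fallback is to store each value as a (table, row, column name, value) entry, so that files with different columns fit the same two SQL tables. The sketch below uses Python's built-in `sqlite3` module. The table names `source_table` and `cell` and the functions `create_schema` and `store_records` are hypothetical, and this layout is only a placeholder under the assumption that nothing is yet known about how the data relate. A proper design would depend on how the data will be used.

```python
import sqlite3


def create_schema(conn):
    # One row per HTML table found, plus one row per individual cell value.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS source_table (
            id        INTEGER PRIMARY KEY,
            file_name TEXT NOT NULL,
            title     TEXT
        );
        CREATE TABLE IF NOT EXISTS cell (
            table_id    INTEGER NOT NULL REFERENCES source_table(id),
            row_no      INTEGER NOT NULL,
            column_name TEXT NOT NULL,
            value       TEXT
        );
    """)


def store_records(conn, file_name, title, records):
    """Store a list of {column name: value} dicts extracted from one file."""
    cur = conn.execute(
        "INSERT INTO source_table (file_name, title) VALUES (?, ?)",
        (file_name, title),
    )
    table_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cell (table_id, row_no, column_name, value) "
        "VALUES (?, ?, ?, ?)",
        [
            (table_id, row_no, name, value)
            for row_no, record in enumerate(records, start=1)
            for name, value in record.items()
        ],
    )
    conn.commit()
    return table_id
```

This keeps every value as text and makes per-column queries more awkward than a conventional table would, which is the price of not knowing the columns in advance.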

