
New to PHP and MySQL.  Not sure I'm even posting this in the right area, but here goes. :confused:

 

I have HTML and text files with data in them, not delimited in the normal way at all.  Most of the text is in paragraph form, but all the files contain the same data.  For example, a page might look like this:

 

Item1 text

 

Item2

Text

Text

Text…

 

Item3 text

 

Item 4

 

Text

Text

 

Text

Text

Text

 

Item# = a name like Year/item number/description... etc.

 

There are about 25 items (or fields), and each varies in length and paragraph style.  For instance, Item 4 in the example might have just one word, or it might have 7 paragraphs.  This would be easy if I only had two dozen files... but I have upwards of 100,000+ files, most of them .html on a CD. :-\

 

Oh, one more thing... many of the 'Item titles' are followed by a colon (e.g. "description:"), but not all item names have it.

 

I'm not very DB literate, but I am IT/PC literate.  I really need to find a quick and hopefully semi-automated way to import/convert this information in batches.  Even if I could just get it into Excel or Access, I could get it into PHP/MySQL from there myself.
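To make the layout concrete, here is a rough sketch of the kind of split I'm imagining, in PHP.  The field titles (Year, Description, Photo) are just made-up placeholders for the real ~25 item names; the point is that the colon after a title is optional, which a regex handles easily:

```php
<?php
// Sketch: split one file's text into fields by the known item titles.
// The titles used below are hypothetical stand-ins; swap in the real
// ~25 field names, in the order they appear in the files.
function parseRecord($text, $titles)
{
    $parts = array();
    foreach ($titles as $t) {
        $parts[] = preg_quote($t, '/');
    }
    // Split wherever a title starts a line.  \b stops "Year" from
    // matching inside "Yearly"; ":?" makes the trailing colon optional,
    // since some files have "Description:" and some just "Description".
    $chunks = preg_split(
        '/^(' . implode('|', $parts) . ')\b:?[ \t]*/mi',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    // $chunks now alternates: title, body, title, body, ...
    $fields = array();
    for ($i = 0; $i + 1 < count($chunks); $i += 2) {
        $fields[strtolower($chunks[$i])] = trim($chunks[$i + 1]);
    }
    return $fields;
}
```

One caveat: if a field title happens to start a line inside another field's body text, this will split there too, so the real title list may need tightening.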

 

Don't know if it matters, but one of the fields has a photo; I just need the name/link, not the photo itself.

 

Please let me know if you have any ideas or need more information.

 

Thank you!

JConey

 

https://forums.phpfreaks.com/topic/204650-importing-or-converting-problem/

BTW:  Not sure if it matters for this particular question, but...

 

My Host is running:

MySQL version 5.0.90-community-log

Apache version 2.0.63

PHP version 5.2.9

 

I also use MS Excel/Access 2007 (or earlier), PHPMagic Pro & Plus, and the Adobe CS3 Master Suite...

 

Thanks Again!

JConey

Basically, I have a lot of .html and .txt files with data in them, and I need to extract that data.  If I can get it into an Excel spreadsheet or MS Access, I can get it into MySQL from there.  A closer estimate is about 152,400 files.  I'd love to find a batch method, or some automated or semi-automated way, to extract this data into a usable format (spreadsheet, MS Access table, MySQL table...).

 

The first post explains the file contents.  I'm pretty good at importing data when there is a consistent delimiter.  As I see it, I have two issues here:

 

#1 How do I handle so many files without repeating a set of procedures 152K times?

 

#2 The lack of a consistent delimiter.
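For issue #1 at least, one pass over the whole CD seems feasible in principle.  A minimal PHP sketch, assuming the files live under some root directory (the path here is hypothetical), that gathers every .html/.htm/.txt file in one sweep:

```php
<?php
// Sketch: collect every .html/.htm/.txt file under a directory tree in
// one pass, so nothing has to be repeated by hand 152K times.
function collectDataFiles($root)
{
    $found = array();
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root)
    );
    foreach ($it as $file) {
        // isFile() skips directories (including the '.' and '..' entries)
        if ($file->isFile()
            && preg_match('/\.(html?|txt)$/i', $file->getFilename())) {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}

// Each path can then be read and stripped of markup before parsing:
//   $text = strip_tags(file_get_contents($path));
```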

 

The files all have the same type of information, in the same order, though.  I'm sure someone in cyberspace has run into this before; I just hope the solution is not in the realm of theoretical physics.

 

 

Thanks,

Jeff

 

 

Sorry, I didn't realize at which stage you were stuck.

 

The easiest answer is multi-pass.  Write a short script that iterates through all the files in a directory (for example) and then just does a basic, dumb import based on file type (or whatever else you know for certain).

 

Then, once you have it in MySQL, you can work your PHP magic to fix it until there's nothing wrong.
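The "dumb first pass" can really be that dumb: one row per file, filename plus raw contents, cleaned up later.  A sketch, where the table name, column names, and credentials are all placeholders, not anything your host dictates:

```php
<?php
// Sketch of the dumb first pass: store each file verbatim, one row per
// file, and fix the data with PHP/SQL in later passes.
function rawRow($path)
{
    return array(
        'filename' => basename($path),
        'body'     => file_get_contents($path),
    );
}

/*
// The import loop itself (placeholder table/credentials):
$db   = new mysqli('localhost', 'user', 'pass', 'mydb');
$stmt = $db->prepare('INSERT INTO raw_pages (filename, body) VALUES (?, ?)');
foreach (glob('/path/to/files/*') as $path) {
    $row = rawRow($path);
    $stmt->bind_param('ss', $row['filename'], $row['body']);
    $stmt->execute();
}
*/
```

Prepared statements keep the raw HTML from breaking the query, which matters when the bodies contain quotes and markup.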

I think I found my solution, but it wasn't what I set out looking for.

 

I was looking in the wrong direction.  As I surfed for a solution, I stumbled on data mining, page scraping, and data harvesting.  Most of the files I have to work with are .html, so I dug into how to use these methods and came up with gold.

 

First I created an .html file with a link to all the files... that was simpler than I thought it would be.
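Generating that index page is only a few lines of PHP.  A sketch of the idea (the glob path is a placeholder for wherever the files actually sit):

```php
<?php
// Sketch: build one index page that links to every file, so a crawler
// like Web-Harvest sees the whole CD as a single browsable "site".
function buildIndex(array $paths)
{
    $html = "<html><body>\n";
    foreach ($paths as $p) {
        $href  = htmlspecialchars($p);
        $html .= "<a href=\"$href\">$href</a><br>\n";
    }
    return $html . "</body></html>\n";
}

// file_put_contents('index.html', buildIndex(glob('/path/to/files/*.html')));
```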

 

Once all the files were "linked" by the new file I created, I could run Web-Harvest (on SourceForge) or any number of other tools available on the web.  As soon as the files were all linked, the program treated them as a site and surfed the entire thing, extracting the data I wanted.  Web-Harvest took some playing around to configure, but it worked in the end.

 

That made me think: if you ever run into a website that has information you need spread all through it, this same tactic would work perfectly.  As a matter of fact, that is what these tools were really created for.

 

I'd recommend HTTrack or Web2Disk by Inspyder to capture the website's content, then run Web-Harvest to extract the data to a CSV, spreadsheet, or whatever you need.

 

All the tools mentioned are available on the web, some free and some not, but the paid ones are cheap just the same.

 

Keep this information in mind; it might come in handy some day!

 

Thank you to all who gave thought to my problem!

Jeff

 
