
Convert Raw HTML to Mysql


bschultz


I have a bunch of pages (recipes) that were hard coded a few years ago in straight html.  Now, I put every NEW recipe into a database.  I'd like to take all of these raw html files...and insert the data from them into the db.

 

- The date is on line 15 of the html file (May 29, 2007<br>). I need to convert this to a MySQL date format and remove the <br>.

 

- I need to drop lines 1-14 entirely (they should not be inserted into the db)

 

- The html files have <p> instead of <br />...I'd like to replace the <p> with <br />

 

- I need to ignore the last three lines (they hold the closing </body> and </html> tags and some other pure html stuff)

 

What is the best way to do this?


1. Crawl the HTML files, and save them in a file locally.

2. Use this to strip X number of lines (in your case, 14) from the beginning of the file: http://stackoverflow.com/questions/215896/how-to-use-php-to-delete-x-number-of-lines-from-the-beginning-of-a-text-file

3. Use standard regex for the string replaces (or str_replace).

Then just strip off the last 3 lines (reverse engineer some of what you found in #2, or just str_replace over </body> and </html> with an empty string).
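For what it's worth, the steps above can be sketched in a few lines once the file has been read into an array with file(). This is only a minimal sketch: extractRecipeBody() is a made-up helper name, and the defaults of 14 leading and 3 trailing lines are taken from the question, so adjust them to the real files.

```php
<?php
// Hypothetical helper (name and defaults are assumptions, not from the thread).
// Drops the first $skipTop and last $skipBottom lines, then swaps <p> for <br />.
function extractRecipeBody(array $lines, int $skipTop = 14, int $skipBottom = 3): string
{
    // Number of lines left once the header and footer are removed.
    $length = count($lines) - $skipTop - $skipBottom;
    if ($length <= 0) {
        return '';   // file too short to contain a recipe body
    }

    $body = array_slice($lines, $skipTop, $length);

    // str_ireplace is case-insensitive, so <P> is handled as well as <p>.
    return str_ireplace('<p>', '<br />', implode('', $body));
}
```

With a real file this would be called as extractRecipeBody(file($filename)), since file() returns one array element per line with the line ending still attached.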


Alright, I've figured out the code to do what I need it to...but as I've been looking through the static html pages, the content I want to start with doesn't always start on line 15.  The line I want to start with is always the date (formatted like August 01, 2011)...how can I remove everything before the date when I don't know what line the date is on?
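One possible way to handle a date that moves around is to scan the lines for the first one that looks like a date, rather than relying on a fixed line number. The sketch below assumes the date sits at the start of its line and uses a full English month name, as in the two examples given in this thread; findDateLine() and toMysqlDate() are hypothetical names.

```php
<?php
// Hypothetical helper: index of the first line that starts with a date such as
// "May 29, 2007" or "August 01, 2011", or null if no line matches.
function findDateLine(array $lines): ?int
{
    $months = 'January|February|March|April|May|June|July|August|'
            . 'September|October|November|December';
    $pattern = '/^\s*(?:' . $months . ')\s+\d{1,2},\s+\d{4}/i';

    foreach ($lines as $index => $line) {
        if (preg_match($pattern, $line) === 1) {
            return $index;
        }
    }
    return null;
}

// Hypothetical helper: turns "May 29, 2007<br>" into "2007-05-29".
function toMysqlDate(string $dateLine): ?string
{
    // strip_tags removes the <br>, trim removes the line ending.
    $timestamp = strtotime(trim(strip_tags($dateLine)));
    return $timestamp === false ? null : date('Y-m-d', $timestamp);
}
```

Once the index is known, everything before it can be dropped with array_slice($lines, $index), and the title would be at $index + 1 if the layout matches the pages described here.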


Still plugging away at this.  I've decided to try to search the array for the date (since the htm page is named by the date, I should be able to find its value in the array).

 

Here's the latest code:

 

<?php
foreach (glob("/home/briansch/public_html/testing/*.htm") as $filename) {   //get all filenames that have a .htm extension
    $file = $filename;

    $find2[] = '.htm';  //remove the extension of the file to start the process of searching the html file for the date
    $replace2[] = '';
    $text2 = str_replace($find2, $replace2, $filename);
    $new_date_line2 = $text2;  //now we have the name of the file (which is a date)...with the directory path listed before it

    $find3[] = '/home/briansch/public_html/testing/';    //  remove the directory path from the name of the file
    $replace3[] = '';
    $text3 = str_replace($find3, $replace3, $new_date_line2);
    $new_date_line4 = $text3;  //now we have the date of the file

    $mysql_date_format2 = date("F j, Y", strtotime($new_date_line4));   /// convert the file name (which is in m-d-y format)
    //echo $mysql_date_format3;

    $lines = file($filename);   // put the lines of the file into an array
    $count = count($lines);  // used below to remove the last three lines from the array
    //$key = array_search($mysql_date_format2, $lines);       // this didn't do anything
    $findme[] = $mysql_date_format2;   // tried this with and without the []...no difference...nothing echoed below
    $key = array_search( $findme, $lines );
    echo "<br /><br />Line of Date is - $key<br />--------------<br />";

    /// everything below this line works...

    $date_line = $lines[13];   // I will need to change the array number to the number found above in $key
    $find[] = '<br>';
    $replace[] = '';
    $text = str_replace($find, $replace, $date_line);
    $new_date_line = $text;

    $title_line = $lines[(14)];   // I will need to change the array number to the value of $key + 1
    $find2[] = '</strong>';
    $replace2[] = '';
    $text2 = str_replace($find2, $replace2, $title_line);
    $new_title_line = $text2;

    $mysql_date_format = date("Y-m-d", strtotime($new_date_line));

    foreach ($lines as $line_num => $line) {
        if ($line_num <= 14) {
            echo "";
        } elseif ($line_num >= ($count - 3)) {
            echo "";
        } else {
            $thisline = htmlspecialchars($line);
            $recipe .= $thisline;
        }
    }
    echo "$mysql_date_format<br />";
    echo "$new_title_line<br />";
    echo "$recipe<br /><br />---------------------------------<br /><br /><br /><br />";
}
?>

 

As you can see in the comments, it's not finding the value of the date in the array.  Am I off base in how I'm going about this?

 

Thanks!
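A likely reason the search comes back empty: array_search() compares the needle against each whole array element, and file() leaves the trailing <br> and line ending on every line, so "May 29, 2007" never equals "May 29, 2007<br>" plus a newline. Passing $findme as an array makes it worse, because an array is then being compared to strings. On top of that, date("F j, Y") prints the day without a leading zero, while the pages apparently use dates like "August 01, 2011". A minimal sketch of a substring search that tries both spellings follows; findLineContaining() is a hypothetical name and the sample lines are made up for illustration.

```php
<?php
// Hypothetical helper: index of the first line that contains any of the
// needles as a substring (case-insensitive), or null when nothing matches.
function findLineContaining(array $lines, array $needles): ?int
{
    foreach ($lines as $index => $line) {
        foreach ($needles as $needle) {
            if (stripos($line, $needle) !== false) {
                return $index;
            }
        }
    }
    return null;
}

// Build both spellings of the date: "August 1, 2011" and "August 01, 2011".
$timestamp = mktime(0, 0, 0, 8, 1, 2011);
$needles = [date('F j, Y', $timestamp), date('F d, Y', $timestamp)];

// Made-up sample standing in for the array returned by file().
$lines = ["<html>\n", "August 01, 2011<br>\n", "<strong>Title</strong>\n"];

$key = findLineContaining($lines, $needles);
```

Comparing the result with null (rather than testing it loosely) matters here, because a match on the very first line returns index 0. It may also be worth echoing what strtotime() returns for the file name itself, since dash-separated dates are not read as month-day-year by strtotime().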

