Jump to content

Parsing a flat file to load into MySQL


chadrt

Recommended Posts

I am hoping I have the forum correct it was a toss up.  I am looking to parse out what I believe to be a flat file.  I want to load this file into a MySQL database it seems to be very dificult to read a lot of lines multiple rows and very time consuming.  I want to put it in a MySQL so I can create indexs that will make the querys much faster.  Thanks for any help you may be able to provide me.

 

Here is the location of the file http://status.irlp.net/nohtmlstatus.txt

 

Chad

Link to comment
Share on other sites

Wow, lots of data! I've copied and pasted the lot into my text editor and can see that you've got all the columns separated by tabs. What I'd do is replace all these tab characters with a character that isn't used, like the bar (|) symbol.

 

Next create a table in your database that will hold the data.

 

I'd write a PHP script to read the file in then loop through each line exploding it. That would separate each piece of data into a long array which you can then easily add into the database.

Link to comment
Share on other sites

I've replaced all the tabs for you with a | symbol and uploaded it here for you:

http://www.pictureinthesky.net/external/nohtmlstatus.txt

 

Let me know when you've saved it and I'll delete it.

 

Using PHP, in a loop use on the file to explode the contents into an array:

$array=explode('|',$row);

$array is the array containing the line $row.

Link to comment
Share on other sites

Changing the tabs to pipes seams logical.  Using that I could easily load the file into mysql using BASH shell scripts that I created for something else I have already done.  But the problem is that I have to laod the data from that file every couple minutes.  That file is used to determine connection status of amateur radios.  we dont have access to the actual database that does that but that file is produced every three minutes so it is semi accurate.  My thoguht was to possibly load that into a mysql database.  I need figure out what each of the colums are VAR, INT ect. and then I will create it but I tried to load it into a database once already but I had a heck of time.  I thought it was TAB delimited but not being able to change it with a script is very dificult.  If there was a way to do a replace on the fly or a way to use a mysql statement that would load it as is into a database that would be great.  Like the one below.

 

delete from AM;
load data local infile './AM.dat' into table AM fields terminated by '|';

 

That would be an ideal situation to work with.  I could run that in a shell script every couple minutes and it would only around 10 seconds to run the database would be fairly easy to keep updated.

 

Chad

Link to comment
Share on other sites

You can change it with a PHP script.

 

Inside a loop, run the following:

$withbars=strreplace("\r","",$withtabs);

 

Run that on each line and you'll do the same as I have done - I converted them to bars using my text editor for speed.

 

Now you've said you're getting this list so often it may also be a good idea to look at how you may want to enter this data. One option would be to use FTP and upload it each time but the easiest way would be to write a script to upload it instead, then parse the data and populate a database with the results.

 

Two questions:

1. Do you need to keep the old data you uploaded previously?

2. If you need to keep it is the data to be kept separate or appended onto the old?

Link to comment
Share on other sites

I have written some basic shell scripts in the past to handle some of the tasks you speak of.  The idea I came up with was a small shell script that looks like the following.

 

#/bin/bash

cd ~/irlp
wget http://status.irlp.net/nohtmlstatus.txt
time /usr/bin/mysql -u DBUSER -pDBPASS --local-infile irlp < load.sql
rm nohtmlstatus.txt

 

That little snippet run by cron would do exactly what I need it to do.  But I am not sure how to word load.sql to handle the tab delimited file or if it is even capable of handling it.  Will it freeze up because of the trailing data on the first line of the flat file or will it ignore that because it there is not field defined for it.  I cant get it to recognize tab delimiters.  The snippet in my previous posting was what I tried before only I tried putting things like "T" or "t" even "tab" and "TAB" for the field termination.

Link to comment
Share on other sites

Here's a little command line awk+sed routine to convert that file into a typical MySQL "dump" file, suitable for importing like "mysql dbname < dumpfile.sql".  I only captured a few lines in the nohtmlstatus.txt file for this example, but you get the idea.  This can be used in a cron/sh script.

 

$ awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%d,%d,%d,%d,%d,%d,\"%s\",\"%s\",%d),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' nohtmlstatus.txt | sed '$s/,$/;/' > output.sql
$ cat output.sql
INSERT INTO tbl VALUES
(1000,"VE7RHS","Vancouver","BC","Canada","IDLE",1000,"1998-11-12",49,-123,1184001135,145,-600,100,"ve7ltd","http://www.ars.ams.ubc.ca",1184000779),
(1010,"VE7RHS","Vancouver","BC","Canada","IDLE",1010,"2005-12-30",49,-123,1184000812,441,5000,0,"VE7LTD","http://www.ars.ams.ubc.ca",1183947442),
(1015,"VE7RNV","North Vancouver","BC","Canada","IDLE",1015,"2003-07-08",49,-123,1184000963,444,5000,0,"VE7TLH","http://www.nsarc.ca",1183835505),
(1020,"VE7RNA","Chemainus","BC","Canada","OFFLINE",1020,"2001-01-20",48,-123,1184001029,146,-600,0,"VE7FRG","http://www.cvars.com",1183305560),
(1030,"VE7VIC","Victoria","BC","Canada","IDLE",1030,"2001-05-15",48,-123,1184001051,146,-600,100,"VE7BEU","http://www.ve7vic.ca",1183686731);

Link to comment
Share on other sites

It is a real pitty we cant just edit our previous posts...

 

Wildbug,

I have a couple of questions regarding this method that you posted.  Once I try to load it into a database there are problems.  Here is the error itself.

 

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '_blank"",1184113474),
(2018,"VE2REH","Gatineau","QC","Canada","IDLE",2018,"2002-' at line 93
s15242893:/irlp # ls
.  ..  irlp.sh  nohtmlstatus.txt  output.sql
s15242893:/irlp #

 

I have tried everything I can think of to make it work.

Link to comment
Share on other sites

2009	VE9LC	Saint John -EchoIRLP #285237	NB	Canada	IDLE	2009	2005-07-23	45.30333	-66.07316	1184001810	147.2700	600.0000	0.00	VE9NPS	http://www.lcarc.ca/          TARGET="_blank"	1183941823	O

 

That's the line generating the error.  It's the double quotes ("_blank") that are causing the problem; they need to be escaped.

 

Here's a solution:

awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{gsub(/"/,"\\\"");printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%d,%d,%d,%d,%d,%d,\"%s\",\"%s\",%d),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' nohtmlstatus.txt | sed '$s/,$/;/' > output.sql

 

The gsub() function will replace any " with \".

 

You may be able to write the whole command in one line without using intermediary files.

 

mysql db -u user -ppass < \
wget -qO - http://status.irlp.net/nohtmlstatus.txt | \
awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{gsub(/"/,"\\\"");printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%d,%d,%d,%d,%d,%d,\"%s\",\"%s\",%d),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' nohtmlstatus.txt | \
sed '$s/,$/;/'

Link to comment
Share on other sites

Correction to last code example in previous post:

 

mysql db -u user -ppass < \
wget -qO - http://status.irlp.net/nohtmlstatus.txt | \
awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{gsub(/"/,"\\\"");printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%d,%d,%d,%d,%d,%d,\"%s\",\"%s\",%d),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' | \
sed '$s/,$/;/'

Link to comment
Share on other sites

Here is what I have so far I am using it in a short script so a one liner is not necessary but what I keep getting is this

 

Your script line:

awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{gsub(/"/,"\\\"");printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%d,%d,%d,%d,%d,%d,\"%s\",\"%s\",%d),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17}' nohtmlstatus.txt | sed '$s/,$/;/' > output.sql

 

My Output:

ERROR 1136 (21S01) at line 1: Column count doesn't match value count at row 1

 

I am going to check my table and be sure that I havent messed up something.  Just out of curiosity would using the wrong column "type" cause that problem.  I am not that proficieint so I used a bunch of VARCHAR fields to create the table.

 

Table SQL:

CREATE TABLE `IRLP` (
  `node` tinyint(5) NOT NULL default '0',
  `callsign` varchar( NOT NULL default '',
  `city` varchar(50) NOT NULL default '',
  `prov_state` varchar(25) NOT NULL default '',
  `country` varchar(25) NOT NULL default '',
  `status` varchar(20) NOT NULL default '',
  `record` tinyint(6) NOT NULL default '0',
  `install_date` varchar(10) NOT NULL default '',
  `lat` varchar(15) NOT NULL default '',
  `long` varchar(15) NOT NULL default '',
  `lastupdate` varchar(10) NOT NULL default '',
  `freq` varchar(25) NOT NULL default '',
  `offset` varchar(10) NOT NULL default '',
  `pl` varchar(20) NOT NULL default '',
  `owner` varchar(75) NOT NULL default '',
  `url` varchar(100) NOT NULL default '',
  `last_status` varchar(25) NOT NULL default '',
  `avrs_status` varchar(5) NOT NULL default '',
  PRIMARY KEY  (`node`),
  KEY `node` (`node`),
  KEY `callsign` (`callsign`),
  KEY `prov_state` (`prov_state`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 

 

Link to comment
Share on other sites

Actually I think I found the problem but I am not sure how exactly how to fix it.  There are 18 fields but only 17 in the conversion script.  I could fix it but I am not sure what each field should be, like wether to put quotes or not.  The dates and such are what I need to know about I was not sure.  Does having a date hyphenated mean that it should be quoted or is it still a number filed that doesnt need to be quoted.

 

node

callsign

city

prov_state

country

status

record

install_date

lat

long

lastupdate

freq

offset

pl

owner

url

last_status

avrs_status

 

I am trying to understand how that line works so that I am learning it rather than having someone else do it for me all the time.  Would not do me or any one here any good if I come here to simply have others write the code for me.

Link to comment
Share on other sites

You're right; I missed the last column.

 

Updated awk line only:

awk 'BEGIN{FS="\t+";getline; print "INSERT INTO tbl VALUES"}{gsub(/"/,"\\\"");printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%.6f,%.6f,%d,%.6f,%.6f,%.2f,\"%s\",\"%s\",%d,\"%s\"),\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18}' nohtmlstatus.txt

 

I looked at your table definition, and here is a suggested replacement:

 

CREATE TABLE `IRLP` (
`node` SMALLINT UNSIGNED NOT NULL DEFAULT 0,     
`callsign` VARCHAR( NOT NULL DEFAULT '',
`city` VARCHAR(64) NOT NULL DEFAULT '',
`prov_state` VARCHAR(64) NOT NULL DEFAULT '',
`country` VARCHAR(64) NOT NULL DEFAULT '',
`status` VARCHAR(20) NOT NULL DEFAULT '',
`record` SMALLINT UNSIGNED NOT NULL DEFAULT 0,
`install_date` DATE NOT NULL DEFAULT '0000-00-00',
`lat` FLOAT(9) NOT NULL DEFAULT '',
`long` FLOAT(9) NOT NULL DEFAULT '',
`lastupdate` INT UNSIGNED NOT NULL DEFAULT 0,
`freq` FLOAT(9) NOT NULL DEFAULT 0,
`offset` FLOAT(9) NOT NULL DEFAULT 0,
`pl` DECIMAL(5,2) NOT NULL DEFAULT 0,
`owner` VARCHAR(75) NOT NULL DEFAULT '',
`url` VARCHAR(255) NOT NULL DEFAULT '',
`last_status` INT UNSIGNED NOT NULL DEFAULT 0,
`avrs_status` VARCHAR(5) NOT NULL DEFAULT ''
  PRIMARY KEY  (`node`),
  KEY `node` (`node`),
  KEY `callsign` (`callsign`),
  KEY `prov_state` (`prov_state`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 

I mostly left VARCHAR lengths alone, but changed the column types to best suit the data.  In some cases your integer columns were not large enough to hold the data (as in TINYINT on the first column can only hold integers up to 128 in SIGNED mode; with SMALLINT UNSIGNED you'll be good up to 65535).  With the proper column types you'll be able to better query your data.  I didn't test the CREATE TABLE statement, so let me know if there are any errors I missed.

 

As far as the awk code, I just learned it myself.  Awk is a UNIX program that reads a file line-by-line, assumes the data is in columns, and can perform functions on this data before outputting it.  Statements consist of a regular expression and an action.  Lines matching the regex get the action.  Omitting either regex or action means "all lines" or "print the line," respectively.  Two special regexes are BEGIN and END, and they mean to perform the action before or after reading the file, respectively.

 

BEGIN {                                 # Before reading anything, do this
FS="\t+";                       # Change the Field Separator to one or more tabs
getline;                        # Read in the first line (header) and discard it
print "INSERT INTO tbl VALUES"  # Print the beginning of our SQL statement
}
{                          # No regex; do this for every line
gsub(/"/,"\\\"")   # Perform a global substitution, replacing " with \"
                   # $0 represents the entire line; $1 represents the first field, $2 second, etc.
                   # printf is just like printf in C or PHP
printf "(%d,\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d,\"%s\",%.6f,%.6f,%d,%.6f,%.6f,%.2f,\"%s\",\"%s\",%d,\"%s\"),\n",
	$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18
}

sed is a UNIX command line stream editor (sed); I used it to replace
the last comma from the last line with a semi-colon.

sed '$s/,$/;/'

The first $ means "the last line only."
,$ is a regular expression matching a terminating comma.
; is the replacement.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.