
Using a zillion files to store data???


tcjohans


Hi,

 

I am trying to figure out an architecture for a flat file database.

The overall problem is how to random-access the data when fields have variable or unspecified lengths.

 

So, one idea I am considering is to put such data in separate files. Each field with variable-length data would have its data stored in a separate file for each record. So, e.g. if we are constructing a table "users" for a web site, then the field "user_email" might hold variable-length data and thus be put in a separate file according to this idea. The name of each such file would be unique. E.g. for userid 2454, the user_email value would be stored in a file named "2454_user_email.txt". (The possible benefit of this is that you would be able to quickly access the data in a "random-access" manner; you just look for the appropriate file name - instead of needing to search through a file - and there you've got the data. Instead of PHP searching a text file for the data, you let the server search for the file with the right name.)
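
For illustration, here is a rough sketch of what I have in mind (the data/ folder and the helper names are just placeholders I made up):

<?php
// Rough sketch of the one-file-per-field idea (folder and function names are placeholders).

define('DATA_DIR', 'data');
if (!is_dir(DATA_DIR)) { mkdir(DATA_DIR); }

// Store one variable-length field value in its own file, e.g. data/2454_user_email.txt
function field_write($userId, $field, $value) {
    return file_put_contents(DATA_DIR . "/{$userId}_{$field}.txt", $value);
}

// Read it back just by building the file name - no searching through a big file.
function field_read($userId, $field) {
    $path = DATA_DIR . "/{$userId}_{$field}.txt";
    return is_file($path) ? file_get_contents($path) : null;
}

field_write(2454, 'user_email', 'someone@example.com');
echo field_read(2454, 'user_email'); // someone@example.com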

Anyway, this means that there would be a very large number of files for a table with many rows. For instance, for the above example, if the site has 50,000 users, then you would have 50,000 files just for specifying their various emails.

 

Now, my question is:

 

Does this make sense? 50,000 files sounds like a lot: would it take a long time to find the right file?

 

Also, could performance be improved e.g. by dividing up the files into different folders (according to some system) rather than having all of them in the same folder?
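
For instance, one scheme I could imagine (just a sketch; the two-digit bucket is an arbitrary choice of mine):

<?php
// Sketch: shard the files into subfolders based on the last two digits of the
// user id, so 50,000 files end up spread over ~100 folders instead of one.
function shard_path($userId, $field) {
    $bucket = str_pad($userId % 100, 2, '0', STR_PAD_LEFT); // e.g. 2454 -> "54"
    return "data/{$bucket}/{$userId}_{$field}.txt";
}

echo shard_path(2454, 'user_email'); // data/54/2454_user_email.txt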

 

Another thing: this whole idea would mean that a whole lot of files might need to be opened and closed just to search through a table. For instance, if I want to search for all users whose email contains the string "ade" (or whatever), I would need to open and close 50,000 files (given the above example). If, on the other hand, this data is all stored in the same file, there would only be one file to open and close; but then PHP would still need to search through it for all occurrences of email data - which might also be very time-consuming.
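
To make the comparison concrete, the many-files version of that search would look roughly like this (a sketch; with 50,000 users this loop opens 50,000 files):

<?php
// Sketch of the "scan every per-field file" approach: find all users whose
// stored email contains "ade". Every single file has to be opened and read.
$matches = [];
foreach (glob('data/*_user_email.txt') ?: [] as $path) {
    if (stripos(file_get_contents($path), 'ade') !== false) {
        $matches[] = basename($path, '.txt'); // e.g. "2454_user_email"
    }
}
print_r($matches);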

 

Any ideas on the matter?

 

Thomas


Yes, it is time-consuming, and what is the point? Flat files have been considered pointless since MySQL became stable and FREE!

 

 

Edit:

 

If you're trying to make your own database language, props to you, but that is corporate-level development that someone posing the question on a public forum probably can't handle.


Yes, it is time-consuming, and what is the point? Flat files have been considered pointless since MySQL became stable and FREE!

 

Would you say that it is more time-consuming than storing all the data in the same file and making separate PHP-based searches for the data in that one file?

 

As for the rest, just assume for a moment that it does have some point... I am aware of MySQL and use it frequently but right now I feel a need for a flat file data handling system.


What is compelling you toward a flat file vs MySQL? With a flat file you have no choice but to read the entire length of the file. Also, you lose the power of indexing, queries and so forth, which are what make any database powerful.

 

Oh, I could easily do that with PHP. I can random-access a single file at any point if I know exactly where to look for the data. That works fine with records of fixed length (which thus only contain fixed-length fields), as long as I have information about how long each row is, etc. The problem is with data of variable length, which needs to be handled separately.

 

Also, support for indexing, queries, etc. can "easily" be created with PHP as well - it just takes a bit of work and a good architectural plan.

 

As I see it, the creation of a database faces in principle the same architectural problems whether it is compiled or based on PHP. The main difference is that a compiled program is somewhat quicker.

 

I don't want to shift focus from the original question - which is not about the relative merits of PHP-based databases vs MySQL - but I do think there can be some benefits to a flat-file-based system: for instance, (i) you can do more things and have greater flexibility in building in the functionality you need; (ii) some web sites don't have access to MySQL; and (iii) I am looking for a module that can easily be incorporated into other PHP modules to provide a database-like backend for their needs.

 

Anyway, the issue I am asking about here is how one would architect a flat file database: is using a separate file for each variable-length data field a good or bad solution compared with storing it all in one single file and reading it with PHP (an analogous problem is faced by MySQL as well)?

 

Thomas


Well, opening a lot of files = bad, as each one is loaded into memory; however, loading one large file is the same issue.

I guess you need to look at how PHP specifically handles file opening and file handles. Comparing opening a lot of small files vs 1 large file might be a good idea.

I see the best answer as neither Files = 1 nor Files = infinity; it is some value in between that offers the best solution. However, I will say this: PHP is considered very crude compiled C. It doesn't handle things in a very optimized way, and is considered bulky and ugly by high-end developers working on higher-end scripts. This might come back to haunt you, as PHP can't handle the file fast enough.

Also, the fact remains that PHP cannot get partial file content based on a search criterion. It must gather X bytes of a file, or all bytes; to search, you will have to read the whole file, vs a back-end system that knows the file structure and can figure out what it wants. Not saying you can't set it up - I'd be interested in the solution - however, I just don't see it as a feasible solution for a single person.


Well, opening a lot of files = bad, as each one is loaded into memory; however, loading one large file is the same issue.

I guess you need to look at how PHP specifically handles file opening and file handles. Comparing opening a lot of small files vs 1 large file might be a good idea.

I see the best answer as neither Files = 1 nor Files = infinity; it is some value in between that offers the best solution. However, I will say this: PHP is considered very crude compiled C. It doesn't handle things in a very optimized way, and is considered bulky and ugly by high-end developers working on higher-end scripts. This might come back to haunt you, as PHP can't handle the file fast enough.

Also, the fact remains that PHP cannot get partial file content based on a search criterion. It must gather X bytes of a file, or all bytes; to search, you will have to read the whole file, vs a back-end system that knows the file structure and can figure out what it wants. Not saying you can't set it up - I'd be interested in the solution - however, I just don't see it as a feasible solution for a single person.

 

If opening a file means that it is all read into memory, then there is a problem, of course (but just a problem...). The ideal thing would be if one could somehow access data that is part of a file without actually opening it in the sense of loading it all into memory. Someone mentioned something about using "file streaming concepts" instead, and I am trying to figure out what that means and see if it could be a solution.

 

Also, I think I have actually found a way to get a piece of data from a file without reading all of it. There needs to be information somewhere about which file a particular field (or its data) is stored in, at what position, and with what length. If such information can be provided - and it can, with a reasonably well-done system - then accessing such a limited part of a file shouldn't be a problem, say with fseek(), without needing to read the entire file. Would you see a problem here?
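
Roughly what I mean (a sketch; the file name, offset and length are made-up values that would come from such an index):

<?php
// Sketch: read a single field without loading the whole file, given that an
// index kept elsewhere says the value starts at byte 1024 and is 37 bytes long.
$offset = 1024;  // made-up position taken from the index
$length = 37;    // made-up length taken from the index

$fp = fopen('users.dat', 'rb');
fseek($fp, $offset);           // jump straight to the value
$value = fread($fp, $length);  // read only those bytes
fclose($fp);
echo $value;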

 

Also, maybe I should add that I think the most problematic performance issue has to do with extensive searches in very large databases - handling SELECT ... WHERE ... statements, for example. What I am thinking is that it might eventually be possible to develop a separate small database-searcher program in C++ for that particular task, with which PHP could interact (just as it interacts with MySQL).

 

 


Using the Windows file system over MySQL? I smell an accident.

 

I made an experiment yesterday. I created 30,000 files with short, arbitrary content, gave them names like 1.txt, 2.txt, etc. to make file lookup easy and have a best-case scenario, and then used a loop with file_get_contents to read each in turn. 10,000 files took 34 secs and 30,000 took 140 seconds... So, I'll build it with just one or a few files instead!
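
Roughly the kind of loop I used (not the exact script, but the same idea):

<?php
// Rough reconstruction of the experiment: create N small files, then time how
// long it takes to read them all back one by one with file_get_contents.
$n = 30000;
if (!is_dir('bench')) { mkdir('bench'); }

for ($i = 1; $i <= $n; $i++) {
    file_put_contents("bench/{$i}.txt", "some short content {$i}");
}

$start = microtime(true);
for ($i = 1; $i <= $n; $i++) {
    $data = file_get_contents("bench/{$i}.txt");
}
printf("Read %d files in %.1f seconds\n", $n, microtime(true) - $start);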


Well, opening a lot of files = bad, as each one is loaded into memory; however, loading one large file is the same issue.

 

I doubt this is correct. I checked the PHP manual yesterday for fopen(), and it mentions nothing about fopen() loading the file into memory. It says: "fopen() binds a named resource, specified by filename, to a stream."

I _guess_ that what happens - if this makes sense - is that fopen creates a dynamic pointer to the start position of the file, and that a variable name is then associated with that pointer (but I have only a rudimentary understanding of the deeper processes here, so I may be completely wrong or just speaking nonsense).

 

In any event, if I am right, there should be no need to read an entire file into memory - provided information is kept somewhere (some sort of index) about exactly where each piece of information is to be found (file and file position), or there is an easy way of calculating the file and location. (E.g. if each record has a fixed length, the same for all, the position where a record starts in the file would be something like: row number * record length.) One could use fseek() to get to such a position, and then use fgets() and fgetc() to get content starting at that position (without having to read or load the entire file). Could someone confirm whether I am right???
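
In code, what I picture is something like this (a sketch; the record length and file name are made up):

<?php
// Sketch: random access into a file of fixed-length records.
// Assumes every record was written padded to exactly RECORD_LEN bytes.
const RECORD_LEN = 64;   // made-up record length
$row = 1234;             // record number, counting from 0

$fp = fopen('users_fixed.dat', 'rb');
fseek($fp, $row * RECORD_LEN);      // position = row number * record length
$record = fread($fp, RECORD_LEN);   // read just that one record
fclose($fp);
echo rtrim($record);                // strip the right-padding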

 


Yes, fopen is just a resource binding, not the physical data, but your idea of fseek is not going to work, nor will setting a parameter on fgets to a certain byte length. This is simply because a flat file grows and shrinks linearly, whereas MySQL grows and shrinks in 3 directions. Sure, you can say "I want a certain entry out of file A because file B tells you the structure of file A", but the problem then arises that although you know its generalized location, that location depends on the location of all the data in front of it (or behind it, if you attempt to read backwards) - this is the linear growth I am talking about. Now, there is a way around this if you force-pad every input to the input's max size. I.e. if you have a varchar that is 16 characters, you pad it to the right with spaces to fill 16. Then you can know the record's location based on an index and pull it, no problem. (Of course you have to read the index initially, so at some point you have to read a whole file.) However, this is going to result in some astronomical files if you attempt to create a blogging/forum db, as you will need to pad to L(max), which will probably realistically be 5-5000 times larger than L(true), where L is the length of the input in bytes. Making any sense now?
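
To illustrate the padding (just a sketch, not tested; the column widths and file name are arbitrary):

<?php
// Illustration of the padding idea: every value is padded to its column's
// maximum width, so every record ends up exactly the same length.
$emailMax = 16;   // e.g. a varchar(16) column
$nameMax  = 8;

$email = 'bob@example.com'; // 15 bytes of real data -> padded to 16
$name  = 'Bob';             // 3 bytes of real data  -> padded to 8

$record = str_pad(substr($email, 0, $emailMax), $emailMax)
        . str_pad(substr($name, 0, $nameMax), $nameMax);

file_put_contents('users_fixed.dat', $record, FILE_APPEND); // 24-byte record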

 


Yes, fopen is just a resource binding, not the physical data, but your idea of fseek is not going to work, nor will setting a parameter on fgets to a certain byte length. This is simply because a flat file grows and shrinks linearly, whereas MySQL grows and shrinks in 3 directions. Sure, you can say "I want a certain entry out of file A because file B tells you the structure of file A", but the problem then arises that although you know its generalized location, that location depends on the location of all the data in front of it (or behind it, if you attempt to read backwards) - this is the linear growth I am talking about. Now, there is a way around this if you force-pad every input to the input's max size. I.e. if you have a varchar that is 16 characters, you pad it to the right with spaces to fill 16. Then you can know the record's location based on an index and pull it, no problem. (Of course you have to read the index initially, so at some point you have to read a whole file.) However, this is going to result in some astronomical files if you attempt to create a blogging/forum db, as you will need to pad to L(max), which will probably realistically be 5-5000 times larger than L(true), where L is the length of the input in bytes. Making any sense now?

 

Hi, I am not sure exactly if I understand - but I hope I do. My plan is to split data into two parts: data that is of fixed length and data that is of variable length. All data of fixed length is held in one file, which holds no variable-length data. So in this file there should not be a problem with exact random-access to the data, right?

Then I put all variable-length data in another file. Here I am thinking in terms of working in "blocks". Basically, for illustration, say you are working with 10 blocks of data, occupying these positions:

Block  Positions

1        1-10

2        11-20

3        21-30

etc.

Then suppose you edit block 2, so that the result is just 6 bytes (instead of 10). If that happens, you just split block 2 into two new ones. You now have the following structure:

Block  Positions  Status

1      1-10       Occupied

2      11-16      Occupied

3      17-20      Free

4      21-30      Occupied

etc.

(By the way, the blocks don't have unique or fixed ID's - they're just identified by the position they start at.)

Now, block 3 would be free to hold 4 bytes of data if something of that size ever needs to be allocated to free space.

Alternatively - let's go back to the starting point - suppose that instead of editing block 2 down to 6 bytes, you edit it to, say, 22 bytes.

Now, that data won't fit into block 2 any longer. The solution will be one of two things:

(1) You find some other free blocks for it (if necessary by merging two or more empty adjacent blocks).

(2) You extend the file, appending a new block at the very end, and put the data in that block.

In turn, the old block 2 would now be free to hold new data if something eventually needs the space.

 

In any event, with this system, editing and deleting data would not require changing all the data's starting points. Each change affects only one piece of data, and an index could feasibly be maintained that registers every piece's starting point.
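
A very rough sketch of the bookkeeping I have in mind (nothing tested; the free list is just an array of made-up [start, length] pairs):

<?php
// Sketch: free space is tracked as [start, length] pairs. Allocating takes the
// first free block that fits; any leftover bytes stay on the list as a smaller
// free block. If nothing fits, the caller appends a new block at the end of the file.
$free = [[10, 6], [30, 20]];   // made-up free blocks

function allocate(array &$free, int $need) {
    foreach ($free as $i => [$start, $len]) {
        if ($len >= $need) {
            array_splice($free, $i, 1);                   // block is no longer free
            if ($len > $need) {
                $free[] = [$start + $need, $len - $need]; // keep the remainder free
            }
            return $start;                                // position to write the data at
        }
    }
    return null;  // nothing fits: append a new block at the end of the file instead
}

$pos = allocate($free, 4);  // returns 10; [14, 2] remains on the free list
var_dump($pos, $free);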

 

Does this make sense, and does it respond to the problem you had in mind? I hope I understood you...

 

What did you mean when you said that MySQL grows and shrinks in 3 directions, by the way??

 

 

 


Now, there is a way around this if you force-pad every input to the input's max size. I.e. if you have a varchar that is 16 characters, you pad it to the right with spaces to fill 16. Then you can know the record's location based on an index and pull it, no problem. (Of course you have to read the index initially, so at some point you have to read a whole file.)

 

Just a small point: I don't think reading an index necessarily means you have to read an entire file - not if the index file has fixed-length rows. Suppose I want to know the position of data A corresponding to record 1234, and an index tells me this. Suppose also that the index file has the same length, L, for each row. Then I can simply calculate the position X where data A's position is written in the index file as (1234 - 1) * L - or some similar calculation depending on the structure of the index file. Then I fopen() the file, get to position X by means of fseek(), and read from that position by means of fgets(). No entire file is read at all, though it is opened.
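
Concretely, something like this is what I picture (again just a sketch; the 16-byte index-row layout and the file names are made up):

<?php
// Sketch: each index row is exactly 16 bytes and stores, as plain digits,
// the offset (10 bytes) and length (6 bytes) of one record in the data file.
const IDX_ROW = 16;
$recordNo = 1234;   // counting from 0

$idx = fopen('user_email.idx', 'rb');
fseek($idx, $recordNo * IDX_ROW);      // jump to that record's index row
$row = fread($idx, IDX_ROW);
fclose($idx);

$offset = (int) substr($row, 0, 10);
$length = (int) substr($row, 10, 6);

$dat = fopen('user_email.dat', 'rb');  // now read just that slice of the data file
fseek($dat, $offset);
$value = fread($dat, $length);
fclose($dat);
echo $value;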


I may be mistaken, but I believe that it has to read the entire file into memory and then place the pointer at the beginning of the file, at which point you can call fseek() and go to the position you want.

 

I don't see how PHP can just go to a particular place in a file without having at least read all data prior to the seek point in the file.

 

fopen is the handle for the file. fseek, fgets, and all the other file functions rely on it being called first. fopen reads the entire file into memory, and you can then work with that data. But you have to have the fopen handle first.

 

Anyone agree??

 

nate


 

Just a small point: I don't think reading an index necessarily means you have to read an entire file - not if the index file has fixed-length rows. Suppose I want to know the position of data A corresponding to record 1234, and an index tells me this. Suppose also that the index file has the same length, L, for each row. Then I can simply calculate the position X where data A's position is written in the index file as (1234 - 1) * L - or some similar calculation depending on the structure of the index file. Then I fopen() the file, get to position X by means of fseek(), and read from that position by means of fgets(). No entire file is read at all, though it is opened.

 

 

Conceptually you're wrong: your index files are what you are searching through for your records, otherwise you'd be searching the database files themselves, which is what you are trying not to do. Either way, it can't be done without writing a massive, normalized database system, and even then you'll be 5 years behind what SQL, ODBC and Oracle can all do.


 

Just a small point: I don't think reading an index necessarily means you have to read an entire file - not if the index file has fixed-length rows. Suppose I want to know the position of data A corresponding to record 1234, and an index tells me this. Suppose also that the index file has the same length, L, for each row. Then I can simply calculate the position X where data A's position is written in the index file as (1234 - 1) * L - or some similar calculation depending on the structure of the index file. Then I fopen() the file, get to position X by means of fseek(), and read from that position by means of fgets(). No entire file is read at all, though it is opened.

 

Conceptually you're wrong: your index files are what you are searching through for your records, otherwise you'd be searching the database files themselves, which is what you are trying not to do. Either way, it can't be done without writing a massive, normalized database system, and even then you'll be 5 years behind what SQL, ODBC and Oracle can all do.

 

I am not sure I understand when you say that "conceptually" I am wrong... Please tell me what you mean. Possibly you mean that I would need to read the entire index file under the structure I've outlined. (In which case I would use fixed-length records so as to be able to calculate the relevant position and then fseek - but I am just repeating myself here...)

I think the issue really just comes down to whether or not calling fopen() involves reading the entire file into memory.

 

Also, I do agree that for the system to work it must in principle have an advanced structure, with indexes etc., like MySQL et al. But that's just the fun part... In principle, PHP and C++ face essentially the same architectural issues in a project of this kind. For instance, MySQL is also based on reading and writing a few files. In the end, however, PHP would end up with lower performance. I read somewhere that PHP is about 30% slower than compiled programs - if so, I reckon that a pure PHP-based database would be about 30% slower than e.g. MySQL.

 

In any event, there are a few possible advantages or considerations:

- Particular tasks that require faster or special treatment could, bit by bit, be "outsourced" to small C++-written executables - e.g. extensive searches. This might make up for any performance issues vis-a-vis MySQL.

- If the architecture per se is sound but the implementation is in PHP and open source, then I could imagine that over time people would easily be able to contribute small fixes, additional functionality, etc. There might in this way eventually be a lot more that people could achieve with this system than with e.g. MySQL.

- A PHP-based system will be easier to integrate in PHP-sites than MySQL.

- In principle, MySQL is an external interface to PHP, so doing away with it might _possibly_ be a performance gain in _some_ aspects.

- Also, MySQL is not oriented primarily toward web applications and PHP. I don't know concretely what that could mean in terms of performance for PHP, but I am sure there are aspects in which PHP does not get everything it could have if MySQL had been built mainly for PHP applications. (That's the way things generally are when something built for A is also adapted for B.)

 

 

