Parsing large text file.


genu

Hello, I'm working on an application that returns Bible text from a text file. Basically, the user will enter a URL such as: www.mydomain.com/index.php?ver=eng&book=genesis&chapter=1&verse=5

 

Then my script will parse the file eng.txt, which contains the whole Bible in English, for those criteria. Here is what the text file looks like:

 

01O	1	1		10	In the beginning God created the heaven and the earth.
01O	1	2		20	And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
01O	1	3		30	And God said, Let there be light: and there was light.
01O	1	4		40	And God saw the light, that it was good: and God divided the light from the darkness.
01O	1	5		50	And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.
01O	1	6		60	And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.
01O	1	7		70	And God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.
01O	1	8		80	And God called the firmament Heaven. And the evening and the morning were the second day.
01O	1	9		90	And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.
01O	1	10		100	And God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.
01O	1	11		110	And God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.
01O	1	12		120	And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good.
01O	1	13		130	And the evening and the morning were the third day.
01O	1	14		140	And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years:
01O	1	15		150	And let them be for lights in the firmament of the heaven to give light upon the earth: and it was so.
01O	1	16		160	And God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.
01O	1	17		170	And God set them i...

 

The first number, "010", is the book name, in this case "genesis". The second number, "1", is the book number (as in the first book of the Bible). The next number, "1", is the verse number, and I'm not sure what the significance of the last number is.

 

Can somebody point me in the right direction for writing a script that parses this file and filters out only certain parts of it? I've played around with fopen and fgets; what other functions can I try?

 

thanks

 

Every single verse is on a separate line in the file.

I'm guessing the file is large, so the ideal choice would be a database. It is more efficient and, with a unique id for each phrase, easier to work with. If you stick with the file approach, you can easily parse the lines with string functions (I wish I could help you with regex). I'm assuming that the phrase number (ex: 0101220) is calculated from the URL variables.

 

<?php
$handle = fopen('file.txt', 'r');
$content = fread($handle, filesize('file.txt'));
fclose($handle);
$number = '0101220'; // normally calculated from URL variables
// Drop everything up to and including $number
$content = substr($content, strpos($content, $number) + strlen($number));
// Keep only the text up to the first line break (nl2br() marks it with '<br />')
$content = substr($content, 0, strpos(nl2br($content), '<br />'));
echo $content;
?>

 

Basically, it trims away the text up to and including $number (in this case '0101220') and then cuts what is left at the first occurrence of a br, which is where the line ends. I tested it and it should do the job, but as I said, this isn't the ideal way to achieve something like this. Use a database instead.

Well, I'm not sure how many verses there are in the Bible, but using a text file for this purpose seems very inefficient. I would suggest using a database. You could create a script that reads the file one time and creates all the records needed. That one-time process would cost about the same as reading the file does now on each and every request, but getting the results from a database afterwards would be much faster.

 

In any case, I would suggest using file(), which reads the text file into an array with each line of the file as an array element. You can then loop through the array, explode each line on the tab character (I assume that is the delimiter) and search for the appropriate values.

 

You will also need a lookup list to translate between the book name and the book number. For example:

<?php

// Keys must match the book code in the first column of the file
$books = array (
   '010' => 'Genesis',
   '020' => 'Exodus',
   // etc.
);

?>

 

(not tested)

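<?php

// Lookup values
$bk_lu = $_GET['book'];
$ch_lu = $_GET['chapter'];
$vs_lu = $_GET['verse'];
// basename() keeps the language code from pointing outside this directory
$lang = basename($_GET['ver']);

// Read file into array, one line per element
$verses = file($lang . '.txt');

foreach ($verses as $verse_data) {

 // Columns in the sample: book, chapter, verse, (empty), number, text
 $cols = array_pad(explode("\t", $verse_data), 6, '');
 list($bk_no, $ch_no, $vs_no, , , $text) = $cols;

 // $books is the lookup array shown above; book names compare case-insensitively
 if (isset($books[$bk_no]) && strcasecmp($books[$bk_no], $bk_lu) == 0
     && $ch_no == $ch_lu && $vs_no == $vs_lu) {
   echo trim($text);
   break;
 }

}

?>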
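
For what it's worth, if loading the whole file with file() turns out to use too much memory, the same lookup can be done one line at a time with fgets(), which the original question mentions. This is only a rough sketch: findVerse() is a made-up name, and the six tab-separated columns (book code, chapter, verse, an empty column, a sequence number, text) are assumed from the sample in the first post.

```php
<?php
// Hypothetical helper: scan a tab-delimited Bible file one line at a time.
// The column layout is an assumption based on the sample in the first post.
function findVerse(string $path, string $bookCode, int $chapter, int $verse): ?string
{
    if (!is_readable($path)) {
        return null; // file missing or unreadable
    }
    $handle = fopen($path, 'r');
    $found = null;
    while (($line = fgets($handle)) !== false) {
        $cols = explode("\t", rtrim($line, "\r\n"));
        if (count($cols) < 6) {
            continue; // skip blank or malformed lines
        }
        if ($cols[0] === $bookCode && (int)$cols[1] === $chapter && (int)$cols[2] === $verse) {
            $found = $cols[5];
            break; // stop reading as soon as the verse is found
        }
    }
    fclose($handle);
    return $found;
}
```

Only one line is held in memory at a time, but a verse near the end of the file still means reading almost the whole file on every request.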

Well, I could easily just read each line and insert it into a database, but I assumed working with files would be a better solution... Would the database be faster, or just more convenient and secure?

 

If a file solution were better, do you think we would be using databases? A database is far more efficient, as you return one value instead of all the values at once, and you can insert, update or delete them with a simple query.

That's a good point...

 

I'll go ahead and try to migrate it to the database, and I'll see how I can go from there...
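
A one-time import along these lines might look roughly like the sketch below. It is not the poster's actual script: the verses table and its column names are placeholders, SQLite through PDO is used only to keep the example self-contained, and the six-column tab-delimited layout is assumed from the sample file.

```php
<?php
// One-time import sketch. Table and column names are placeholders, and the
// six-column tab-delimited layout is assumed from the sample file.
function importVerses(PDO $db, string $path): int
{
    $db->exec('CREATE TABLE IF NOT EXISTS verses (
        book_code TEXT, chapter INTEGER, verse INTEGER, verse_text TEXT
    )');
    $insert = $db->prepare(
        'INSERT INTO verses (book_code, chapter, verse, verse_text) VALUES (?, ?, ?, ?)'
    );

    $count = 0;
    $db->beginTransaction(); // a single transaction keeps a large import fast
    foreach (new SplFileObject($path) as $line) {
        $cols = explode("\t", rtrim($line, "\r\n"));
        if (count($cols) < 6) {
            continue; // skip blank or malformed lines
        }
        $insert->execute([$cols[0], (int)$cols[1], (int)$cols[2], $cols[5]]);
        $count++;
    }
    $db->commit();
    return $count;
}

// Usage (assumes the pdo_sqlite extension is available):
// $db = new PDO('sqlite:bible.db');
// echo importVerses($db, 'eng.txt'), " verses imported\n";
```

With MySQL the same loop should work after swapping the PDO connection string and adjusting the CREATE TABLE statement.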

OK, I was able to import the whole Bible into a database, with each verse in a separate record. Now I want it so that when a user passes a chapter/verse through the URL, it will be fetched from the database. For example, if they type this:

 

index.php?search=Matthew 1:1-10, Mark 1:1, Luke 1:1, John 1:1

 

Let me know if this is a good approach.

 

First I use explode(',', $_GET['search']) and then use foreach at least two more times to loop through the array and split it even further, and in the end I query the database.

 

Is there a simpler method to separate that string? I want it to parse correctly even when it's not in that form. For example, it should work if the user types just index.php?search=Matthew 1:1-10 or Matthew1:1-10,Mark1:1 or even Matthew 1:1-10, 1 John 1:1. Basically, from all those examples I need to split into arrays based on the book (e.g. "John" or "1 John"), the chapter, and then the verse sequence (e.g. "1-10"), on which I will in turn use explode('-', $verses) to get the start verse and the end verse.

 

Regular expressions aren't my strong point, but would it be easier to accomplish this task with regular expressions?

Well, you can't use a literal space in a URL, so your proposal is out of the question as written (you have to use %20 to represent a space). I'm also curious: are you really expecting users to type that information into a URL? (Being a bad speller, I would probably type "Mathew" instead of "Matthew" and get no results.) Are you not going to provide some sort of user interface for the user to get the data they want?
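
As a side note on the %20 point: PHP can do the encoding when a link is built, and values in $_GET arrive already decoded. A small sketch, where buildSearchUrl() and the index.php base are placeholders:

```php
<?php
// Build a link whose query value contains spaces, colons and commas.
function buildSearchUrl(string $base, string $search): string
{
    return $base . '?search=' . rawurlencode($search);
}

$url = buildSearchUrl('index.php', 'Matthew 1:1-10, Mark 1:1');
echo $url, "\n";
// index.php?search=Matthew%201%3A1-10%2C%20Mark%201%3A1

// On the receiving side PHP decodes the value before it reaches $_GET;
// rawurldecode() shows the same round trip by hand.
echo rawurldecode('Matthew%201%3A1-10%2C%20Mark%201%3A1'), "\n";
// Matthew 1:1-10, Mark 1:1
```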

As previously said, it's not a good idea to have %20 in the URL, as it isn't human-readable. It looks like you have a difficult scenario: even with regex, I can't think of an automated way of parsing such a dynamic URL variable. The only way would be to have different URL variables, so you know exactly what you're getting. Ex:

 

index.php?book=matthew1:1-10&smth=mark1:1&smthelse=luke1:1&smthelselse=john1:1

 

Then again, all those variables would be needed only if you had a category > subcategory system. In this case just an id=xx (the id of the phrase) would be enough. If you want to have a search page, make it search the phrases' content. Or if you want someone to search on book numbers, verse numbers etc., add those columns to the db so each phrase is associated with its book and verse number. It may be a bit difficult to populate the db based on your current text file, but it shouldn't be too much of a problem.

Well, what I'm making is for developers. I'm basically making an API that allows developers to retrieve scripture text into their applications, so my application doesn't have a user interface. The reason I have spaces in the URL is that I copied the format from biblegateway. When you type in a search query there, the URL turns into this: http://www.biblegateway.com/passage/?search=matthew%201:1-4,%20luke%207:1-10. I can just eliminate the space (I guess, to make things easier), but what's the best way to split the search string into the arrays that I need? It's a little tricky, because there can be many search queries, not just 3. After every comma, they can type more references to query. In my code, I explode the query on the "," but I'm not sure how to split it further to get the book name, chapter and verses for each of the queries...
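
One possible way to do that second split is a regular expression applied to each comma-separated piece. The sketch below is only an illustration: parseReferences() is a made-up name, and it handles just the "Book chapter:verse" and "Book chapter:verse-verse" shapes from the examples in this thread, with or without spaces, silently skipping anything else.

```php
<?php
// Hypothetical helper: split "Matthew 1:1-10, 1 John 1:1" into structured parts.
// Only the "Book chapter:verse[-verse]" shape is handled; anything else is skipped.
function parseReferences(string $search): array
{
    // Optional leading 1-3 (as in "1 John"), book words, chapter, verse, optional end verse
    $pattern = '/^\s*((?:[1-3]\s*)?[A-Za-z]+(?:\s+[A-Za-z]+)*)\s*(\d+)\s*:\s*(\d+)(?:\s*-\s*(\d+))?\s*$/';
    $refs = [];
    foreach (explode(',', $search) as $part) {
        if (!preg_match($pattern, $part, $m)) {
            continue; // not a reference we recognise
        }
        $from = (int)$m[3];
        $refs[] = [
            // Normalise "1John" and "1  John" to "1 John"
            'book'    => preg_replace('/^([1-3])\s*/', '$1 ', $m[1]),
            'chapter' => (int)$m[2],
            'from'    => $from,
            // A single verse is treated as a range of one
            'to'      => isset($m[4]) ? (int)$m[4] : $from,
        ];
    }
    return $refs;
}

print_r(parseReferences('Matthew1:1-10, 1 John 1:1'));
```

Each entry then carries a book name, a chapter and a start/end verse, which is enough to build one database query per reference. Misspelled book names still need separate handling, for example a lookup table of accepted spellings.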

Here is the structure:

Table 1

id | BookId | Book_Name |

 

Table 2

id | book_id | Chapter | verse_number | verse_text |

 

The second table is where all the verses are stored.
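
With a structure like that, fetching a range of verses could look something like the sketch below. The table names books and verses are placeholders (the post only calls them Table 1 and Table 2), joining verses.book_id to books.BookId is a guess at how the two relate, and fetchVerses() is a made-up name.

```php
<?php
// Sketch of a verse-range lookup against the two tables described above.
// The table names "books" and "verses" are placeholders, and joining
// verses.book_id to books.BookId is an assumption about the schema.
function fetchVerses(PDO $db, string $book, int $chapter, int $first, int $last): array
{
    $sql = 'SELECT v.verse_number, v.verse_text
            FROM verses v
            JOIN books b ON b.BookId = v.book_id
            WHERE LOWER(b.Book_Name) = LOWER(:book)
              AND v.Chapter = :chapter
              AND v.verse_number BETWEEN :first AND :last
            ORDER BY v.verse_number';
    $stmt = $db->prepare($sql); // prepared statement keeps URL input out of the SQL text
    $stmt->execute([
        ':book' => $book, ':chapter' => $chapter, ':first' => $first, ':last' => $last,
    ]);
    // Returns [verse_number => verse_text, ...]
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}

// Usage, given an open PDO connection in $db:
// foreach (fetchVerses($db, 'Matthew', 1, 1, 10) as $no => $text) {
//     echo $no, ' ', $text, "\n";
// }
```

An index on (book_id, Chapter, verse_number) would probably help here, since every lookup filters on those three columns.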

If this is for developers, then you should not need logic to parse values in "unsupported" formats. The developers should simply read the documentation and program their apps accordingly. Just use two different delimiters. For example, you could use a dash to delimit books and commas to delimit "bookname, chapter, verse":

 

www.mydomain.com/index.php?ver=eng&books=genesis,1,5-matthew,1,10

 

In your code you could get the data like so:

<?php

$version = $_GET['ver'];
// Books are separated by dashes
$books = explode('-', $_GET['books']);

foreach ($books as $book) {

  // Each book entry is "bookname,chapter,verse"
  $bookData = explode(',', $book);
  $bookName = $bookData[0];
  $bookChapt = $bookData[1];
  $bookVerse = $bookData[2];

  //Insert code to retrieve data from DB and do something with it
}

?>
