
Breaking long texts into 1000 word segments


GeeDeezy


I have long plain text files that I would like to break into segments of, say, 1000 words, and then insert each segment into a table in a DB, using PHP.

 

I can handle the DB part, but I am at a loss at figuring out how to break a long body of text into shorter segments.

 

In my head, the logic would look like this:

 

Open text file for reading, start at beginning

Loop

Get next 1000 words, put them in a variable.

Create new DB record and insert them into the record in a specific field.

end Loop

Close text file

 

Any help available?

There are some easy solutions, but the problem is what logic you would use to determine what a "word" is. The easiest solution would be to explode() the string on spaces, then use array_chunk() to create sub-arrays of 1,000 elements each and implode() those back together with spaces.

 

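$words = explode(' ', $originalString);
$wordChunks = array_chunk($words, 1000);
foreach ($wordChunks as &$chunk)
{
   // Join each sub-array of words back into a single string
   $chunk = implode(' ', $chunk);
}
unset($chunk); // break the reference left over from the loop

// $wordChunks is now an array of strings of up to 1,000 words each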

 

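But, as stated above, it depends on what you consider a word. Splitting on a single space treats two words separated only by a line break as one word, and consecutive spaces produce empty "words", so the segments may differ from what you expect.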
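
To illustrate the kind of difference meant here, a small sketch (the sample string is made up for illustration and is not from the original files):

```php
<?php
// Hypothetical sample: two spaces between the first two words,
// then a line break instead of a space before the third.
$text = "alpha  beta\ngamma";

$bySpace = explode(' ', $text);

// explode() on a single space yields three elements, but not three words:
// an empty string from the double space, and "beta\ngamma" kept as one piece.
var_dump($bySpace === ['alpha', '', "beta\ngamma"]); // bool(true)
```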

Thanks. I would consider a word to be any set of 1 or more characters that ends with a {space} or {carriage return}.

 

Then that will be more difficult to implement. Splitting on spaces alone would come very close. If your intent is just to break the content into pieces of roughly 1,000 words, I would think that would suffice. But if you want something that splits into exactly 1,000 words according to specific requirements, you will have more work to do. I guess preg_split() with a space or line break as the split expression would be a good option.
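
A minimal sketch of the preg_split() idea, assuming that any run of whitespace (spaces, tabs, line breaks) counts as a single separator. The function name chunkWords() and its $size parameter are made up for illustration. Note that rejoining with single spaces discards the original line breaks inside each segment.

```php
<?php
// Hypothetical helper: split $text into strings of at most $size words.
function chunkWords(string $text, int $size = 1000): array
{
    // \s+ matches one or more whitespace characters, so "\r\n" or a run of
    // spaces acts as one separator. PREG_SPLIT_NO_EMPTY drops the empty
    // strings that leading or trailing whitespace would otherwise produce.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $chunks = [];
    foreach (array_chunk($words, $size) as $chunk) {
        // Rejoin each group of words with single spaces
        $chunks[] = implode(' ', $chunk);
    }
    return $chunks;
}

// Small demonstration with a segment size of 2 instead of 1000
$chunks = chunkWords("one two\nthree  four\r\nfive", 2);
// $chunks is ['one two', 'three four', 'five']
```

For a real file, the text could be loaded with file_get_contents() and each element of the returned array inserted as one DB record. The last segment will usually hold fewer than $size words.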

