
Removing "junk" words from Strings


andrewgarn


The project I am attempting is an automatic PHP system that takes a user-submitted PowerPoint file, opens it, and extracts keywords. These could then be added to a database and made searchable.

 

What I have done so far is open a PowerPoint file in PHP and then:

 

split the data by carriage returns and spaces

removed parts under 5 characters or over 15 characters

removed words that contain characters other than A-Za-z
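
The three steps above can be sketched in PHP roughly as follows. This is a minimal sketch rather than the code actually used: the function name `filterKeywords()` and its `$minLen` / `$maxLen` parameters are hypothetical, and it assumes PHP 7 or later and that the extracted text is already in a string.

```php
<?php
// Hypothetical helper (not from the original post): applies the three
// filtering steps to a block of extracted text and returns the survivors.
function filterKeywords(string $text, int $minLen = 5, int $maxLen = 15): array
{
    // Step 1: split on any run of whitespace (spaces, carriage returns, newlines).
    $parts = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $keep = array_filter($parts, function (string $word) use ($minLen, $maxLen): bool {
        $len = strlen($word);
        // Step 2: drop parts under $minLen or over $maxLen characters.
        if ($len < $minLen || $len > $maxLen) {
            return false;
        }
        // Step 3: drop parts containing anything other than A-Za-z.
        return preg_match('/^[A-Za-z]+$/', $word) === 1;
    });

    // Re-index so the result is a plain list.
    return array_values($keep);
}

$sample = "bgac jwmlnx ufsxt photoshop huqz ab12cd x-ray";
print_r(filterKeywords($sample));
```

Doing the split first and filtering word by word keeps each rule separate, so a further check can be added to the callback later without touching the others.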

 

 

I started off with the original extracted data, which was around 2 MB of text.

 

After doing the steps above, I'm left with 3.8 KB of keywords from inside the PowerPoint file.

 

But I still have "junk" pieces of data:

 

agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp
bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz
jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip
aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc 
tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath 
puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu
irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty 
bnxpuygd vltxj citpj ngwso beysx bexpq

 

Is there any further way to remove these junk pieces?

https://forums.phpfreaks.com/topic/128700-removing-junk-words-from-strings/

Why did you start a new thread?

Last thread = http://www.phpfreaks.com/forums/index.php/topic,221199.0.html

thread before = http://www.phpfreaks.com/forums/index.php?topic=220819.0

 

Personally, I think you're doing this all wrong. Either write a reader that parses the data correctly (which will take a while) or use a DOM as Barand suggested.

However:

 

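<?php
$data = "agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp
bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz
jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip
aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc
tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath
puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu
irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty
bnxpuygd vltxj citpj ngwso beysx bexpq testing ";
// Remove whole words shorter than 5 or longer than 15 characters.
// Note: a [^\w] alternative between the two \b anchors would also match the
// single whitespace character separating two words and glue them together,
// so it is left out here.
$data = preg_replace('/\b(?:\w{1,4}|\w{16,})\b/', '', $data);
// Collapse the whitespace left behind by the removed words.
$data = trim(preg_replace('/\s+/', ' ', $data));
// This removes huqz and bgac; every other word passes the length filter.

?>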

 

Because someone suggested the thread had changed to a different topic and should be changed. Read the last post.

 

And I'm afraid I don't understand COM, nor do I have a Windows server to host the file on.

Archived

This topic is now archived and is closed to further replies.
