I'm working on an automated PHP system that takes a user-submitted PowerPoint file, opens it, and extracts keywords. These can then be added to a database and made searchable.

 

So far I have opened the PowerPoint file in PHP and then:

split the data on carriage returns and spaces

removed tokens under 5 characters or over 15 characters

removed tokens that contain characters other than A-Za-z
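
The steps listed above can be sketched roughly as follows. This is only a minimal illustration, assuming the raw file contents are already in a string; the function name `extractCandidates()` and its parameters are hypothetical and not taken from my actual script.

```php
<?php
// Hypothetical helper: the name and signature are assumptions for illustration.
function extractCandidates(string $raw, int $min = 5, int $max = 15): array
{
    // Split on any run of whitespace (carriage returns, newlines, spaces).
    $tokens = preg_split('/\s+/', $raw, -1, PREG_SPLIT_NO_EMPTY);

    $keep = [];
    foreach ($tokens as $token) {
        $len = strlen($token);
        // Drop tokens under $min or over $max characters.
        if ($len < $min || $len > $max) {
            continue;
        }
        // Drop tokens containing anything other than A-Z or a-z.
        if (preg_match('/^[A-Za-z]+\z/', $token) !== 1) {
            continue;
        }
        $keep[] = $token;
    }
    return $keep;
}

print_r(extractCandidates("bgac photoshop\r\nab12cd testing huqz"));
```

Here `bgac` and `huqz` fail the length check and `ab12cd` fails the letters-only check, so only `photoshop` and `testing` survive.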

 

 

The original extracted data was around 2 MB of text.

After the steps above I'm left with 3.8 KB of keywords from inside the PowerPoint file.

 

But I still have "junk" pieces of data:

 

agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp
bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz
jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip
aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc 
tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath 
puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu
irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty 
bnxpuygd vltxj citpj ngwso beysx bexpq

 

Is there any further way to remove these junk pieces?


Why did you start a new thread?

Last thread = http://www.phpfreaks.com/forums/index.php/topic,221199.0.html

thread before = http://www.phpfreaks.com/forums/index.php?topic=220819.0

 

Personally, I think you're doing this all wrong. Either write a reader that parses the data correctly (which will take a while) or use COM as Barand suggested.

However:

 

<?php
$data = "agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp
bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz
jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip
aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc
tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath
puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu
irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty
bnxpuygd vltxj citpj ngwso beysx bexpq testing ";

// Remove whole words shorter than 5 or longer than 15 characters.
// {1,4} rather than {0,4} avoids empty matches at every word boundary.
$data = preg_replace('/\b(?:\w{1,4}|\w{16,})\b/', '', $data);
// Collapse the whitespace left behind by the removed words.
$data = trim(preg_replace('/\s+/', ' ', $data));

// Only huqz and bgac are removed; every other token passes the length filter.
echo $data;
?>
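
Since a length filter cannot tell `photoshop` from `agilgdyj`, one possible next step is to keep only tokens that appear in a known word list. The sketch below is an assumption rather than something tested against the real PowerPoint data: `keepKnownWords()` and the tiny `$wordList` are hypothetical, and a real list would have to be loaded from a dictionary file and probably extended with product names. It needs PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: keep only tokens that appear in a supplied word list.
function keepKnownWords(array $tokens, array $wordList): array
{
    // Flip the word list into a lookup table so each check is a hash lookup.
    $known = array_flip(array_map('strtolower', $wordList));

    // array_values() re-indexes the result after filtering.
    return array_values(array_filter(
        $tokens,
        fn (string $t): bool => isset($known[strtolower($t)])
    ));
}

// Stand-in word list for illustration only.
$wordList = ['photoshop', 'testing', 'keyword'];
$tokens   = ['agilgdyj', 'photoshop', 'maasp', 'vccmath', 'testing'];

print_r(keepKnownWords($tokens, $wordList));
```

With this stand-in list only `photoshop` and `testing` are kept. The quality of the result depends entirely on the word list, so real words that are missing from it would be thrown away as well.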


Because someone suggested that the previous thread had drifted to a different topic and that a new one should be started. Read the last post there.

 

And I'm afraid I don't understand COM, nor do I have a Windows server to host the file on.
