andrewgarn Posted October 16, 2008
The project I am attempting is an automatic PHP system that takes a user-submitted PowerPoint file, opens it, and extracts keywords. These could then be added to a database and made searchable. What I have done so far: opened a PowerPoint file in PHP; split the data by carriage returns and spaces; removed parts under 5 characters and over 15; and removed words that contain characters other than A-Za-z. The original extracted data was around 2 MB of text. After the steps above I'm left with 3.8 KB of keywords from inside the PowerPoint file. But I still have "junk" pieces of data: agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty bnxpuygd vltxj citpj ngwso beysx bexpq Is there any further way to remove these junk pieces?
https://forums.phpfreaks.com/topic/128700-removing-junk-words-from-strings/
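The filtering steps described in the post above could be sketched roughly as follows. This is only an illustration of those steps, not the poster's actual code: the function name `extractCandidateWords`, its parameters, and the sample string are all assumptions made up for the example.

```php
<?php
// Hypothetical helper (name and defaults are assumptions, not from the thread).
function extractCandidateWords(string $raw, int $minLen = 5, int $maxLen = 15): array
{
    // Split on any run of whitespace, which covers carriage returns and spaces.
    $parts = preg_split('/\s+/', $raw, -1, PREG_SPLIT_NO_EMPTY);

    $kept = [];
    foreach ($parts as $part) {
        $len = strlen($part);
        // Drop parts under $minLen or over $maxLen characters.
        if ($len < $minLen || $len > $maxLen) {
            continue;
        }
        // Drop parts containing anything other than A-Za-z.
        if (!preg_match('/^[A-Za-z]+$/', $part)) {
            continue;
        }
        $kept[] = $part;
    }
    return $kept;
}

// Made-up sample input, only to exercise each rule once.
$sample = "bgac photoshop\r\nhuqz x1y2z3 averyveryverylongtoken lhcls";
print_r(extractCandidateWords($sample));
```

As the sample suggests, length and character checks alone cannot tell "photoshop" from "lhcls": both are 5 to 15 letters long, so both survive.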
prexep Posted October 16, 2008
Is it even possible to find keywords anywhere in the source?
MadTechie Posted October 16, 2008
Why did you start a new thread? Last thread = http://www.phpfreaks.com/forums/index.php/topic,221199.0.html thread before = http://www.phpfreaks.com/forums/index.php?topic=220819.0 Personally, I think you're doing this all wrong. Either write a reader that parses the data correctly (which will take a while) or use a DOM as Barand suggested.
MadTechie Posted October 16, 2008
However:
<?php
// Sample data from the first post, with "testing" appended as a control word.
$data = "agilgdyj lhcls tdmqjx qokkjf xbxan bwpgt cmucj zqbbrp bgac jwmlnx ufsxt photoshop maasp dpkka wpedb kczbb huqz jjbnf jcztpf hqvoi ccpau vljxa pyfjm mhhvb hyfkk rdqhk liwip aezeg ejyow jsuznw kuqsu syktt zctbq upijhvb lbdxn sewsgc tmdec ttjklz lrvht dqypb omqzpng decdj nskxjr mpcvx vccmath puddg laapcr uteqqo svzyo urgvub yvcro ihxjr ebvve rnngz zrudu irtlb cddzo yqxgy ngzro aktpk yahuc ndelz uzqtk olpar aqrty bnxpuygd vltxj citpj ngwso beysx bexpq testing ";
// Remove whole words of 1-4 or 16+ characters, together with any trailing whitespace.
$data = preg_replace('/\b(?:\w{1,4}|\w{16,})\b\s*/', '', $data);
// Removes huqz and bgac; every other word passes the length filter.
?>
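Since a length filter accepts every remaining junk word, one possible next step is to keep only candidates that appear in a list of known words. This is not something proposed in the thread; it is a swapped-in technique, sketched here under assumptions: the helper name `keepKnownWords` and the tiny inline word list are hypothetical, and a real list would have to be loaded from a word file of your choosing.

```php
<?php
// Hypothetical helper: keeps only the candidates found in a known-word list.
function keepKnownWords(array $candidates, array $knownWords): array
{
    // Flip the lowercased list into a lookup table so each check is a key lookup.
    $lookup = array_flip(array_map('strtolower', $knownWords));

    return array_values(array_filter(
        $candidates,
        function (string $word) use ($lookup): bool {
            // Case-insensitive membership test against the word list.
            return isset($lookup[strtolower($word)]);
        }
    ));
}

// Tiny stand-in list for illustration only (an assumption, not real data).
$knownWords = ['photoshop', 'testing', 'powerpoint'];
$candidates = ['agilgdyj', 'lhcls', 'photoshop', 'vccmath', 'testing'];
print_r(keepKnownWords($candidates, $knownWords));
```

The trade-off is that a general-purpose dictionary may not contain product names or domain terms such as "photoshop", so the list would probably need to be extended by hand.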
andrewgarn (Author) Posted October 16, 2008
MadTechie wrote: "Why did you start a new thread? Last thread = http://www.phpfreaks.com/forums/index.php/topic,221199.0.html thread before = http://www.phpfreaks.com/forums/index.php?topic=220819.0 Personally, I think you're doing this all wrong. Either write a reader that parses the data correctly (which will take a while) or use a DOM as Barand suggested."
Because someone suggested that the thread had drifted to a different topic and that a new one should be started; read the last post there. And I'm afraid I don't understand COM, and I don't have a Windows server to host the file on.