
[SOLVED] Remove words from String under/over a certain length?



It's an attempt to strip out the junk and be left with plain words after opening a .ppt file in PHP.

 

4kHgWSvfbs5lGv4skT4QPlxEiW1Z1cQK4vxuPKZfContentTypes.xmlPK1arels.relsPKGgLdrsshapexml.xmlPKvdrsdownrev.xmlPKLg DEnrico Andy EEE  WNff3PPT10eD @B uphP xv0e0eRectangle 4PKZfContentTypes.xmlMO 2WRcJF0iKvLw 9uSq:wGi KIc oVjTMRc042CMPka8DkHbL8e iKXN6rco4y@oPK1arels.relsj0qCNoK ILcXm0XFo0xMeXIN4aG2RKIZ 4M9ctBm:f@3nOrjxR0T0@WBL5vPKOdrsshapexml.xmlTn1G4nPUQSKiw@63bLI7nc.d0Z5d E5peE8j4.3Ulec5 DJ Wbhocrv7u EAeVr::aam@9@FaVGpA61j3x: V3K8str.xN SICv02  ZIFP 6zCmJyI0TZRUwiI52p0fD.ZS EJCRURyn3y4ljM qDNq9FgKIm h2QRlnNLTxUCgZQqjv75olRgzyf0l4KlmzTGJG:5zP JPjBjh nA7 PKdrsdownrev.xmlDK0Cx1tFmloBz DC6JQVpRiAJBs 7 qo32n:kocSywYKglO tPKZfContentTypes.xmlPK1arels.relsPKOdrsshapexml.xmlPKtdrsdownrev.xmlPKd0 g EBusiness ECommerce R x0e0eRectangle 5  @Yg  A narrow view of ebusiness is selling over the web is etailing Most web surfers have bought online UK: 76 Online retail sales continue to grow rapidly annual growth 2003: 51 2004: 24 2005: 22 2006: 33.4 2007: 54 2008: 28 expected and faster than traditional retail 45 growth Offline sales are influenced by the web and vice versa An alternative allembracing vision is: The use of Internet technologies eg the web to support the core activities of businesses and organisationsVcmcmA WNff3PPT10u. D @B  0 D0e0eRectangle 2d0 g GTypes of eBusiness x@H0e0eRectangle 3  @Yg B2C retailers content providers portals social networks etc. B2B eprocurement exchanges etc C2C auctions classifieds e.g. eBay P2P Kazaa BitTorrent MCommerce 3G WiFi iPhone Blackberry  WNff380PPT10.j @ xJ0e0eRectangle 2PKZfContentTypes.xmlMO 2WRcJF0iKvLw 9uSq:wGi KIc oVjTMRc042CMPka8DkHbL8e iKXN6rco4y@oPK1arels.relsj0qCNoK ILcXm0XFo0xMeXIN4aG2RKIZ 4M9ctBm:f@3nOrjxR0T0@WBL5vPK2drsshapexml.xmlTn1GmnPUQG4MxmcM 

 

If you know of any other way of doing it...

 

The code before it is:

 

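<?php
$myFile = "test.ppt";
// Open in binary mode: a .ppt file is not plain text.
$fh = fopen($myFile, 'rb');
$data = fread($fh, filesize($myFile));
fclose($fh);
// Keep only letters, digits, whitespace and the characters . @ :
$newdata = preg_replace("/[^a-zA-Z0-9\s.@:]/", "", $data);
echo $newdata;
?>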

If you want to remove words shorter than 5 letters or longer than 15 letters from a file that contains many words, like a document, I'd split the file contents (or the variable) with preg_split, which turns it into an array (like $split[0][0]). Then use a foreach loop to run through the array and take out the words with fewer than 5 or more than 15 characters, using strlen.
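The split-and-filter idea above could look roughly like this. It is only a sketch: the function name filter_words_by_length is made up for illustration, and treating 5 and 15 as inclusive limits is an assumption, since the post does not say whether the boundaries themselves should be kept.

```php
<?php
// Hypothetical helper: keep only words whose length is between $min and $max.
// Assumption: both limits are inclusive.
function filter_words_by_length(string $text, int $min = 5, int $max = 15): array
{
    // Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $kept = [];
    foreach ($words as $word) {
        $length = strlen($word);
        if ($length >= $min && $length <= $max) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$sample = "PK drsshapexml.xmlTn1G4nPUQSKiw narrow view of ebusiness";
echo implode(' ', filter_words_by_length($sample)), "\n";
```

A length filter on its own cannot tell a real word from junk of the same length, so some noise of 5 to 15 characters will still get through.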

Right I still have data like this:

 

sRij@W@HFUn
IDATxyxUwMro@I
5oB @hEqw0L
4M3Ne2 O0CCm3v
o:y:XqrcisX
Xi TlH DOTm
qbN:lun I@J
xeYzAbhb
TnLXTtE
jPSVUnrpL5Ntx
21Aest
2XpgKSKi9.pSG
quoqfq
D ye2oQ4
CP0 F yw
iLG@J. PH5Ns
7z4s@5ZBP
Yf fNG
3AHiMx.I
GB:yM6lZIGREN
G0SE B4vDmzI
J lGvm
xNSAH 8DBZHX8
upXpH4@
4J04I:gKFL
@q bdl9M
t1sLlBcOo
3       0

 

Any suggestions on removing it?

 

Should I have decoded the file somehow first on opening?

I have this now:

 

Not very efficient I know but here it is:

 

<?php
$myFile = "test.ppt";
$fh = fopen($myFile, 'r');
$data = fread($fh, filesize($myFile));
fclose($fh);
//echo $theData;
$data = utf8_decode($data);
//$newdata = preg_replace("/[^a-zA-Z0-9\s.@:]/", "", $data);
//echo $newdata;
//$new = preg_split('/ /', $newdata, -1, PREG_SPLIT_OFFSET_CAPTURE);

$pieces = explode(" ", $data);
//$pieces = preg_split("[/s]", $newdata);

//echo '1test1'.$pieces[0].'2test2';
$count = count($pieces);
//echo 'Result: '.$count;

$output = '';

$i = 0;

while($i < $count) {

// strpos() returns 0 for a match at position 0, so compare with === false.
if(strlen($pieces[$i]) > 4 && strlen($pieces[$i]) < 15 && strpos($pieces[$i], "@") === false && strpos($pieces[$i], ".") === false && strpos($pieces[$i], ":") === false) {
	$output = $output.' '.$pieces[$i];
	//echo $i.'<br>'.$pieces[$i];	
}
$i++;
}


//echo $output;
$newoutput = preg_replace("/[^a-zA-Z0-9\s.@:]/", "", $output);
//echo $newoutput;

$newpieces = explode(" ", $newoutput);
$count = count($newpieces);
$output2 = 'test';
$a = 0;
while($a < $count) {

if(strlen($newpieces[$a]) > 4 && strlen($newpieces[$a]) < 15) {
	$output2 = $output2.' '.$newpieces[$a];
	//echo $i.'<br>'.$pieces[$i];	
}
$a++;
}

echo $output2;

 

Output:

 

test bdbb2 IGsbE U4y 19 HP4S7 fFNYF QqOK1 Ssk6 IDATw KII1 8Ys6 H QEYV3 K0u G tdeSD i2WZK 8 IDATRPt xoKd6 b2IaM L TRhT Yne2F Click Master title Master styles Second level Third level Fourth level Fifth EBusiness bg1lt1 tx1dk1 bg2lt2 tx2dk2 hlinkhlink CwfP  Techniques Click Master title Master styles Second level Third level Fourth level Fifth bg1lt1 tx1dk1 bg2lt2 tx2dk2 hlinkhlink Master styles Second level Third level Fourth level Fifth 18AaR bg1lt1 tx1dk1 bg2lt2 tx2dk2 hlinkhlink 18AaR bg1lt1 tx1dk1 bg2lt2 tx2dk2 hlinkhlink EBusiness Gerding Gravell narrow bought retail sales faster retail sales vision retailers content providers portals social networks exchanges auctions classifieds eBay P2P Kazaa iPhone Objectives Learning understanding EBusiness theoretical issues practical use Covers Models EBusiness Development using Business Interchange Services Mobile Commerce Digital Signatures Electronic Payment Protocols Recommender Systems Smart Cards Software Agents Software Negotiation Computational markets auctions important techniques issues designing building modeling Ebusiness relevant technologies smart cards electronic payment coursework assignment instructions November 2008 85 hours answer three questions selection cover technical design implementation assuming already about relational databases networks distributed systems programming Outline Course Introduction Enrico Gerding Scriptin

The data is mostly clean now :) There's just a bit of junk at the beginning. Also, a few words are missing from the text. Any idea why?

 

test bdbb2 IGsbE U4y 19 HP4S7 fFNYF QqOK1 Ssk6 IDATw KII1 8Ys6 H QEYV3 K0u G tdeSD i2WZK 8 IDATRPt xoKd6 b2IaM L TRhT Yne2F Click Master title styles Second level Third level Fourth level Fifth EBusiness bg1lt1 tx1dk1 bg2lt2 tx2dk2 hlinkhlink CwfP  Techniques 18AaR Gerding Gravell narrow bought retail sales faster vision retailers content providers portals social networks exchanges auctions classifieds eBay P2P Kazaa iPhone Objectives Learning understanding theoretical issues practical use Covers Models Development using Business Interchange Services Mobile Commerce Digital Signatures Electronic Payment Protocols Recommender Systems Smart Cards Software Agents Negotiation Computational markets important techniques designing building modeling
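One possible cause of the missing words, assuming the loop skips anything it has already output by searching the output string with strpos (as the second loop posted further down does): strpos matches substrings, so a word is also dropped when it merely appears inside a longer word that was kept earlier. The sketch below compares that check with a whole-word check. Both function names are made up for illustration, and this may not be the only reason words go missing here.

```php
<?php
// Duplicate check by substring search, as in a strpos-based filter.
function dedupe_with_strpos(array $words): array
{
    $output = '';
    foreach ($words as $word) {
        // strpos finds $word anywhere in $output, even inside a longer word.
        if (strpos($output, $word) === false) {
            $output .= ' ' . $word;
        }
    }
    return explode(' ', trim($output));
}

// Duplicate check on whole words only; the first occurrence is kept.
function dedupe_whole_words(array $words): array
{
    return array_values(array_unique($words));
}

$words = ['retailers', 'retail', 'sales', 'retail'];
echo implode(' ', dedupe_with_strpos($words)), "\n"; // "retail" is lost
echo implode(' ', dedupe_whole_words($words)), "\n"; // "retail" is kept once
```

Words of 4 characters or fewer are also dropped by the strlen check itself, so short words such as "view" or "web" never reach the output in the first place.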

As long as it works for you. I told you I wasn't the best at preg =p and I think my expression was wrong.

 

I stuck with explode, though I had to do it twice, and I still have array elements (data[]) with spaces in them. Why? I don't understand.

That's intentional; I want that space, or all the words are joined together in the output.

 

What I mean is on the second explode:

 

<?php
$newpieces = explode(" ", $newoutput);
$count = count($newpieces);
$output2 = '';
$a = 0;
while($a < $count) {

if(strlen($newpieces[$a]) > 4 && strlen($newpieces[$a]) < 15 && strpos("$output2","$newpieces[$a]") == FALSE) {
	$newpieces[$a] = strtolower($newpieces[$a]);
	$output2 = $output2.' '.$newpieces[$a];
	//echo '<br>'.$a.$output2;
	echo '<br>$newpieces['.$a.'] = '.$newpieces[$a].'<br>';	
}
$a++;
}

 

Look at these:

 

$newpieces[257] = styles second

$newpieces[258] = level third

$newpieces[259] = level fourth

$newpieces[260] = level fifth

 

Why have those not been split?

Changing the line to this: $output2 = $output2.''.$newpieces[$a];

 

 

gives me this as an output:

 

Output is: avxdsfdbb2vqme0igsbeu4y 19rgqslza9f6 qqok1ri 4 ccci2idat1idatwkii1fdyllqeyv3htdesdri2zk 898s9er10a x1fxhirvuk1fwva8jb2iaml trhtp9uk0jaktenynfe2fclickmastertitlemasterstyles secondlevel thirdlevel fourthlevel fifthebusinessbg1lt1tx1dk1bg2lt2tx2dk2hlinkhlinktechniquesclickmastermasterstyles secondlevel thirdlevel fourthlevel fifthmasterstyles secondlevel thirdlevel fourthlevel fifthebusinessgerdinggravellnarrowboughtretailsalesfastervisionretailerscontentprovidersportalssocialnetworksexchanges auctionsclassifiedsebay p2p kazaaiphoneobjectiveslearningunderstandingebusinesstheoreticalissuespracticaluse 

<br><br>$newpieces[278] = master<br><br>$newpieces[279] = styles
second<br><br>$newpieces[280] = level
third<br><br>$newpieces[281] = level
fourth<br><br>$newpieces[282] = level
fifth<br><br>$newpieces[288] = master<br><br>$newpieces[289] = styles
second

 

That means the problem is a hidden \n, right?
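A hidden line break would fit what the output shows: explode(" ") only splits on the space character, so a "\n" between two words stays inside one piece. Here is a small sketch to check this, using a made-up sample string in place of the real data. It prints each piece with control characters escaped, then splits on any whitespace instead.

```php
<?php
// Made-up sample standing in for the cleaned-up .ppt text.
$text = "master styles\nsecond level\nthird level";

// explode(" ") splits on spaces only, so "\n" stays inside the pieces.
$bySpace = explode(" ", $text);

// addcslashes() with the range \0..\37 makes control characters visible.
foreach ($bySpace as $i => $piece) {
    echo $i, ' => ', addcslashes($piece, "\0..\37"), "\n";
}

// Splitting on \s+ treats spaces, tabs, \r and \n alike.
$byWhitespace = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
echo implode(' | ', $byWhitespace), "\n";
```

If the escaped pieces show \r or \n inside them, swapping both explode(" ") calls for the preg_split call should split those words apart.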
