
Regex query - stripping away unwanted text


iLuke


Hi,

 

I'm trying to use regular expressions to strip out a heap of text from a string, leaving just the data I want. The problem I'm having is that the data I want may not necessarily be the same each time, and may be different lengths, so I can't just tell the code to loop from a certain start position to a set end position.

 

What I have so far is this:

$data = array();
$data[0] = "/^Your Chief of Intelligence dispatches [0-9]+ spies to attempt to sabotage [0-9]+ of [a-zA-Z-0-9\_\-]\'s weapons of $/" ;
$data[1] = "/^type [A-Z]{1}[a-zA-Z0-9]+\.$/";
$data[2] = "/^Your spies successfully enter [a-zA-Z-0-9\_\-]\'s armory undetected\, and$/";
$data[3] = "/^destroy [0-9]+ of the enemy\'s [A-Z]{1}[a-zA-Z0-9]+ stockpile\. Your spies all return safely to your camp\.$/";
$replacements = array();
$replacements[0] = "/^[a-zA-Z-0-9\_\-]$/";
$replacements[1] = "/^[A-Z]{1}[a-zA-Z0-9]$/";
$replacements[2] = "/^[a-zA-Z-0-9\_\-]$/";
$replacements[3] = "/^[0-9]+$/";
echo preg_replace($data, $replacements, $sabData);

 

Typical data for this could be as follows:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield stockpile. Your spies all return safely to your camp.

 

Admittedly, it's my first proper attempt at using regex and it's more than a little confusing!

 

Essentially I am trying to pull the 4 bits of data (though 0 and 2 are the same technically) to use in my script later on.

 

Currently the output for this is just the same as the data you enter through the form.
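
A small sketch of why the output comes back unchanged, assuming the sample sentence from this post. Each pattern is anchored with ^ and $, so it would have to match the entire input, and preg_replace returns the subject untouched when nothing matches. The second argument of preg_replace is also plain replacement text, not another pattern. The names $sample and $anchored are only illustrative.

```php
<?php
// Sample input taken from the post above.
$sample = "Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield stockpile. Your spies all return safely to your camp.";

// Anchored with ^ and $, so the pattern must match the whole string.
$anchored = "/^type [A-Z][a-zA-Z0-9]+\.$/";

// No match, so preg_replace hands back the subject unchanged.
$result = preg_replace($anchored, "X", $sample);
var_dump($result === $sample);   // bool(true)

// Without the anchors the same fragment is found inside the sentence.
$found = preg_match("/type ([A-Z][a-zA-Z0-9]+)\./", $sample, $m);
var_dump($found);                // int(1)
echo $m[1], "\n";                // Shield
```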

I've been searching google for about an hour now, and I usually find the answers on this very forum! But today is an exception it seems. I've had a think about how I would do this conceptually, and I think I'm on the right lines, but there aren't any particularly helpful or clear regex tutorials on this type of thing out there =[

 

(That and PHP.net's own documentation tends to just scare me :D - but I read that too, to no avail)

 

Clueless and newbie I am! =[

 

Thanks,

Luke.

 

Link to comment
Share on other sites

Yeah, I had a look at that, but it's just that I want to pull more than one thing from the string, all of which are at different locations, and those locations can vary, so I didn't think preg_match would do it?

 

I had a look at preg_split, and it seemed like that might do it, but I think it'd put all of the values together, so I couldn't call them separately.

Link to comment
Share on other sites

Your pattern match is ok, but your replacement is foul.

I think you should use preg_match instead, to pull in the variables.

 

$data = "/.*? dispatches ([0-9]+) .*? sabotage ([0-9]+) of ([a-zA-Z-0-9\_\-]+)\'s .*" ;
$data .="type ([A-Z]{1}[a-zA-Z0-9]+)\.";
$data .= ".* enter ([a-zA-Z-0-9\_\-]+)\'s .*";
$data .= "destroy ([0-9]+) of the enemy\'s ([A-Z]{1}[a-zA-Z0-9]+)/";
$str="    Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield stockpile. Your spies all return safely to your camp.";
preg_match($data,$str,$intel);
var_dump($intel);

 

which should return an array of the items ya want extracted (element 0 of the array is your string, so that can be discarded)
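
As a variation on the code above, a hedged sketch using named capture groups, so the values can be read by name instead of by position. The group names (spies, amount, target, weapon) are made up for illustration, and the pattern is shortened to the first sentence only.

```php
<?php
$str = "Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield stockpile. Your spies all return safely to your camp.";

// (?P<name>...) stores each capture under a readable key as well as a number.
$pattern = "/dispatches (?P<spies>[0-9]+) .*? sabotage (?P<amount>[0-9]+) "
         . "of (?P<target>[a-zA-Z0-9_-]+)'s .*?type (?P<weapon>[A-Z][a-zA-Z0-9]+)\./";

if (preg_match($pattern, $str, $intel)) {
    echo $intel['spies'], "\n";    // 1
    echo $intel['amount'], "\n";   // 3
    echo $intel['target'], "\n";   // cain536
    echo $intel['weapon'], "\n";   // Shield
}
```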

 

 

Link to comment
Share on other sites

Thanks for the help!

 

I put that in, and it's echoing out:

 

"array(0) { } Array"

 

I'm really a bit of an amateur when it comes to regex - the syntax is quite tough to get the hang of, so thanks for helping me there =]

 

Unsure whether I needed to do anything else with the data after the code you put, but I just did:

$data = "/.*? dispatches ([0-9]+) .*? sabotage ([0-9]+) of ([a-zA-Z-0-9\_\-]+)\'s .*" ;
$data .="type ([A-Z]{1}[a-zA-Z0-9]+)\.";
$data .= ".* enter ([a-zA-Z-0-9\_\-]+)\'s .*";
$data .= "destroy ([0-9]+) of the enemy\'s ([A-Z]{1}[a-zA-Z0-9]+)/";
preg_match($data,$sabData,$intel);
var_dump($intel);
echo $intel;

 

To get the array stuff echoing. Oh, and $sabData is just the $_POST from the form where the user would paste in the string, which would be similar to the typical data in my first post.

 

Thanks again for your time! I honestly was stumped by this - it doesn't seem like an easy thing to do.

Link to comment
Share on other sites

regexp's usually take me ages to do, and i have to do them inside the code testing as i go too. i haven't done the time required to get a decent memory of all the syntax. sorry i couldnt be more helpful on this one. i knew preg_match was more suited to what you were doing but.

Link to comment
Share on other sites

$intel is an array. just depends how many groups you have. they are numbered from 0 to X (x is amount of groups in the pattern)

for the example here it would return an array like:

$intel[0] Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield

$intel[1] 1

$intel[2] 3

$intel[3] cain536

$intel[4] Shield

$intel[5] cain536

$intel[6] 3

$intel[7] Shield

 

This is the reason I used var_dump (print_r would have worked as well)

 

Link to comment
Share on other sites

But when I printed say $intel[2] I just got the string.. in this case I did:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD\'s weapons of type Lookout Tower. Your spies successfully enter Remco-MOD\'s armory undetected, and destroy 16 of the enemy\'s Lookout Tower stockpile. Your spies all return safely to your camp.

 

And this is what $intel[2] printed:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD\'s weapons of type Lookout Tower. Your spies successfully enter Remco-MOD\'s armory undetected, and destroy 16 of the enemy\'s Lookout Tower stockpile. Your spies all return safely to your camp.

 

Just seems to \ out the ' and that's all.

 

Sorry if I'm being a total newbie and missing something =[

The code for that is:

$data = "/.*? dispatches ([0-9]+) .*? sabotage ([0-9]+) of ([a-zA-Z-0-9\_\-]+)\'s .*" ;
$data .="type ([A-Z]{1}[a-zA-Z0-9]+)\.";
$data .= ".* enter ([a-zA-Z-0-9\_\-]+)\'s .*";
$data .= "destroy ([0-9]+) of the enemy\'s ([A-Z]{1}[a-zA-Z0-9]+)/";
preg_match($data,$sabData,$intel);
var_dump($intel);
echo $intel[2];

 

Sorry to be a pain =[ I wish I could figure it out myself, but I just can't seem to figure out what to do there =[

 

Thank you so much for your time thus far!

Luke

 

 

 

Link to comment
Share on other sites

I still get the following output, using only var_dump:

# php pregvars.php

array(8) {
  [0]=>
  string(206) "    Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 3 of cain536's weapons of type Shield. Your spies successfully enter cain536's armory undetected, and destroy 3 of the enemy's Shield"
  [1]=>
  string(1) "1"
  [2]=>
  string(1) "3"
  [3]=>
  string(7) "cain536"
  [4]=>
  string(6) "Shield"
  [5]=>
  string(7) "cain536"
  [6]=>
  string(1) "3"
  [7]=>
  string(6) "Shield"
}

 

$intel contains a total of 8 array elements, with element 0 being the entire matched string.

But if your original string has a \ in front of the ', then that's what is causing the failure. Usually \' is added when you escape string items for insertion into your db.

You can fix that by either using str_replace, or modifying the preg pattern to include the \'.

That would look like \\\' instead of \'. The backslash in preg means the next character is a literal character, not to be confused with a metacharacter.

 

Link to comment
Share on other sites

Hmm, it's strange because even with all the \ taken away from the ' e.g in Username's (was Username\'s in the code),

 

when I submit the form and echo out $intel[2], I am still getting the full string with the slashes in there.. even though they're no longer part of the regex, which means it must be putting them in there when it's submitting the form... The strange thing is, though, there's no database involved. At present, all it's doing is checking if the form was submitted, and if so, running that code to split it into an array, so I can't see why it's picking up those \... unless it does that on form submit, in which case won't stripslashes work?

 

Hmm, very odd stuff... sorry to keep coming back to you with this, I'm just at a total loss as to how it's managing to put stuff into the text after the user submits it, when the code isn't telling it to lol!
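
Backslashes appearing on form submit with no database involved is consistent with PHP's old magic quotes setting (magic_quotes_gpc, removed in PHP 5.4), though that is an assumption about this server. A minimal sketch, with the slashed input hard-coded to simulate what the form would deliver, showing that stripslashes lets an unescaped pattern match again:

```php
<?php
// Simulated form input as it would arrive with magic quotes enabled
// (an assumption; the value is hard-coded for illustration).
$submitted = "sabotage 16 of Remco-MOD\\'s weapons of type Lookout Tower.";

$pattern = "/of ([a-zA-Z0-9_-]+)'s weapons/";

// The backslash sits between the name and the quote, so nothing matches.
$before = preg_match($pattern, $submitted);
var_dump($before);   // int(0)

// stripslashes removes the added backslash before matching.
$clean = stripslashes($submitted);
$after = preg_match($pattern, $clean, $m);
var_dump($after);    // int(1)
echo $m[1], "\n";    // Remco-MOD
```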

Link to comment
Share on other sites

Okay... that's officially bizarre! Used strip slashes and it still outputs this:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD\'s weapons of type Lookout Tower. Your spies successfully enter Remco-MOD\'s armory undetected, and destroy 16 of the enemy\'s Lookout Tower stockpile. Your spies all return safely to your camp.

 

When I enter:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower.

Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

 

It's mad.. dunno how it's doing that!

I mean, according to the array info you posted, array element 2 would be 16 in this case, and instead it's returning the whole string plus those few backslashes. Gah!

 

Thanks again anyway, this is tough stuff!

Link to comment
Share on other sites

Right, done a little more on it, and now it's giving this error:

 

array(0) { }

Notice: Undefined offset: 0 in [LOCATION] on line 67

 

With the code:

if (isset($_POST['submitX2'])) {
    $data = "/.*? dispatches ([0-9]+) .*? sabotage ([0-9]+) of ([a-zA-Z-0-9\_\-]+)'s .*" ;
    $data .="type ([A-Z]{1}[a-zA-Z0-9]+)\.";
    $data .= ".* enter ([a-zA-Z-0-9\_\-]+)'s .*";
    $data .= "destroy ([0-9]+) of the enemy's ([A-Z]{1}[a-zA-Z0-9]+)/";
    preg_match($data,$sabData,$intel);
    var_dump($intel);
    echo $intel[0];
}

 

And the input data in SubmitX2 =

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower.

Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

 

At least it's giving an error now! :D

 

Link to comment
Share on other sites

Undefined offset: 0 in [LOCATION] on line 67

 

I assume line 67 is: echo $intel[0];

 

This probably means the preg_match is failing to find the data you want, I am unsure whether preg_match will still set the first index of the matches array if it fails to find any matches to your regexp in the string. Have you changed the regexp a bit?
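
On the question of what preg_match does when nothing matches: it returns 0 and leaves the matches array empty, so index 0 is not set either. A short sketch with a made-up input string, checking the return value before reading the array:

```php
<?php
$pattern = "/dispatches ([0-9]+) spies/";

// preg_match returns 1 on a match, 0 on no match, and false on error.
$found = preg_match($pattern, "nothing relevant here", $intel);

var_dump($found);          // int(0)
var_dump(count($intel));   // int(0): reading $intel[0] here raises "Undefined offset"

if ($found === 1) {
    echo $intel[1], "\n";
} else {
    echo "No match; check the pattern against the input.\n";
}
```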

Link to comment
Share on other sites

Nah, I left the regex in there - but I did that bit, so I've probably done something wrong. laffin did tinker with it a bit though, to get the syntax right.

 

But is regex smart enough to know that in the string I want to find, if I put say [0-9] (as well as the other stuff I don't want) and then in the string I want to be left with e.g. just [0-9].. does regex realise that it's the same [0-9] from the original string, or does it just interpret that as being any string of numbers from zero to nine?

 

laffin said that the array was working for him though, and it was actually echoing out the values that I want, but when I use it, it doesn't seem to work - even though I've not changed anything.

Link to comment
Share on other sites

I'm not sure what you're explaining. The parts of the regexp that are in brackets: ( ) become the data that goes into the matches array. If you are putting ([0-9]) for example, it will find any single character from 0 to 9. I'm not sure if it will find 11, 12, 13 etc. though. When I want digits to be found I usually use (\d*) (any number of digit characters).

 

I don't think this is what you're asking, but I don't think I understand your question.
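
A quick sketch of the difference between the three forms mentioned here, using a fragment of the sample text: [0-9] on its own matches exactly one digit, [0-9]+ matches one or more, and \d* matches zero or more.

```php
<?php
$text = "sabotage 16 of";

preg_match("/sabotage ([0-9])/", $text, $one);    // a single digit
preg_match("/sabotage ([0-9]+)/", $text, $many);  // one or more digits
preg_match("/sabotage (\d*)/", $text, $any);      // zero or more digits

echo $one[1], "\n";    // 1
echo $many[1], "\n";   // 16
echo $any[1], "\n";    // 16

// \d* also succeeds when there are no digits at all, capturing an empty string.
preg_match("/sabotage (\d*)/", "sabotage none", $empty);
var_dump($empty[1]);   // string(0) ""
```

Because \d* can match nothing, [0-9]+ (or \d+) is usually the safer choice when a number is required.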

Link to comment
Share on other sites

Yeah, you answered what I was trying to ask anyway lol. I didn't explain too well :D

 

I just wish I could at least try to help myself a little with this, but I've got no idea why it won't work.

 

I googled the error I posted, and apparently that's caused because there's nothing stored in the array, which means that the preg_match is failing, but since I'm actually terrible with the regex syntax, I can't see what's wrong with it =[

 

Sorry to be a pain! .. I've never done regex before.. I've always just copied/pasted a regex string from the net before lol. But this is a little out of the ordinary and needed custom regex.

Link to comment
Share on other sites

when i first started trying regexps, i found it easier to start with a simple regex, get it working, understand how it is working and then build on that piece by piece to get what you want. you will probably never understand regexps unless you 1. aren't afraid to just give something a go and see what happens and 2. read the manual sections on them about 20 times over as you try things. i cannot stress enough how good the php manual is for learning things. originally i learnt about regexps from perl and the man page on them for perl was really good too. php's preg_ functions and perl's regexps are practically the same.

Link to comment
Share on other sites

Well I did some more work on it, and I've almost got it working!

 

if (isset($_POST['submitX1']) || isset($_POST['submitX2'])) {
    // Remove the backslashes that were added to the submitted text
    $sabData = stripslashes($sabData);
    $data = "/.*? dispatches ([0-9]+) .*? sabotage ([0-9]+) of ([a-zA-Z-0-9\_\-]+)'s .*" ;
    $data .="([A-Z]{1}[a-zA-Z0-9]+) ([A-Z]{1}[a-zA-Z0-9]*)\.";
    $data .= ".* enter ([a-zA-Z-0-9\_\-]+)'s .*";
    $data .= "destroy ([0-9]+) of the enemy's ([A-Z]{1}[a-zA-Z0-9]+)/";
    preg_match($data,$sabData,$intel);
    if (isset($_POST['submitX1'])) {
        // $intel[7] is the number of weapons destroyed
        $totalSabbed = (1000000 * $intel[7]);
        echo "<h1> Sabotage Successfully Logged!</h1>";
        echo "<h3> View sab report below: </h3>";
        echo "<b>Target:</b> " . $intel[3];
        echo "<br />";
        echo "<b>Weapon Sabbed:</b> " . $intel[4] . " " . $intel[5];
        echo "<br />";
        echo "<b>Amount Sabbed:</b> " . $intel[7];
        echo "<br />";
        echo "<b>Total Sab Value:</b> " . $totalSabbed;
    }
}

 

It works now IF I remove the line break from the original string after the first instance of the weapon type.

I know regex has ways of handling line breaks, and I tried putting in /n \n \/n and even  /\n lol, and none of them worked.

 

Also, while I'm sure the little problem described above is just my lack of experience in regex, this one is a little more complex I think... it breaks if I put in Shield as the weapon type, as it only has one word. I dunno why this is, though, as I put a * instead of a + in the code above.. which means zero or more right? So effectively the string reads: One word, beginning with a capital letter, followed by zero or more characters beginning with another capital letter. Dunno why that won't work though?
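
One likely reason, sketched below: in ([A-Z]{1}[a-zA-Z0-9]*) the * applies only to the final character class, so the space and the capital letter before it are still required. Wrapping the space and the whole second word in an optional group is one way around that. The pattern here is a simplified illustration, not the full pattern from the post.

```php
<?php
// "(?: ...)?" makes the space and the whole second word optional.
$pattern = "/type ([A-Z][a-zA-Z0-9]+)(?: ([A-Z][a-zA-Z0-9]*))?\./";

preg_match($pattern, "weapons of type Lookout Tower.", $two);
preg_match($pattern, "weapons of type Shield.", $one);

echo $two[1], " ", $two[2], "\n";   // Lookout Tower
echo $one[1], "\n";                 // Shield
var_dump(isset($one[2]));           // bool(false): the optional group did not take part
```

Since index 2 is missing for one-word weapons, any code reading it needs an isset check first.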

 

Anyway! Thanks again for your help on this! I'm getting there.. slowly :D I actually am understanding the string now and what it's all doing, which is how I've managed to figure out why it was breaking before.

 

Oh, and I got rid of the slashes it was adding just using stripslashes... though that didn't work before, but I'm sure not complaining!

Link to comment
Share on other sites

Okay, now I'm stuck again. Same problem, but this is odd.

 

The raw string I input does have a linebreak after "weapons of type Lookout Tower.", when I paste it into the text box. However, when I echo it out after the form is submitted, that linebreak disappears, and the regex breaks (lots of undefined offset errors).

 

What's strange, though, is that if I manually delete the linebreak from the text box, the code works... even though the string that is being echoed out is exactly the same as the one that was echoed when the linebreak was left in... so how can the regex possibly break?

 

E.g.

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower.

Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

 

Echoes:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower. Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

And the regex breaks (Undefined offset: 3 in.. etc)

 

Whereas the string:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower. Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

 

Echoes:

Your Chief of Intelligence dispatches 1 spies to attempt to sabotage 16 of Remco-MOD's weapons of type Lookout Tower. Your spies successfully enter Remco-MOD's armory undetected, and destroy 16 of the enemy's Lookout Tower stockpile. Your spies all return safely to your camp.

 

.... And the regex works! But the string that is held in the $sabData variable (which is what is echoed out both times) is exactly the same! So I can't see why it'd break!
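
A possible explanation, sketched under assumptions: the two strings only look identical because a browser renders a line break in HTML as a space, while the variable still contains the break. By default the regex dot does not match a newline, so .* cannot cross it. The s modifier (the "/s" mentioned later in the thread) lets the dot match newlines too. The "\r\n" below is an assumed textarea line ending, and the patterns are shortened for illustration.

```php
<?php
// A textarea typically submits line breaks as "\r\n" (assumed here).
$sabData = "weapons of type Lookout Tower.\r\nYour spies successfully enter Remco-MOD's armory";

$noFlag = "/type ([A-Za-z ]+)\..* enter ([a-zA-Z0-9_-]+)'s/";
$dotAll = "/type ([A-Za-z ]+)\..* enter ([a-zA-Z0-9_-]+)'s/s";

$plain = preg_match($noFlag, $sabData);
var_dump($plain);                  // int(0): "." stops at the line break

$withS = preg_match($dotAll, $sabData, $m);
var_dump($withS);                  // int(1)
echo $m[1], " / ", $m[2], "\n";    // Lookout Tower / Remco-MOD

// nl2br() makes the hidden break visible when echoing into HTML.
echo nl2br(htmlspecialchars($sabData)), "\n";
```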

Link to comment
Share on other sites

That worked perfectly! Thanks!

 

I'd tried using str_replace to get rid of the linebreaks, but that failed. This works fine! Thanks again!

 

... One last thing then.

 

Is there a way of making the weapon type take on one array slot... even if it's one word in length or two?

At the minute I am doing:

echo "<b>Weapon Sabbed:</b> " . $intel[4] . " " . $intel[5];

 

Which is fine for weapons such as Lookout Tower or Blackpowder Missile etc.

 

But some weapons can be Shield, Nunchaku etc.

 

Obviously the script will break when that is the case and I'm unsure how to make $intel[4] hold both Shield and Lookout Tower, so I can change my echo to

 

echo "<b>Weapon Sabbed:</b> " . $intel[4] ;

 

And it'll work =]
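
One way to get the weapon into a single array slot, as a hedged sketch: put the repetition inside one capture group, using a non-capturing group (?: ...) for each extra word. The pattern covers only the "type ..." part of the sentence, and $weapons is just an illustrative name.

```php
<?php
// One capture group: a capitalised word, then any number of " Word" repeats.
$pattern = "/type ([A-Z][a-zA-Z0-9]*(?: [A-Z][a-zA-Z0-9]*)*)\./";

$weapons = [];
foreach (["type Shield.", "type Lookout Tower.", "type Blackpowder Missile."] as $line) {
    if (preg_match($pattern, $line, $intel)) {
        $weapons[] = $intel[1];
        echo "<b>Weapon Sabbed:</b> ", $intel[1], "\n";
    }
}
```

With this shape the group numbers after the weapon stay the same whether the name has one word or several.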

 

Thanks again!

Link to comment
Share on other sites

I gave that a go, by trying to group the two bits of regex I have to find the weapon currently, but that didn't work... and embarrassingly, I'm not sure how you tell regex that spaces are okay too =[

 

Gonna do some googling, though, and hopefully that'll be the whole thing nailed!

 

Thank you very much for your help! Couldn't have done it without you guys!

(Google is terrible.. not one site had that /s thing that you did, and all suggested doing string replaces and all sorts!)

 

Anyhow, thanks again!

Luke.

Link to comment
Share on other sites
