MuddyW
Posted Monday at 05:18 PM (edited)

Hello Regexperts,

I have pulled a regex from the internet to separate words that start with capital letters. The regex is here:

TEXTJOIN(" ",,REGEXEXTRACT(R158,REGEXREPLACE(R158,"([A-Z][a-z]+)","($1)")))

When I have this:

HummelfilmTamtam FilmThe Imaginarium Films

the regex works and returns this:

Hummelfilm Tamtam Film The Imaginarium Films

But when I have this:

Nordisk Film ProductionNadcon FilmZweites Deutsches Fernsehen (ZDF)

or this:

Zentropa International SwedenCanal+Ciné+

the regex breaks. It looks like the presence of non-alphanumeric characters is causing the problem.

Can anyone please help me out and rewrite the regex so that it works with non-alphanumeric characters too?

Edited Monday at 05:31 PM by MuddyW (Clarity)

Link to comment: https://forums.phpfreaks.com/topic/328847-split-words-at-capital-letters/

requinix
Posted Tuesday at 01:29 AM

I was following along until the "please do this for me" bit at the end.

REGEXREPLACE + REGEXEXTRACT like that is silly. Not sure where you got it from, but a single REGEXEXTRACT is enough to extract all the <uppercase letter + stuff>s in the cell. Check the documentation/help docs for how to make it match everything instead of just once (which is what it does by default).

For the regex part, it's currently doing <uppercase letter + lowercase letters>, so naturally it will only work with lowercase letters and not with numbers or symbols. If you want to match things that look like <uppercase letter + stuff>, then realize that "stuff" is a less precise way of saying "keep going until the next uppercase letter". Or in other words, "anything that isn't uppercase". Computers need you to be precise if you want them to work a certain way.

Excel's regex syntax for "anything that isn't ___" is the same as everybody else's, so you can check either the docs (which I'm sure include a little bit of regular expression syntax) or virtually any other regex resource to find how to write that.
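
For anyone reading along, here is a minimal sketch of the "anything that isn't uppercase" idea, written in Python rather than as a spreadsheet formula. It assumes the negated character class `[^A-Z]` is what is meant; the helper name `split_at_capitals` and the choice of Python are illustrative only and do not come from the thread.

```python
import re

# Assumption: "uppercase letter + stuff" is read as one ASCII capital followed
# by any run of characters that are not ASCII capitals.
PATTERN = re.compile(r"[A-Z][^A-Z]*")


def split_at_capitals(text: str) -> str:
    """Hypothetical helper: pull out every capital-led chunk and rejoin with spaces."""
    # findall returns all non-overlapping matches, not just the first one.
    chunks = PATTERN.findall(text)
    # A chunk can end with the original space, so trim before joining.
    return " ".join(chunk.strip() for chunk in chunks)


print(split_at_capitals("HummelfilmTamtam FilmThe Imaginarium Films"))
print(split_at_capitals("Zentropa International SwedenCanal+Ciné+"))
```

This reading treats every capital as the start of a word, so an all-caps token such as (ZDF) would still be broken apart, and any text before the first capital is dropped.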

gizmola
Posted 16 hours ago

It would be good to actually explain where you are using this regex. It looks like it's in a spreadsheet. Regex engines can have different syntax and capabilities.

You also provided examples of strings that I guess don't work right, but you didn't include the output you expect. That is important information to include in a question like this.

The core of this is very simple:

[A-Z][a-z]+

Things inside a [] pair are called character classes. So this means: match any uppercase character -> [A-Z]. This will be a single match. Then match any lowercase character -> [a-z]. The "+" following it is a quantifier which means "1 or more times". So for a match to be made, it requires at least 1 lowercase letter.

So the obvious problem with this example:

Zentropa International SwedenCanal+Ciné+

is that it has a plus sign. That could be fixed by this:

[A-Z][a-z+]+

However, the non-obvious problem is that you have a non-ASCII character in Ciné, which will also not match.

I am going to make an assumption here that you're using Excel, and that it supports .NET's regex library. So by substituting a Unicode-specific character class that matches any "lowercase" Unicode character, as well as allowing a + sign to be part of a string, this would work:

[A-Z](\p{Ll}|[+])+

I don't know if these are the only strings you have that are problematic, as company names can have all sorts of other non-ASCII characters you might have to deal with.

Which brings us to this:

Nordisk Film ProductionNadcon FilmZweites Deutsches Fernsehen (ZDF)

I assume the problem is that nothing will match the (ZDF). This is really a weakness of the approach.
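
To make the behaviour of the three patterns concrete, including the (ZDF) weakness, here is a rough sketch in Python. One loud assumption: Python's built-in `re` module does not support `\p{Ll}`, so the third pattern uses the stand-in class `[^\W\d_A-Z]` (any Unicode letter other than ASCII A-Z). That is broader than `\p{Ll}` and is only meant to approximate it.

```python
import re

SAMPLE = "SwedenCanal+Ciné+"

# Original pattern: one ASCII capital, then one or more ASCII lowercase letters.
ascii_only = re.findall(r"[A-Z][a-z]+", SAMPLE)

# Adding "+" to the class keeps the plus sign but still stops at "é".
with_plus = re.findall(r"[A-Z][a-z+]+", SAMPLE)

# Assumption: stand-in for \p{Ll}, which the built-in `re` module lacks.
# It accepts any Unicode letter except ASCII A-Z, so it also lets
# non-ASCII capitals through, unlike the real \p{Ll}.
unicode_ish = re.findall(r"[A-Z](?:[^\W\d_A-Z]|[+])+", SAMPLE)

# "(ZDF)" cannot match: every capital there is followed by another capital
# or by ")", never by a lowercase letter or "+".
zdf = re.findall(r"[A-Z](?:[^\W\d_A-Z]|[+])+", "Fernsehen (ZDF)")

print(ascii_only)   # ['Sweden', 'Canal', 'Cin']
print(with_plus)    # ['Sweden', 'Canal+', 'Cin']
print(unicode_ish)  # ['Sweden', 'Canal+', 'Ciné+']
print(zdf)          # ['Fernsehen']
```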

A better approach for this would be:

1. Parse the original string into an array, using the space as a delimiter.
2. For each element in the array, perform the regex replacement that finds capital letters and adds a space to break it up into multiple words.
3. Rejoin all the elements in the array using a space.

This would fix the problem with the ZDF as well as any similar issues, as the regex replacement would not affect existing "words" like the "(ZDF)".

I hope this helps you. Vibe coding/copy-paste only gets you so far when you aren't able to study and understand how the code works, and whether or not it is applicable to your problem.
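
The three steps above could be sketched roughly as follows, in Python rather than in a spreadsheet. The boundary rule is an assumption that is not spelled out in the post: a space is inserted before a capital letter only when the character before it is neither a capital nor an opening parenthesis, which is what leaves "(ZDF)" alone. The helper name `split_run_together_words` is hypothetical.

```python
import re

# Assumption: a word break belongs before a capital letter only when the
# preceding character is neither a capital nor an opening parenthesis.
BOUNDARY = re.compile(r"(?<=[^A-Z(])(?=[A-Z])")


def split_run_together_words(text: str) -> str:
    """Hypothetical helper following the three steps: split, fix each piece, rejoin."""
    tokens = text.split(" ")                            # 1. split on spaces
    fixed = [BOUNDARY.sub(" ", tok) for tok in tokens]  # 2. add spaces inside each token
    return " ".join(fixed)                              # 3. rejoin with spaces


print(split_run_together_words(
    "Nordisk Film ProductionNadcon FilmZweites Deutsches Fernsehen (ZDF)"
))
print(split_run_together_words("Zentropa International SwedenCanal+Ciné+"))
```

Splitting on spaces first matters here: applied to the whole string, the same pattern would also fire after each existing space and double it up.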