The Most Simplified Regular Expression Tutorial In The World

Tuesday 28 May 2013 04:50 PM

When I was new to programming, regular expressions intimidated me for quite some time. All those squiggly lines, slashes and brackets did not make me fall in love with them at first sight. Lengthy and comprehensive regular expression tutorials made me shy away from them as if they were a large kangaroo. But regular expressions provide quite powerful search and replace functions in ColdFusion, JavaScript, InDesign Server and even inside Dreamweaver. I found that once you get to know them, they are not all that scary.

Regular expressions are like Italian cooking. You start with a few essential ingredients, and you can mix them in the right order to create a complex, tasty dish. I do understand how difficult it is to write a comprehensive regular expression tutorial, and studying one can be just as difficult. It has to cover how things work, why they work the way they do, and all the other details. But if you already know how to make a simple pasta dish, it helps you learn more complicated dishes later on. And just like in cooking, regular expressions have a bit of complexity you cannot learn right away just by studying a tutorial. Once you know the basics, you have to keep working with them until you start to understand the flavors.

Here I have put together a very simple regular expression tutorial which you can complete by the time you finish a cup of coffee. It is not meant to be a complete regular expression tutorial, nor even a perfectly accurate one, but a tutorial to get you started. I hope that once you complete it, you won't be scared of regular expressions (if you are like I was), will find them fun, and will go searching Google for more.

Click on any regular expression to see it in action and move the mouse over for an explanation.

Example:

Toffee Chocolate say cake coke couscouss cupcake 1234 !@#$ 1234-5678 1234-milk ice cream ice cream red cream red cream ice coffee milk coffee lollipop "People say nothing is impossible, but I do nothing every day." අ あ \w ^ ABRA'CADABRA

Tutorial:

Let's start with the simplest. To match a letter or a word directly:
a
cake
1234
-
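
If you want to try this outside the page, here is a minimal JavaScript sketch. The variable names and the shortened sample string are my own assumptions, not part of the tutorial; it only shows that a literal pattern finds exactly the characters you typed.

```javascript
// Shortened copy of the example text (an assumption for this sketch).
const sampleText = "Toffee Chocolate say cake coke cupcake 1234 1234-5678";

// The g flag asks for every match instead of only the first one.
const cakeMatches = sampleText.match(/cake/g); // also finds the "cake" inside "cupcake"
const dashMatches = sampleText.match(/-/g);

// search() returns the position of the first match.
const firstNumberAt = sampleText.search(/1234/);

console.log(cakeMatches.length, dashMatches.length, firstNumberAt);
```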

Match Unicode characters with \u followed by the character's four-digit hexadecimal position:
\u0D85 matches the letter අ from the Sinhala Unicode range.
\u3042 matches the letter あ from the Hiragana Unicode range.
(Unicode is the universal standard for encoding characters.)
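
A small JavaScript sketch of the two \u escapes above. The test string and variable names are assumptions made for illustration.

```javascript
// The two letters from the example text, plus an ordinary word.
const unicodeText = "අ あ cake";

// test() answers true or false; match() returns what was found.
const hasSinhalaA = /\u0D85/.test(unicodeText);
const hiraganaA = unicodeText.match(/\u3042/)[0];

console.log(hasSinhalaA, hiraganaA);
```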

Match digits in our document (0123456789):
\d

Match non-digits, which is everything except 0123456789:
\D

Match word characters:
\w
Word characters include letters, digits and the underscore, but not symbols, punctuation or whitespace. Can you guess the regex for "non-word characters"? It is the uppercase of the same command:
\W

Match whitespace. Whitespace includes space, tab, line feed, next line, etc.:
\s
Can you guess the regex for non-whitespace? It is: \S
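
Here is a short JavaScript sketch that runs the shorthand commands above and their uppercase opposites against one small string. The string and variable names are assumptions chosen so that each class has something to find.

```javascript
// Contains letters, an underscore, digits, symbols and two spaces.
const classText = "cake_1 !@ 42";

const digits = classText.match(/\d/g);                // every single digit
const nonDigitCount = classText.match(/\D/g).length;  // everything else
const wordChars = classText.match(/\w/g).join("");    // letters, digits, underscore
const nonWordChars = classText.match(/\W/g).join(""); // symbols and spaces
const spaceCount = classText.match(/\s/g).length;
const nonSpaceCount = classText.match(/\S/g).length;

console.log(digits, nonDigitCount, wordChars, nonWordChars, spaceCount, nonSpaceCount);
```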

Repeat {n} times. Match 4 digits:
\d{4} \d{4}-\d{4}
Likewise, four word characters:
\w{4} \w{4} \w{4}
Try matching a number of whitespaces, non-word characters and non-digits. We can use this syntax to repeat most regex commands.

Let’s match a minimum of 6 characters, but not more than 8:
\w{6,8}

Match 6 or more characters:
\w{6,}
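
The repeat counts can be tried with a sketch like this one in JavaScript. The sample string and names are assumptions. Notice that \w{6,8} takes only the first 8 letters of the 9-letter "couscouss", because nothing tells it to stop at a word edge.

```javascript
const quantText = "cake couscouss cupcake 1234-5678";

const fourDigits = quantText.match(/\d{4}/g);           // exactly 4 digits
const digitsDashDigits = quantText.match(/\d{4}-\d{4}/g);
const sixToEight = quantText.match(/\w{6,8}/g);         // at least 6, at most 8
const sixOrMore = quantText.match(/\w{6,}/g);           // 6 or more, no upper limit

console.log(fourDigits, digitsDashDigits, sixToEight, sixOrMore);
```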

Match the beginning of the line with ^:
^\d{4}
4 digits at the beginning of a line
^\D{4}
^\w{4}

Match a line by matching the beginning and the end:
^\w{8}$
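
A hedged JavaScript sketch of ^ and $. One thing to know if you try this in JavaScript: ^ and $ match at every line only when the m (multiline) flag is set; without it they match only at the very start and end of the whole string. The three-line sample and the names are assumptions.

```javascript
// Three lines separated by \n (a made-up sample).
const lines = "1234 milk\ncake and coke\nlollipop";

const digitsAtLineStart = lines.match(/^\d{4}/gm); // m flag: ^ works per line
const wholeLine = lines.match(/^\w{8}$/gm);        // a line of exactly 8 word characters
const withoutMultiline = lines.match(/^\w{8}$/g);  // no m flag: whole string must match

console.log(digitsAtLineStart, wholeLine, withoutMultiline);
```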

\b matches a word boundary. When it is in front, it matches the beginning of a word:
\b\w{4}
When it is at the end, it matches the end of a word: \w{4}\b
Place it at both ends to match a whole word: \b\w{4}\b
As always, the uppercase of the same expression means the opposite: \B\w{4}
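
To see where \b and \B take effect, here is a small JavaScript sketch. The sample words come from the example text, and the variable names are assumptions.

```javascript
const wordText = "say cake coke cupcake";

const atWordStart = wordText.match(/\b\w{4}/g);   // 4 characters starting a word
const atWordEnd = wordText.match(/\w{4}\b/g);     // 4 characters ending a word
const wholeWords = wordText.match(/\b\w{4}\b/g);  // words of exactly 4 characters
const insideWord = wordText.match(/\B\w{4}/g);    // 4 characters NOT starting a word

console.log(atWordStart); // [ 'cake', 'coke', 'cupc' ]
console.log(atWordEnd);   // [ 'cake', 'coke', 'cake' ]
console.log(wholeWords);  // [ 'cake', 'coke' ]
console.log(insideWord);  // [ 'upca' ]
```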

. (period) matches any single character. Two of them match two characters:
r.d c..e

Pipe | creates "OR" conditions:
cake|1234

We can use brackets to group "OR" conditions:
1234-(5678|milk)

We can repeat groups too:
(cous){2}
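
The period, the pipe and groups can be tried together in a JavaScript sketch like this. The sample string is a made-up mix of words from the example text, and the names are assumptions.

```javascript
const groupText = "red couscouss 1234-5678 1234-milk cake";

const rAnyD = groupText.match(/r.d/g);                    // r, any one character, d
const cTwoAnyE = groupText.match(/c..e/g);                // c, any two characters, e
const cakeOrNumber = groupText.match(/cake|1234/g);       // either alternative
const groupedOr = groupText.match(/1234-(5678|milk)/g);   // OR limited to the group
const repeatedGroup = groupText.match(/(cous){2}/g);      // the group, twice in a row

console.log(rAnyD, cTwoAnyE, cakeOrNumber, groupedOr, repeatedGroup);
```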

? (a question mark) makes the preceding expression optional; it matches once or zero times:
(1234-)(milk)? (1234-)?(milk)
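On its own, ? is still "greedy": it takes the optional part whenever it is there. Placed after another repeat command (as in +? or *?), it turns that command into a "lazy" search. Lazy searches are happy with the smallest possible match and give the next expression its chance, but "greedy" searches keep matching as much as they can before letting the next expression proceed. The "lazy" kid eats a single ice cream, but the "greedy" kid eats everything.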
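
A minimal JavaScript sketch of the optional ? and of lazy versus greedy matching. In JavaScript a lone ? is itself a greedy quantifier; the lazy forms are written by adding ? after another quantifier, for example +? or *?. The sample strings and names are assumptions.

```javascript
const optionalText = "1234-5678 1234-milk milk";

// (milk)? is optional, so "1234-" alone is enough for a match.
const optionalMilk = optionalText.match(/(1234-)(milk)?/g);
// (1234-)? is optional, so a bare "milk" is enough for a match.
const optionalPrefix = optionalText.match(/(1234-)?(milk)/g);

const quoted = '"ice" and "cream"';
const greedyMatch = quoted.match(/".+"/)[0];  // + eats as much as it can
const lazyMatch = quoted.match(/".+?"/)[0];   // +? stops at the first closing quote

console.log(optionalMilk, optionalPrefix, greedyMatch, lazyMatch);
```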

Here is a very important feature of regular expressions. We can refer back to the match found by a group by calling the position of that group:
(\w{3} )(cream )\1
Keep in mind that we are not referring to the group itself, and we are not asking the group to repeat; we are asking to find the text matched by that group again. We can call any number of groups:
(\w{3} )(cream )\1\2
How this works:
When a group satisfies a match, it keeps a record. We can recall that record by its position. Recording this history consumes some resources, so if we do not need to recall a group's match, we can ask not to record it by placing ?: at the beginning of the group, which can improve performance:
(?:\w{4}) (cous)\1
Since we asked for the first group not to be captured, the second group takes the first position in the history records.
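
Back-references are easier to believe once you see them run. This is a JavaScript sketch with an assumed sample string and assumed variable names; exec() is used for the last pattern so the recorded group is visible.

```javascript
const creamText = "ice cream ice cream red cream red cream ice coffee";

// \1 must repeat the exact text group 1 found ("ice " or "red ").
const firstGroupAgain = creamText.match(/(\w{3} )(cream )\1/g);
// \1\2 must repeat the text of both groups, in order.
const bothGroupsAgain = creamText.match(/(\w{3} )(cream )\1\2/g);

// (?:...) groups without recording, so (cous) becomes group number 1.
const nonCapturing = /(?:\w{4}) (cous)\1/.exec("cake coke couscouss");

console.log(firstGroupAgain);  // [ 'ice cream ice ', 'red cream red ' ]
console.log(bothGroupsAgain);  // [ 'ice cream ice cream ', 'red cream red cream ' ]
console.log(nonCapturing[0], "|", nonCapturing[1]); // coke couscous | cous
```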

Match any one character from a list of characters:
[abc] matches characters "a", "b" or "c"
[ABC] is the same, but uppercase. Regular expressions are case sensitive unless you ask them not to be.
['"] searches for a single quote or a double quote
r[abcde]d

[a-z] any lowercase character from a to z
[A-Z] any uppercase character from A to Z
[A-Z0-9] any uppercase character from A to Z or any digit from 0 to 9
You can have multiple brackets too: m[d-j][d-l][ukt]

We already used ^ to find the beginning of a line. But when we use it as the first character inside square brackets, it negates the condition instead of matching the beginning. Some regex commands have more than one meaning depending on where we place them:
[^a-z] matches everything but the characters "a" to "z"
^[^A-Z] matches the first character of a string, as long as it is not an uppercase character from "A" to "Z"
m[d-j][d-l][^XYZ]
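
Here is a JavaScript sketch of the square-bracket lists, ranges and negation described above. The sample strings and names are assumptions; the i flag is JavaScript's way of asking a regex not to be case sensitive.

```javascript
const classSample = "red rid rod r9d ABRA'CADABRA";

const rListD = classSample.match(/r[abcde]d/g);       // only "e" is in the list
const quoteMarks = classSample.match(/['"]/g);        // single or double quote
const ignoreCase = "Cake".match(/[abc]/gi);           // i flag: case no longer matters

const upperOrDigit = "Toffee 1234".match(/[A-Z0-9]/g);
const notLowercase = "cake 1234".match(/[^a-z]/g);    // ^ inside brackets negates
const milk = "milk coffee".match(/m[d-j][d-l][ukt]/)[0];
const startsNonUpper = /^[^A-Z]/.test("cake");        // ^ outside: start of string
const startsUpper = /^[^A-Z]/.test("Toffee");

console.log(rListD, quoteMarks, ignoreCase, upperOrDigit, notLowercase, milk);
console.log(startsNonUpper, startsUpper);
```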

+ makes the preceding expression repeat greedily, one or more times, until it cannot find any more matches:
C[^\d]+e
+ is a greedy match: it keeps on eating all possible matches before giving the next expression a chance. Lazy matches stop as soon as they find a match and give the next command a chance. In the next example, + stops at the closing quote only because [^"] cannot match a quote:
["][^"]+["]

* makes the preceding expression repeat greedily until it cannot find any more matches, and it also makes the preceding expression optional (zero or more times):
re[123]*d
In "red", [123] matches nothing, but * still succeeds and lets the next command proceed by making [123] optional.
C[^\d]*e
[^\d] keeps repeating until it runs into the first digit, and the match then ends at the last "e" before that digit, because * greedily matches everything it can, just like +.
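
The difference between + and * shows up clearly in a small JavaScript sketch. The sample strings are made up from words in the example text, and the names are assumptions.

```javascript
const sweets = "Chocolate say cake coke 1234 milk coffee";

// Greedy: runs through every non-digit, then backs up to the last "e".
const plusMatch = sweets.match(/C[^\d]+e/)[0];
const starMatch = sweets.match(/C[^\d]*e/)[0];

// [^"]+ cannot cross a quote, so each quoted phrase is matched separately.
const quotedPhrases = 'say "nothing is impossible" and "every day"'.match(/["][^"]+["]/g);

// * allows zero repeats, + needs at least one.
const zeroOrMore = "red re1d re123d r2ed".match(/re[123]*d/g);
const oneOrMore = "red re1d re123d r2ed".match(/re[123]+d/g);

console.log(plusMatch);     // Chocolate say cake coke
console.log(quotedPhrases); // [ '"nothing is impossible"', '"every day"' ]
console.log(zeroOrMore);    // [ 'red', 're1d', 're123d' ]
console.log(oneOrMore);     // [ 're1d', 're123d' ]
```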

?= is a positive look-ahead.
We used ^ to match the beginning of a line, $ to match the end of a line, and \b for a word boundary. A positive look-ahead helps you create boundaries of your own, and the boundary starts at the beginning of the look-ahead.
Match the word "ice" in "ice cream", but not in "ice coffee". We can create a boundary for "ice cream" and search for "ice":
(?=ice cream)ice (?=cake)c (?=cola)co
We can use this to define an end position also:
ice(?= coffee) nothing(?= is)
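
A JavaScript sketch of the positive look-ahead. The sample string and names are assumptions. The replace() call at the end is there to show that the look-ahead only checks what follows; " coffee" is not part of the match and is left untouched.

```javascript
const iceText = "ice cream ice cream red cream ice coffee milk coffee";

// Look-ahead in front: the boundary starts where "ice cream" starts.
const iceInIceCream = iceText.match(/(?=ice cream)ice/g);
// Look-ahead at the end: "ice" only when " coffee" comes next.
const iceBeforeCoffee = iceText.match(/ice(?= coffee)/g);

const marked = iceText.replace(/ice(?= coffee)/g, "ICE");

console.log(iceInIceCream.length, iceBeforeCoffee.length);
console.log(marked); // ice cream ice cream red cream ICE coffee milk coffee
```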

?! is a negative look-ahead. This creates a boundary to avoid and the boundary starts at the beginning of the look-ahead value:
(?!nothing is)nothing
It matches the word "nothing" as long as it is not in "nothing is":
ice(?! coffee)
Combined with other expressions, this is a very powerful feature of regular expressions:
ice(?! \w{5} ) 1234(?!-\d{4}) \d{4}(?!-\d{4})
\W{2} can be built into \d{4}(?=\W{2})
.*day can be built into say(?!.*day)
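
And a matching JavaScript sketch for the negative look-ahead. The sample strings and names are assumptions; the two-line string for say(?!.*day) is my own, chosen because . does not cross a line break, so each "say" is only checked against the rest of its own line.

```javascript
const negText = "1234 1234-5678 1234-milk ice cream ice coffee";

// "ice" as long as " coffee" does NOT follow.
const iceNotCoffee = negText.replace(/ice(?! coffee)/g, "ICE");
// 1234 as long as a dash and 4 digits do NOT follow.
const lone1234 = negText.match(/1234(?!-\d{4})/g);
// Any 4 digits not followed by a dash and 4 more digits.
const anyFourDigits = negText.match(/\d{4}(?!-\d{4})/g);

// "say" only when "day" does not appear later on the same line.
const sayWithoutDay = "say it every day\nsay it now".replace(/say(?!.*day)/g, "SAY");

console.log(iceNotCoffee); // 1234 1234-5678 1234-milk ICE cream ice coffee
console.log(lone1234);      // [ '1234', '1234' ]
console.log(anyFourDigits); // [ '1234', '5678', '1234' ]
console.log(sayWithoutDay);
```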

We can use the escape character, a backslash, to search for the literal value of a regular expression command. We use \w to match word characters, but when you want to search for the text \w in your document, you can escape it, just like in JavaScript:
\\w
[a-z] matches characters between "a" and "z". But what if you want to match "a" or "-" or "z"?
[a\-z]
^ matches the beginning of a line, but if you want to search for ^ in your document, you can escape it in your regular expression:
\^
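
A last JavaScript sketch for escaping. The sample string and names are assumptions. Note that inside a JavaScript string the backslash has to be doubled as well, so "\\w" below is the two characters \w.

```javascript
// The text itself is: \w ^ a-z
const escapeText = "\\w ^ a-z";

const literalBackslashW = escapeText.match(/\\w/g); // a real backslash followed by w
const literalCaret = escapeText.match(/\^/g);       // a real ^ character
const aDashOrZ = escapeText.match(/[a\-z]/g);       // "a", "-" or "z", not a range

console.log(literalBackslashW, literalCaret, aDashOrZ);
```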

Posted by Saman W Jayasekara at Thursday 11 October 2012 01:14 PM . inDesign Server . This and That . JavaScript . ColdFusion

External Resources :

RegExr

5 Comments :

Pavan

Wednesday 17 April 2013 12:37 PM

Thanks dude. This is great. Probably the best beginners' tutorial on the net.

Rafael Lunardelli

Tuesday 27 November 2012 08:16 AM

Awesome .. simple and powerful tutorial about a such complex topic..

Saturday 13 October 2012 01:58 PM

I use http://www.gskinner.com/RegExr/

Mark Fuqua

Thursday 11 October 2012 10:01 PM

I'm sorry...where are the regular expressions? For instance, when I click on '1' I see the letter 'a', the word 'cake', the sequence '1234' and '-'. When I click on one of the four items, I see the results of running the regular expression, but I don't see the expression itself, which would be most helpful.

Forgive me if I am missing something obvious. I am using Chrome.

Mark

Sam

Thursday 11 October 2012 10:27 PM

Mark, the word 'cake' or the letter 'a' is itself a regular expression too, even though it is extremely simple. It is important to start with the simplest ones so we can mix them with others later on. Thank you for asking the question so I can clear it up for others.

By the way, just to be sure, do you see pages other than page [1]? This has 26 pages, and each page has a different tutorial, gradually getting more complex.
