Regular Expressions
Regular expression is a special text string for describing a search pattern.
Why would you use a Regular Expression?
Pattern matching - RegeEx helps you find strings that match a specific pattern. Example - passwords, emails etc.
Text Searching and Extraction - Extract parts of strings that match a pattern. Example - google search
Text replacement and substitution - To find and replace text that match a patern. Example - find and replace in text editors
Efficient and Compact code - RegEx can replace complex string matching code with a single expression
Data cleaning - Format data by passing a RegEx and removing unwanted characters
Compatibility - RegEx is supported by most commonly used languages like python, C++, Java, Javascript etc.
Useful for large datasets - RegEx allows for quick parsing and scraping of large data files.
Flexibility - You can define a range of regular expressions from simple ones to highly complex ones.
Metacharacters
- Metacharacters are the building blocks of regular expressions
- Characters in RegEx are understood to be either a metacharacter with a special meaning or a regular character with a literal meaning.
Cardinality
Metacharacter | Usage | Example | Explanation | Matching Strings |
---|---|---|---|---|
* | Zero or more of the preceding character | t* | Any word that has 0 or more occurrences of the letter t. | tttttttt, aaaaaa, tattaabbb |
? | Zero or one of the preceding character | t? | Any word that has 0 or 1 occurrences of the letter t. | tttttttt, aaaaaa, tattaabbb |
+ | One or more of the preceding character | t+ | Any word that has 1 or more occurrences of the letter t. | tttttttt, aaaaaat, tattaabbb |
{m, n} | The preceding character appears m to n times | A{1,5} | Any string that has 1-5 occurrences of A. | Apple, AAAAAAAAAA, bAll |
Range of values
Metacharacter | Usage | Example | Explanation | Matching Strings |
---|---|---|---|---|
[ ] | Any enclosed character | [A-Z], [0-9], [ABC] | Any value in the range specified. [A-Z] denotes any capitalized letter. [0-9] denotes any digit. [ABC] denotes any one of the 3 letters | [A-Z] --> B, C, D etc. [0-9] --> 1,2,5 etc. [ABC] --> A, B, or C |
Anchors
Metacharacter | Usage | Example | Explanation | Matching Strings |
---|---|---|---|---|
^ | Anchor - the beginning of a string | ^The | Any string that starts with The. | There, The, Theory, They’re |
$ | Anchor - the end of a string | coffee$ | a string that ends in with coffee | I am going to get myself a cup of coffee |
\b or \w | Anchor - the beginning/end of a word | \bThis\b | Any string that has the exact word This. | This is a book. |
Sequence of Characters
Metacharacter | Usage | Example | Explanation | Matching Strings |
---|---|---|---|---|
() | The matching sequence of characters | (ab), (wh) | Any string that has the specified sequence of characters will match | (ab) --> about (wh) --> what, where, why |
Wildcard Character
Metacharacter | Usage | Example | Explanation | Matching Strings |
---|---|---|---|---|
. | Any one character | a.t | A word that has a 3-char substring a(any 1 character)t. | act, pants, react |
Miscellaneous
Metacharacter | Usage |
---|---|
\ | Escape character |
[[:blank:]] | Space or tab |
| | Boolean OR |
Developing regular expressions - Hands on activity
Often when faced with a tricky regex problem it's easiest to start in a dedicated regex editor, such as RegExr and regex101 to develop a working regular expression. This section will walk you through this process with a simple example extracting information from the CU Boulder Computer Science Course Catalog.
This catalog file is somewhat large, so you should only copy the first 10-20 lines into RegExr. Just like any other testing, you'll want positive test cases (examples that match) and negative test cases (examples that don't match). The file is named course-catalog.txt
in your repository. This is a slightly cleaned up version of the course catalog in text format, which should make things a bit easier for us.
Before testing your expression, please enable the "Multiline" flag on RegExr as shown below.
Activity:
Our goal will be to count the number of courses which have an odd course number. This course, CSCI 3308, is an even course. Any ideas on how to get started? Here are some tips.
Find a "location" in each example that contains a reliable indicator of course number for all courses. How about the first line?
CSCI 1000 (1) Computer Science as a Field of Work and Study
Introduces curriculum, learning techniques, time management and career opportunities in Computer Science. Includes presentations from alumni and others with relevant educational and professional experience.
Equivalent - Duplicate Degree Credit Not Granted: CSPB 1000
Requisites: Restricted to students with 0-26 credits (Freshmen) Computer Science (CSEN-BSCS or CSEN-BA) or Engineering Open Option (XXEN) majors only.
Additional Information: Departmental Category: General Computer Science
This seems promising. We can use ^
to only match the start of each line, but would this rule out all false positives? What if the description started with the course number? These are all concerns your regex should address.
The 5 following letters are always the same CSCI
. Got any ideas on the numbers? As another hint, look at RegExr's cheatsheet on the left-hand side of the webpage. Some numbers can match any digit, while one must be limited. You may also want to look at [ ]
for enclosing a list or characters to match.
Note: You should refer to the metacharacters shared above in these notes.
Now let's see how it works in the terminal
Refer to the instructions here to connect to a linux terminal.
Once you are confident in your solution, you can move on to the terminal.
To test your regular expression we will use grep
. You need to replace the contents of the string with your regular expression from above.
For Linux or WSL2 terminals, to enable Posix, add -p option to grep command.
grep -E "<YOUR_REGULAR_EXPRESSION>" course-catalog.txt
CSCI 2275 (4) Programming and Data Structures
CSCI 2897 (3) Calculating Biological Quantities
CSCI 3155 (4) Principles of Programming Languages
...
Now you could count these manually, but regular expressions are particularly powerful when combined with simple linux tools such as wc -l
, which counts the number of lines. You can use a pipe operator to pipe the output of one command to another.
grep -E "<YOUR_REGULAR_EXPRESSION>" course-catalog.txt | wc -l
51
Did you get the right answer? If not, go back to RegExr and verify that the explanation given under "Tools" matches your intuition.