Chapter 21: Regular Expressions
Time: 90 min | Audience: Intermediate-Advanced | Prerequisites: Chapter 06
Learning Outcomes
After this chapter, you will:
- Understand regex syntax and pattern construction
- Use character classes, quantifiers, and anchors
- Apply regexes to validation, extraction, and transformation
- Avoid common regex pitfalls
- Know when regex is the right (and wrong) tool
Overview: Pattern Matching with Regular Expressions
Regular expressions (regexes) are patterns for matching strings. They're incredibly powerful but also a frequent source of confusion and bugs. This chapter covers practical regex usage for real-world tasks.
Zebra's regex engine is a Thompson NFA simulation extended with Laurikari's tagged transitions (a technique for tracking submatches) and Unicode-aware matching, so it is fast, correct, and predictable.
Key principle: Regexes are for pattern matching, not parsing. Use a real parser for structured data (XML, JSON, code).
Regex Basics
Literal Characters
The simplest regex is just literal characters:
// file: regex-literals.zbr
// teaches: basic regex literal matching
// chapter: 21

def main()
    var text = "The cat sat on the mat"
    var pattern = "cat"

    if text.matches(pattern)
        println("Pattern found!")

    // Case-sensitive
    if not text.matches("CAT")
        println("'CAT' doesn't match 'cat'")

    // Substring matching
    if text.contains("sat")
        println("'sat' is in the text")

    // Note: for simple substring matching, use .contains()
    // Don't overcomplicate with regex

    // Finding position
    var pos = text.indexOf("mat")  // 19
    if pos >= 0
        println("Found at position ${pos}")
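For comparison, the same checks can be sketched in Python, whose standard `re` module is widely available. This is an illustration only; Zebra's method names and matching semantics may differ.

```python
import re

text = "The cat sat on the mat"

# Plain substring test: no regex needed
has_sat = "sat" in text

# re.search scans the whole string for the pattern
found = re.search("cat", text) is not None

# Matching is case-sensitive by default
found_upper = re.search("CAT", text) is not None

# Position of a literal substring (-1 if absent)
pos = text.find("mat")

print(has_sat, found, found_upper, pos)
```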
The Dot (.) Wildcard
The dot matches any single character except newline:
// file: regex-dot.zbr
// teaches: dot wildcard in regex patterns
// chapter: 21

def main()
    var re = Regex.compile("c.t")

    // Matches: cat, cot, cut, c9t, c t
    if re.matches("cat")
        println("Matches 'cat'")

    if re.matches("cot")
        println("Matches 'cot'")

    if re.matches("cut")
        println("Matches 'cut'")

    if not re.matches("coat")  // 'oa' is two chars, not one
        println("Doesn't match 'coat'")

    // Practical: match email-ish pattern (simplified)
    var email_pattern = ".+@.+"
    var email_re = Regex.compile(email_pattern)

    if email_re.matches("user@example.com")
        println("Valid email pattern")
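The newline exception is easy to overlook. A small Python sketch (Python's `re`, not Zebra's API) shows the default behavior and the `re.DOTALL` flag that lifts it; whether Zebra offers an equivalent flag is not stated in this chapter.

```python
import re

dot = re.compile(r"c.t")

matches_cat = dot.fullmatch("cat") is not None        # one character between c and t
matches_coat = dot.fullmatch("coat") is not None      # two characters: no match
matches_newline = dot.fullmatch("c\nt") is not None   # dot skips newline by default

# re.DOTALL lets the dot match a newline as well
dotall = re.compile(r"c.t", re.DOTALL)
matches_newline_dotall = dotall.fullmatch("c\nt") is not None

print(matches_cat, matches_coat, matches_newline, matches_newline_dotall)
```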
Character Classes
Character classes match one character from a set:
// file: regex-character-classes.zbr
// teaches: character classes and ranges
// chapter: 21

def main()
    // Single character from a set
    var re1 = Regex.compile("[aeiou]")  // Match any vowel

    if re1.matches("a")
        println("'a' is a vowel")

    if re1.matches("e")
        println("'e' is a vowel")

    if not re1.matches("x")
        println("'x' is not a vowel")

    // Character ranges
    var digit_re = Regex.compile("[0-9]")  // Any digit

    if digit_re.matches("5")
        println("'5' is a digit")

    if not digit_re.matches("a")
        println("'a' is not a digit")

    var letter_re = Regex.compile("[a-zA-Z]")  // Any ASCII letter

    if letter_re.matches("X")
        println("'X' is a letter")

    // Negation: NOT in set
    var non_vowel_re = Regex.compile("[^aeiou]")

    if non_vowel_re.matches("b")
        println("'b' is not a vowel")

    if not non_vowel_re.matches("a")
        println("'a' is a vowel (excluded by ^)")
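Two details of character classes are worth seeing in action: a negated class matches anything not listed (including a newline), and a hyphen placed last in the class is a literal hyphen rather than a range. The sketch below uses Python's `re`; the helper `is_one` is a hypothetical name for this illustration.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")
hex_digit = re.compile(r"[0-9a-fA-F]")   # several ranges in one class
sign = re.compile(r"[+-]")               # hyphen last: a literal hyphen

def is_one(pattern, ch):
    """True if ch is exactly one character accepted by pattern."""
    return pattern.fullmatch(ch) is not None

print(is_one(vowel, "e"), is_one(non_vowel, "\n"), is_one(hex_digit, "g"))
```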
Common Character Classes (Shortcuts)
Zebra provides shortcuts for common patterns:
// file: regex-shortcuts.zbr
// teaches: common regex shortcuts
// chapter: 21

def main()
    // \d = [0-9] = digit
    var digit_re = Regex.compile("\\d")

    if digit_re.matches("7")
        println("Found digit")

    // \w = [a-zA-Z0-9_] = word character
    var word_re = Regex.compile("\\w")

    if word_re.matches("a")
        println("'a' is a word character")

    if word_re.matches("_")
        println("'_' is a word character")

    if not word_re.matches("-")
        println("'-' is not a word character")

    // \s = whitespace (space, tab, newline)
    var space_re = Regex.compile("\\s")

    if space_re.matches(" ")
        println("Space matches whitespace")

    if space_re.matches("\t")
        println("Tab matches whitespace")

    // Inverse (uppercase)
    // \D = not digit
    // \W = not word character
    // \S = not whitespace

    var not_digit = Regex.compile("\\D")

    if not_digit.matches("x")
        println("'x' is not a digit")

    if not not_digit.matches("5")
        println("'5' is a digit (excluded by \\D)")
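In Unicode-aware engines the shortcuts can be wider than their ASCII definitions. Python is one example: for text patterns, `\d` and `\w` accept non-ASCII digits and letters unless `re.ASCII` is set. Whether Zebra's `\d` is exactly `[0-9]` is worth checking for your version; the sketch below only demonstrates Python's behavior.

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE

unicode_digit = re.fullmatch(r"\d", arabic_three) is not None            # Unicode-aware
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None    # ASCII only

unicode_word = re.fullmatch(r"\w", "é") is not None
ascii_word = re.fullmatch(r"\w", "é", re.ASCII) is not None

# An explicit range means the same thing in any engine
explicit = re.fullmatch(r"[0-9]", arabic_three) is not None

print(unicode_digit, ascii_digit, unicode_word, ascii_word, explicit)
```

When input must be ASCII digits (IDs, port numbers), the explicit `[0-9]` range avoids the ambiguity.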
Quantifiers
Quantifiers specify how many times a pattern repeats:
// file: regex-quantifiers.zbr
// teaches: repetition quantifiers
// chapter: 21

def main()
    // * = zero or more
    var re_star = Regex.compile("ab*c")  // ac, abc, abbc, abbbc, etc.

    if re_star.matches("ac")
        println("Matches 'ac' (zero b's)")

    if re_star.matches("abc")
        println("Matches 'abc' (one b)")

    if re_star.matches("abbbc")
        println("Matches 'abbbc' (three b's)")

    if not re_star.matches("aXc")
        println("Doesn't match 'aXc' (X is not b)")

    // + = one or more
    var re_plus = Regex.compile("ab+c")  // abc, abbc, abbbc, etc. (NOT ac)

    if not re_plus.matches("ac")
        println("Doesn't match 'ac' (need at least one b)")

    if re_plus.matches("abc")
        println("Matches 'abc'")

    if re_plus.matches("abbc")
        println("Matches 'abbc'")

    // ? = zero or one
    var re_optional = Regex.compile("colou?r")  // color or colour

    if re_optional.matches("color")
        println("Matches 'color' (American spelling)")

    if re_optional.matches("colour")
        println("Matches 'colour' (British spelling)")

    if not re_optional.matches("coloor")
        println("Doesn't match 'coloor' (too many o's)")

    // Counted quantifiers are anchored with ^ and $ below: matches() also
    // succeeds on a partial match (as "^hello" matching "hello world" in the
    // Anchors section shows), so an unanchored a{2,4} would find "aaaa"
    // inside "aaaaa".

    // Exact count: {n}
    var re_exact = Regex.compile("^a{3}$")  // exactly three a's

    if re_exact.matches("aaa")
        println("Matches 'aaa'")

    if not re_exact.matches("aa")
        println("Doesn't match 'aa'")

    // Range: {n,m}
    var re_range = Regex.compile("^a{2,4}$")  // 2 to 4 a's

    if re_range.matches("aa")
        println("Matches 'aa'")

    if re_range.matches("aaa")
        println("Matches 'aaa'")

    if re_range.matches("aaaa")
        println("Matches 'aaaa'")

    if not re_range.matches("aaaaa")
        println("Doesn't match 'aaaaa' (too many)")
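Upper bounds on a quantifier only limit the whole input when the pattern must cover the whole input. A short Python sketch makes the difference between searching and full matching concrete (Python's `re`; Zebra's `matches` may behave like either one, so check before relying on it).

```python
import re

bounded = re.compile(r"a{2,4}")

# search() is satisfied by any substring that fits the pattern
search_hit = bounded.search("aaaaa")
found_text = search_hit.group(0)          # the first four a's

# fullmatch() requires the entire string to fit
full_hit = bounded.fullmatch("aaaaa")     # five a's is too many
full_ok = bounded.fullmatch("aaa")

print(found_text, full_hit, full_ok is not None)
```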
Anchors
Anchors assert position, not content:
// file: regex-anchors.zbr
// teaches: position anchors in regex
// chapter: 21

def main()
    // ^ = start of string
    var starts_with_hello = Regex.compile("^hello")

    if starts_with_hello.matches("hello world")
        println("Matches: string starts with 'hello'")

    if not starts_with_hello.matches("say hello")
        println("Doesn't match: 'hello' is not at start")

    // $ = end of string
    var ends_with_txt = Regex.compile("\\.txt$")

    if ends_with_txt.matches("document.txt")
        println("Matches: filename ends with .txt")

    if not ends_with_txt.matches("document.txt.bak")
        println("Doesn't match: .txt is not at end")

    // Combining ^ and $
    var exact_pattern = Regex.compile("^[a-z]+$")  // Only lowercase letters

    if exact_pattern.matches("hello")
        println("Matches: all lowercase")

    if not exact_pattern.matches("Hello")
        println("Doesn't match: has uppercase")

    if not exact_pattern.matches("hello123")
        println("Doesn't match: has numbers")

    // Word boundary: \b
    var word_boundary = Regex.compile("\\bhello\\b")

    if word_boundary.matches("hello world")
        println("Matches: 'hello' is a word")

    if not word_boundary.matches("helloworld")
        println("Doesn't match: 'hello' is part of 'helloworld'")
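Anchor behavior around line breaks varies between engines, so it pays to test it. The sketch below shows how Python's `re` treats multi-line input and trailing newlines; these details are Python-specific and Zebra may differ.

```python
import re

log = "info: started\nerror: disk full\ninfo: done"

# By default ^ anchors only at the very start of the string
start_only = re.search(r"^error", log)

# With re.MULTILINE, ^ and $ also anchor at line breaks
per_line = re.search(r"^error", log, re.MULTILINE)

# In Python, $ also matches just before a trailing newline; \Z does not
dollar_hit = re.search(r"\.txt$", "notes.txt\n")
strict_end = re.search(r"\.txt\Z", "notes.txt\n")

# \b needs a word/non-word transition on each side
whole_word = re.search(r"\bcat\b", "the cat sat")
inside_word = re.search(r"\bcat\b", "concatenate")

print(start_only, per_line.start(), strict_end, inside_word)
```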
Groups and Alternation
Groups collect parts together, and alternation provides choices:
// file: regex-groups.zbr
// teaches: grouping and alternation patterns
// chapter: 21

def main()
    // Alternation: |
    var greeting_re = Regex.compile("hello|hi|hey")

    if greeting_re.matches("hello")
        println("Matches 'hello'")

    if greeting_re.matches("hi")
        println("Matches 'hi'")

    if greeting_re.matches("hey")
        println("Matches 'hey'")

    if not greeting_re.matches("goodbye")
        println("Doesn't match 'goodbye'")

    // Groups with quantifiers
    // Anchored so the whole string must be repetitions of "ab";
    // unanchored, (ab)+ would find "ab" inside "aba"
    var repeating_group = Regex.compile("^(ab)+$")  // ab, abab, ababab, etc.

    if repeating_group.matches("ab")
        println("Matches 'ab'")

    if repeating_group.matches("abab")
        println("Matches 'abab'")

    if repeating_group.matches("ababab")
        println("Matches 'ababab'")

    if not repeating_group.matches("aba")
        println("Doesn't match 'aba'")

    // Optional group: ? applies to everything inside the parentheses
    // (for a single character, colou?r is simpler)
    var optional_group = Regex.compile("colo(u)?r")

    if optional_group.matches("color")
        println("Matches 'color'")

    if optional_group.matches("colour")
        println("Matches 'colour'")
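Alternation has the lowest precedence of any regex operator, which is a common reason to reach for a group. The Python sketch below (Python's `re`, not Zebra's API) shows how anchors bind without a group, and what a repeated capturing group retains.

```python
import re

loose = re.compile(r"^hello|hi$")        # parsed as (^hello) | (hi$)
strict = re.compile(r"^(?:hello|hi)$")   # the group limits what | applies to

loose_prefix = loose.search("hello there") is not None     # starts with hello
strict_prefix = strict.search("hello there") is not None   # not exactly hello or hi
strict_exact = strict.search("hi") is not None

# A repeated capturing group keeps only its last repetition
repeated = re.fullmatch(r"(ab)+", "ababab")
last_capture = repeated.group(1)
whole = repeated.group(0)

print(loose_prefix, strict_prefix, strict_exact, last_capture, whole)
```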
Practical Validation Patterns
Email Validation
Warning: email validation is complex! This is a simplified pattern.
// file: regex-email.zbr
// teaches: email validation pattern (simplified)
// chapter: 21

// Structural validation without regex
def is_valid_email(email as str) as bool
    // Must have @ and .
    if not email.contains("@")
        return false

    var parts = email.split("@")
    if parts.count() != 2
        return false  // Multiple @ signs

    var local = parts.at(0)
    var domain = parts.at(1)

    if local.len == 0 or domain.len == 0
        return false  // Empty parts

    if not domain.contains(".")
        return false  // No TLD

    return true

def main()
    // Very basic email pattern
    // In production, use an email verification service
    // Anchored with ^ and $ so surrounding junk is rejected
    var email_pattern = Regex.compile("^[a-z0-9]+@[a-z]+\\.[a-z]+$")

    if email_pattern.matches("user@example.com")
        println("Valid format")

    if not email_pattern.matches("invalid.email@")
        println("Invalid: missing domain")

    if not email_pattern.matches("no-at-sign.com")
        println("Invalid: no @ sign")

    if is_valid_email("alice@example.com")
        println("Email looks valid")
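A permissive shape check is often more useful than a strict pattern, because strict patterns tend to reject valid addresses. The Python sketch below is one possible approach, not a complete validator; the name `looks_like_email` and the `max_len` default of 254 (a commonly cited length limit) are assumptions made for this illustration.

```python
import re

# One @, no whitespace, and at least one dot in the domain part
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(address: str, max_len: int = 254) -> bool:
    """Cheap shape check; only a confirmation mail proves deliverability."""
    if len(address) > max_len:
        return False
    return EMAIL_SHAPE.fullmatch(address) is not None

print(looks_like_email("alice@example.com"))
print(looks_like_email("invalid.email@"))
```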
Phone Number Validation
// file: regex-phone.zbr
// teaches: phone number pattern matching
// chapter: 21

// Flexible: accept various formats
def is_valid_phone_flexible(phone as str) as bool
    // Count digits only; separators such as "-", " ", "(" and ")" are ignored
    var digit_re = Regex.compile("\\d")  // compile once, outside the loop
    var digit_count = 0
    for char in phone.split("")
        if digit_re.matches(char)
            digit_count = digit_count + 1

    // Must have between 10 and 15 digits
    return digit_count >= 10 and digit_count <= 15

def main()
    // US format: 123-456-7890
    // Anchored with ^ and $ so extra leading or trailing digits are rejected
    var us_phone = Regex.compile("^\\d{3}-\\d{3}-\\d{4}$")

    if us_phone.matches("555-123-4567")
        println("Valid US phone")

    if not us_phone.matches("5551234567")  // Missing dashes
        println("Invalid: wrong format")

    // International: +1-234-567-8900
    var intl_phone = Regex.compile("^\\+\\d{1,3}-\\d{3}-\\d{3}-\\d{4}$")

    if intl_phone.matches("+1-555-123-4567")
        println("Valid international")
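Accepting many input formats usually goes together with storing one canonical format. A minimal Python sketch of that normalization step follows; the function name `normalize_us_phone` and the choice to drop a leading country code `1` are assumptions for this example, not part of the chapter's code.

```python
import re

def normalize_us_phone(raw: str):
    """Return NNN-NNN-NNNN, or None if raw does not contain 10 usable digits."""
    digits = re.sub(r"\D", "", raw)            # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop the US country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(normalize_us_phone("(555) 123-4567"))
print(normalize_us_phone("+1 555 123 4567"))
```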
URL Validation
// file: regex-url.zbr
// teaches: URL pattern matching
// chapter: 21

// More complete
def is_valid_url(url as str) as bool
    // Strip the scheme: "http://" is 7 characters, "https://" is 8
    var prefix_len = 0
    if url.startsWith("https://")
        prefix_len = 8
    elif url.startsWith("http://")
        prefix_len = 7
    else
        return false

    var after_protocol = url.substring(prefix_len, url.len)
    if after_protocol.len == 0
        return false

    // Must have at least one dot
    if not after_protocol.contains(".")
        return false

    // No spaces
    if after_protocol.contains(" ")
        return false

    return true

def main()
    // Basic HTTP(S) URL; ^ keeps the scheme at the start of the string
    var url_pattern = Regex.compile("^https?://[a-z0-9]+\\.[a-z0-9]+")

    if url_pattern.matches("https://example.com")
        println("Valid HTTPS URL")

    if url_pattern.matches("http://example.co.uk")
        println("Valid HTTP URL")

    if not url_pattern.matches("ftp://example.com")
        println("Doesn't match: FTP not in pattern")
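URLs are structured data, so the chapter's key principle applies: where a real parser exists, prefer it to a pattern. The Python sketch below uses the standard library's `urllib.parse.urlsplit`; the function name `is_http_url` and the "host must contain a dot" rule (borrowed from the checks above) are assumptions for this example.

```python
from urllib.parse import urlsplit

def is_http_url(url: str) -> bool:
    """Return True for http(s) URLs whose host contains a dot."""
    if any(ch.isspace() for ch in url):
        return False                      # no whitespace anywhere
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False                      # e.g. a malformed IPv6 literal
    if parts.scheme not in ("http", "https"):
        return False
    return host is not None and "." in host

print(is_http_url("https://example.com"))
print(is_http_url("ftp://example.com"))
```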
Finding and Extracting Patterns
Finding Matches
// file: regex-finding.zbr
// teaches: finding matches within text
// chapter: 21

def main()
    var text = "The prices are: $10, $25, and $100"

    // Find prices (simple pattern)
    var price_pattern = Regex.compile("\\$\\d+")

    // Check whether the text contains any price
    if price_pattern.matches(text)
        println("Contains price pattern")

    // Extract all prices
    var prices = List(str)()

    // Manual extraction (since full regex API varies)
    var digit_re = Regex.compile("\\d")  // compile once, outside the loops
    var search_start = 0
    while search_start < text.len
        var dollar_pos = text.indexOf("$", search_start)
        if dollar_pos < 0
            break

        var num_start = dollar_pos + 1
        var num_end = num_start

        while num_end < text.len
            var char = text.charAt(num_end)
            if digit_re.matches(char)
                num_end = num_end + 1
            else
                break

        // Skip a bare "$" that has no digits after it
        if num_end > num_start
            var price = text.substring(dollar_pos, num_end)
            prices.add(price)
        search_start = num_end

    println("Found prices:")
    for price in prices
        println(" ${price}")
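Where the engine exposes a find-all operation, the manual scan collapses to a line or two. The sketch below shows this with Python's `re.findall` and `re.finditer`; whether Zebra provides an equivalent call is not stated here, which is why the example above scans by hand.

```python
import re

text = "The prices are: $10, $25, and $100"

# Every non-overlapping match, as strings
prices = re.findall(r"\$\d+", text)

# A capture group pulls out just the number for conversion
amounts = [int(m.group(1)) for m in re.finditer(r"\$(\d+)", text)]

print(prices)
print(sum(amounts))
```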
Extracting from Structured Text
// file: regex-extract-structured.zbr
// teaches: extracting data from formatted text
// chapter: 21

def extract_person_data(line as str) as HashMap(str, str)?
    // Expected format: Name | Age | Email
    // A regex such as ^(.+)\|(.+)\|(.+)$ could capture the three fields,
    // but with a fixed delimiter a plain split is simpler.
    var parts = line.split("|")
    if parts.count() != 3
        return nil

    var data = HashMap(str, str)()
    data.put("name", parts.at(0).trim())
    data.put("age", parts.at(1).trim())
    data.put("email", parts.at(2).trim())

    return data

def main()
    var record = "John Smith | 30 | john@example.com"

    var extracted = extract_person_data(record)

    if extracted != nil
        println("Name: ${extracted.fetch("name")}")
        println("Age: ${extracted.fetch("age")}")
        println("Email: ${extracted.fetch("email")}")
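When the fields also need validating (for example, the age must be numeric), a pattern with named groups can extract and check in one step. This is a Python sketch using `re`'s `(?P<name>...)` syntax; the pattern, the `RECORD` constant, and the `parse_record` helper are assumptions for this illustration.

```python
import re

# name: anything but "|", age: digits only, email: no whitespace
RECORD = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<age>\d+)\s*\|\s*(?P<email>\S+)\s*$"
)

def parse_record(line: str):
    """Return a dict of the three fields, or None if the line does not fit."""
    m = RECORD.match(line)
    return m.groupdict() if m else None

print(parse_record("John Smith | 30 | john@example.com"))
print(parse_record("John Smith | thirty | john@example.com"))
```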
Text Replacement with Patterns
Simple Replacement
// file: regex-replace.zbr
// teaches: pattern-based text replacement
// chapter: 21

def main()
    var text = "The cat sat on the mat"

    // Replace first occurrence of pattern
    var pattern = Regex.compile("at")
    var replaced = pattern.replace(text, "AT")
    println(replaced)  // "The cAT sat on the mat"

    // Replace all occurrences
    var all_replaced = pattern.replaceAll(text, "AT")
    println(all_replaced)  // "The cAT sAT on the mAT"

    // Case-insensitive replacement (if supported)
    var case_insensitive = text.lower().replace("cat", "dog")
    // Note: this loses original case
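An engine with a case-insensitive flag can replace without lowercasing the rest of the text. The sketch below shows first-only, all, and case-insensitive replacement with Python's `re.sub`; the flag and argument names are Python's and may not exist in Zebra.

```python
import re

text = "The cat sat on the mat"

# count=1 limits the substitution to the first match
first_only = re.sub(r"at", "AT", text, count=1)

# Without count, every match is replaced
all_hits = re.sub(r"at", "AT", text)

# IGNORECASE matches any casing but leaves the rest of the text untouched
ignore_case = re.sub(r"cat", "dog", "The Cat chased the cat", flags=re.IGNORECASE)

print(first_only)
print(all_hits)
print(ignore_case)
```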
Data Transformation
// file: regex-transform.zbr
// teaches: using regex for data transformation
// chapter: 21

// Escape special characters
def escape_html(text as str) as str
    // Replace & first so the entities added below are not escaped again
    var escaped = text.replace("&", "&amp;")
    escaped = escaped.replace("<", "&lt;")
    escaped = escaped.replace(">", "&gt;")
    escaped = escaped.replace("\"", "&quot;")
    escaped = escaped.replace("'", "&#39;")
    return escaped

def main()
    // Convert dates from MM/DD/YYYY to YYYY-MM-DD
    var date = "03/15/2025"

    var parts = date.split("/")
    if parts.count() == 3
        var month = parts.at(0)
        var day = parts.at(1)
        var year = parts.at(2)

        var iso_date = "${year}-${month}-${day}"
        println(iso_date)  // 2025-03-15

    var html_unsafe = "<script>alert('XSS')</script>"
    println(escape_html(html_unsafe))
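Both transformations have compact equivalents when capture-group references are available. The Python sketch below rewrites every date in a string with `re.sub` backreferences and escapes HTML with the standard library's `html.escape`; the function name `us_to_iso` is an assumption for this example.

```python
import html
import re

def us_to_iso(text: str) -> str:
    """Rewrite each MM/DD/YYYY date in text as YYYY-MM-DD."""
    # \1 = month, \2 = day, \3 = year
    return re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\1-\2", text)

converted = us_to_iso("Due 03/15/2025 and 12/01/2024")

# html.escape handles &, <, > and (by default) both quote characters
escaped = html.escape("<script>alert('XSS')</script>")

print(converted)
print(escaped)
```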
Common Pitfalls
Greedy vs. Non-Greedy
// file: regex-greedy.zbr
// teaches: understanding greedy matching
// chapter: 21

def main()
    // Greedy: matches as much as possible
    var text = "<name>John</name> and <name>Jane</name>"

    // This is too greedy!
    var greedy = Regex.compile("<name>.*</name>")
    // Matches: <name>John</name> and <name>Jane</name> (TOO MUCH!)

    // Better: be more specific
    var specific = Regex.compile("<name>[^<]+</name>")
    // Matches: <name>John</name> or <name>Jane</name> (correctly)

    // For non-greedy, many regex engines use .*? (with ?)
    // Check Zebra's specific syntax for your version
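The three variants are easiest to compare side by side. This sketch runs them in Python, where the lazy form `.*?` is supported; it says nothing about whether Zebra accepts that syntax.

```python
import re

text = "<name>John</name> and <name>Jane</name>"

# Greedy: .* runs to the LAST closing tag
greedy = re.findall(r"<name>.*</name>", text)

# Lazy: .*? stops at the FIRST closing tag
lazy = re.findall(r"<name>.*?</name>", text)

# Specific: [^<]+ cannot cross a tag at all; the group returns just the name
specific = re.findall(r"<name>([^<]+)</name>", text)

print(len(greedy), lazy, specific)
```

The negated character class works in engines that lack lazy quantifiers, which is one reason to prefer it.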
Special Characters Need Escaping
// file: regex-escaping.zbr
// teaches: escaping special characters
// chapter: 21

def main()
    // These characters have special meaning:
    // . ^ $ * + ? { } [ ] \ | ( )

    // To match a literal dot
    var file_extension = Regex.compile("\\.txt$")

    if file_extension.matches("document.txt")
        println("Matches text file")

    // To match a literal dollar sign
    var price_pattern = Regex.compile("\\$[0-9]+")

    if price_pattern.matches("$50")
        println("Matches price")

    // To match a literal backslash
    var path_pattern = Regex.compile("C:\\\\Users")  // Note: double backslash

    if path_pattern.matches("C:\\Users")
        println("Matches Windows path")
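When the text to match comes from user input or a variable, escaping by hand is error-prone. Many regex libraries ship a helper for this; the sketch below uses Python's `re.escape`. Whether Zebra has a comparable helper is not covered here.

```python
import re

literal = "$5.00 (approx.)"

# re.escape backslash-escapes every character that is special in a pattern
safe = re.compile(re.escape(literal))
escaped_hit = safe.search("Total: $5.00 (approx.) today") is not None

# Unescaped, the dot matches any character
dot_any = re.search(r"a.c", "abc") is not None
dot_literal = re.search(re.escape("a.c"), "abc") is not None

# Raw strings avoid doubling backslashes at the string-literal level
path_hit = re.search(r"C:\\Users", r"C:\Users\alice") is not None

print(escaped_hit, dot_any, dot_literal, path_hit)
```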
Know Your Regex Dialect
Different tools support different features. Zebra uses a Thompson NFA, which:
- ✅ Supports basic patterns well
- ✅ Has predictable performance (no catastrophic backtracking)
- ⚠️ May not support advanced features that usually rely on backtracking, such as lookahead or backreferences
Check documentation for your version.
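If lookahead turns out to be unavailable, a pattern that relies on it can often be split into several simple checks. The Python sketch below compares the two styles on a hypothetical rule (at least 8 characters, one digit, one lowercase letter); the function names are assumptions for this illustration.

```python
import re

# Lookahead version: each (?=...) asserts a condition without consuming input
LOOKAHEAD = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")

def check_with_lookahead(s: str) -> bool:
    return LOOKAHEAD.search(s) is not None

# Portable version: one simple pattern per condition, plus a length test
HAS_DIGIT = re.compile(r"\d")
HAS_LOWER = re.compile(r"[a-z]")

def check_without_lookahead(s: str) -> bool:
    return (len(s) >= 8
            and HAS_DIGIT.search(s) is not None
            and HAS_LOWER.search(s) is not None)

for sample in ["passw0rdx", "password", "12345678", "a1"]:
    print(sample, check_with_lookahead(sample), check_without_lookahead(sample))
```

The two functions agree on these single-line samples; the split version is also easier to extend and to report errors from.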
Practical Application: Log Analysis
// file: regex-log-analysis.zbr
// teaches: using regex for real log analysis
// chapter: 21

def analyze_logs(filename as str)
    var result = File.read(filename)
    if result.isErr()
        println("Error: ${result.error()}")
        return

    var content = result.value()
    var lines = content.split("\n")

    var error_count = 0
    var warning_count = 0
    var error_lines = List(str)()

    // The severity markers are fixed strings, so .contains() is enough here
    for line in lines
        if line.contains("[ERROR]")
            error_count = error_count + 1
            error_lines.add(line)
        elif line.contains("[WARN]")
            warning_count = warning_count + 1

    println("Log Analysis:")
    println(" Errors: ${error_count}")
    println(" Warnings: ${warning_count}")

    if error_count > 0
        println("\nErrors:")
        for error_line in error_lines
            println(" ${error_line}")

def main()
    analyze_logs("app.log")
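Once several severities need counting, a single pattern with alternation and a capture group scales better than a chain of substring tests. The sketch below is in Python and works on an in-memory list of lines; the sample log lines and the `count_severities` helper are made up for this illustration.

```python
import re
from collections import Counter

# One pattern covers every severity; the group captures which one matched
SEVERITY = re.compile(r"\[(ERROR|WARN|INFO)\]")

def count_severities(lines):
    """Count lines per severity, using the first marker found on each line."""
    counts = Counter()
    for line in lines:
        m = SEVERITY.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "2025-03-15 [INFO] service started",
    "2025-03-15 [WARN] disk at 91%",
    "2025-03-15 [ERROR] write failed",
    "2025-03-15 [ERROR] retry failed",
    "unrelated line",
]
counts = count_severities(sample)
print(counts["ERROR"], counts["WARN"], counts["INFO"])
```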
Key Takeaways
1. Regex for Patterns — Use for validation, searching, and pattern-based extraction.
2. Not for Parsing — Use a real parser for JSON, XML, structured formats.
3. Be Specific — Avoid greedy patterns. Use character classes to narrow matches.
4. Test Thoroughly — Regex bugs are subtle. Test edge cases.
5. Document Your Patterns — Future you will thank you.
6. Simple First — Is .contains() sufficient? Use it instead of regex overhead.
Exercises
1. URL Extractor — Find all URLs in text matching http(s)://
2. Log Severity Counter — Count [ERROR], [WARN], [INFO] lines in a log file
3. Email List Validator — Read CSV, validate email column, report invalid entries
4. Phone Formatter — Read list of numbers in various formats, output consistent format
5. JSON Key Extractor — Extract all JSON key names from a file
What's Next
Chapter 22 covers FFI (Foreign Function Interface): calling code written in other languages. Regexes are often used to validate and extract data arriving from external systems, which makes this chapter a natural precursor.