Chapter 21: Regular Expressions

Time: 90 min | Audience: Intermediate-Advanced | Prerequisites: Chapter 06


Learning Outcomes

After this chapter, you will: - Understand regex syntax and pattern construction - Use character classes, quantifiers, and anchors - Apply regexes to validation, extraction, and transformation - Avoid common regex pitfalls - Know when regex is the right (and wrong) tool


Overview: Pattern Matching with Regular Expressions

Regular expressions (regexes) are patterns for matching strings. They're incredibly powerful but also a frequent source of confusion and bugs. This chapter covers practical regex usage for real-world tasks.

Zebra's regex engine uses Thompson NFA with Laurikari for proper unicode support—fast, correct, and predictable.

Key principle: Regexes are for pattern matching, not parsing. Use a real parser for structured data (XML, JSON, code).


Regex Basics

Literal Characters

The simplest regex is just literal characters:

// file: regex-literals.zbr

// teaches: basic regex literal matching // chapter: 21

def main() var text = "The cat sat on the mat" var pattern = "cat"

if text.matches(pattern) println("Pattern found!")

// Case-sensitive if not text.matches("CAT") println("'CAT' doesn't match 'cat'")

// Substring matching if text.contains("sat") println("'sat' is in the text")

// Note: for simple substring matching, use .contains() // Don't overcomplicate with regex

// Finding position var pos = text.indexOf("mat") # 19 if pos >= 0 println("Found at position ${pos}")

The Dot (.) Wildcard

The dot matches any single character except newline:

// file: regex-dot.zbr

// teaches: dot wildcard in regex patterns // chapter: 21

def main() var re = Regex.compile("c.t")

// Matches: cat, cot, cut, c9t, c t if re.matches("cat") println("Matches 'cat'")

if re.matches("cot") println("Matches 'cot'")

if re.matches("cut") println("Matches 'cut'")

if not re.matches("coat") # 'oa' is two chars, not one println("Doesn't match 'coat'")

// Practical: match email-ish pattern (simplified) var email_pattern = ".+@.+" var email_re = Regex.compile(email_pattern)

if email_re.matches("user@example.com") println("Valid email pattern")


Character Classes

Character classes match one character from a set:

// file: regex-character-classes.zbr

// teaches: character classes and ranges // chapter: 21

def main() // Single character from a set var re1 = Regex.compile("[aeiou]") # Match any vowel

if re1.matches("a") println("'a' is a vowel")

if re1.matches("e") println("'e' is a vowel")

if not re1.matches("x") println("'x' is not a vowel")

// Character ranges var digit_re = Regex.compile("[0-9]") # Any digit

if digit_re.matches("5") println("'5' is a digit")

if not digit_re.matches("a") println("'a' is not a digit")

var letter_re = Regex.compile("[a-zA-Z]") # Any letter

if letter_re.matches("X") println("'X' is a letter")

// Negation: NOT in set var non_vowel_re = Regex.compile("[^aeiou]")

if non_vowel_re.matches("b") println("'b' is not a vowel")

if not non_vowel_re.matches("a") println("'a' is a vowel (excluded by ^)")

Common Character Classes (Shortcuts)

Zebra provides shortcuts for common patterns:

// file: regex-shortcuts.zbr

// teaches: common regex shortcuts // chapter: 21

def main() // \d = [0-9] = digit var digit_re = Regex.compile("\\d")

if digit_re.matches("7") println("Found digit")

// \w = [a-zA-Z0-9_] = word character var word_re = Regex.compile("\\w")

if word_re.matches("a") println("'a' is a word character")

if word_re.matches("_") println("'_' is a word character")

if not word_re.matches("-") println("'-' is not a word character")

// \s = whitespace (space, tab, newline) var space_re = Regex.compile("\\s")

if space_re.matches(" ") println("Space matches whitespace")

if space_re.matches("\t") println("Tab matches whitespace")

// Inverse (uppercase) // \D = not digit // \W = not word character // \S = not whitespace

var not_digit = Regex.compile("\\D")

if not_digit.matches("x") println("'x' is not a digit")

if not not_digit.matches("5") println("'5' is a digit (excluded by \\D)")


Quantifiers

Quantifiers specify how many times a pattern repeats:

// file: regex-quantifiers.zbr

// teaches: repetition quantifiers // chapter: 21

def main() // * = zero or more var re_star = Regex.compile("ab*c") // ac, abc, abbc, abbbc, etc.

if re_star.matches("ac") println("Matches 'ac' (zero b's)")

if re_star.matches("abc") println("Matches 'abc' (one b)")

if re_star.matches("abbbc") println("Matches 'abbbc' (three b's)")

if not re_star.matches("aXc") println("Doesn't match 'aXc' (X is not b)")

// + = one or more var re_plus = Regex.compile("ab+c") // abc, abbc, abbbc, etc. (NOT ac)

if not re_plus.matches("ac") println("Doesn't match 'ac' (need at least one b)")

if re_plus.matches("abc") println("Matches 'abc'")

if re_plus.matches("abbc") println("Matches 'abbc'")

// ? = zero or one var re_optional = Regex.compile("colou?r") // color or colour

if re_optional.matches("color") println("Matches 'color' (American spelling)")

if re_optional.matches("colour") println("Matches 'colour' (British spelling)")

if not re_optional.matches("coloor") println("Doesn't match 'coloor' (too many o's)")

// Exact count: {n} var re_exact = Regex.compile("a{3}") // exactly three a's

if re_exact.matches("aaa") println("Matches 'aaa'")

if not re_exact.matches("aa") println("Doesn't match 'aa'")

// Range: {n,m} var re_range = Regex.compile("a{2,4}") // 2 to 4 a's

if re_range.matches("aa") println("Matches 'aa'")

if re_range.matches("aaa") println("Matches 'aaa'")

if re_range.matches("aaaa") println("Matches 'aaaa'")

if not re_range.matches("aaaaa") println("Doesn't match 'aaaaa' (too many)")


Anchors

Anchors assert position, not content:

// file: regex-anchors.zbr

// teaches: position anchors in regex // chapter: 21

def main() // ^ = start of string var starts_with_hello = Regex.compile("^hello")

if starts_with_hello.matches("hello world") println("Matches: string starts with 'hello'")

if not starts_with_hello.matches("say hello") println("Doesn't match: 'hello' is not at start")

// $ = end of string var ends_with_txt = Regex.compile("\\.txt$")

if ends_with_txt.matches("document.txt") println("Matches: filename ends with .txt")

if not ends_with_txt.matches("document.txt.bak") println("Doesn't match: .txt is not at end")

// Combining ^ and $ var exact_pattern = Regex.compile("^[a-z]+$") // Only lowercase letters

if exact_pattern.matches("hello") println("Matches: all lowercase")

if not exact_pattern.matches("Hello") println("Doesn't match: has uppercase")

if not exact_pattern.matches("hello123") println("Doesn't match: has numbers")

// Word boundary: \b var word_boundary = Regex.compile("\\bhello\\b")

if word_boundary.matches("hello world") println("Matches: 'hello' is a word")

if not word_boundary.matches("helloworld") println("Doesn't match: 'hello' is part of 'helloworld'")


Groups and Alternation

Groups collect parts together, and alternation provides choices:

// file: regex-groups.zbr

// teaches: grouping and alternation patterns // chapter: 21

def main() // Alternation: | var greeting_re = Regex.compile("hello|hi|hey")

if greeting_re.matches("hello") println("Matches 'hello'")

if greeting_re.matches("hi") println("Matches 'hi'")

if greeting_re.matches("hey") println("Matches 'hey'")

if not greeting_re.matches("goodbye") println("Doesn't match 'goodbye'")

// Groups with quantifiers var repeating_group = Regex.compile("(ab)+") // ab, abab, ababab, etc.

if repeating_group.matches("ab") println("Matches 'ab'")

if repeating_group.matches("abab") println("Matches 'abab'")

if repeating_group.matches("ababab") println("Matches 'ababab'")

if not repeating_group.matches("aba") println("Doesn't match 'aba'")

// Optional group var optional_group = Regex.compile("colou?r|color") // Actually redundant—simpler: colou?r

if optional_group.matches("color") println("Matches 'color'")

if optional_group.matches("colour") println("Matches 'colour'")


Practical Validation Patterns

Email Validation

Warning: email validation is complex! This is a simplified pattern.

// file: regex-email.zbr

// teaches: email validation pattern (simplified) // chapter: 21

def main() // Very basic email pattern // In production, use an email verification service var email_pattern = Regex.compile("[a-z0-9]+@[a-z]+\\.[a-z]+")

if email_pattern.matches("user@example.com") println("Valid format")

if not email_pattern.matches("invalid.email@") println("Invalid: missing domain")

if not email_pattern.matches("no-at-sign.com") println("Invalid: no @ sign")

// Better validation: check length, etc. def is_valid_email(email as str) as bool // Must have @ and . if not email.contains("@") return false

var parts = email.split("@") if parts.count() != 2 return false // Multiple @ signs

var local = parts.at(0) var domain = parts.at(1)

if local.len == 0 or domain.len == 0 return false // Empty parts

if not domain.contains(".") return false // No TLD

return true

if is_valid_email("alice@example.com") println("Email looks valid")

Phone Number Validation

// file: regex-phone.zbr

// teaches: phone number pattern matching // chapter: 21

def main() // US format: 123-456-7890 var us_phone = Regex.compile("\\d{3}-\\d{3}-\\d{4}")

if us_phone.matches("555-123-4567") println("Valid US phone")

if not us_phone.matches("5551234567") // Missing dashes println("Invalid: wrong format")

// International: +1-234-567-8900 var intl_phone = Regex.compile("\\+\\d{1,3}-\\d{3}-\\d{3}-\\d{4}")

if intl_phone.matches("+1-555-123-4567") println("Valid international")

// Flexible: accept various formats def is_valid_phone_flexible(phone as str) as bool // Must have at least 10 digits var digits_only = phone.replace("-", "").replace(" ", "").replace("(", "").replace(")", "")

var digit_count = 0 for char in digits_only.split("") if Regex.compile("\\d").matches(char) digit_count = digit_count + 1

return digit_count >= 10 and digit_count <= 15

URL Validation

// file: regex-url.zbr

// teaches: URL pattern matching // chapter: 21

def main() // Basic HTTP(S) URL var url_pattern = Regex.compile("https?://[a-z0-9]+\\.[a-z0-9]+")

if url_pattern.matches("https://example.com") println("Valid HTTPS URL")

if url_pattern.matches("http://example.co.uk") println("Valid HTTP URL")

if not url_pattern.matches("ftp://example.com") println("Doesn't match: FTP not in pattern")

// More complete def is_valid_url(url as str) as bool if not url.startsWith("http://") and not url.startsWith("https://") return false

var after_protocol = url.substring(7, url.len) if after_protocol.len == 0 return false

// Must have at least one dot if not after_protocol.contains(".") return false

// No spaces if after_protocol.contains(" ") return false

return true


Finding and Extracting Patterns

Finding Matches

// file: regex-finding.zbr

// teaches: finding matches within text // chapter: 21

def main() var text = "The prices are: $10, $25, and $100"

// Find prices (simple pattern) var price_pattern = Regex.compile("\\$\\d+")

// Find first match if price_pattern.matches(text) println("Contains price pattern")

// Extract all prices var prices = List(str)()

// Manual extraction (since full regex API varies) var search_start = 0 while search_start < text.len var dollar_pos = text.indexOf("$", search_start) if dollar_pos < 0 break

var num_start = dollar_pos + 1 var num_end = num_start

while num_end < text.len var char = text.charAt(num_end) if Regex.compile("\\d").matches(char) num_end = num_end + 1 else break

var price = text.substring(dollar_pos, num_end) prices.add(price) search_start = num_end

println("Found prices:") for price in prices println(" ${price}")

Extracting from Structured Text

// file: regex-extract-structured.zbr

// teaches: extracting data from formatted text // chapter: 21

def extract_person_data(line as str) as HashMap(str, str)? // Expected format: Name | Age | Email var pattern = Regex.compile("^(.+)\\|(.+)\\|(.+)$")

// Simplified: just split by | var parts = line.split("|") if parts.count() != 3 return nil

var data = HashMap(str, str)() data.put("name", parts.at(0).trim()) data.put("age", parts.at(1).trim()) data.put("email", parts.at(2).trim())

return data

def main() var record = "John Smith | 30 | john@example.com"

var extracted = extract_person_data(record)

if extracted != nil println("Name: ${extracted.fetch("name")}") println("Age: ${extracted.fetch("age")}") println("Email: ${extracted.fetch("email")}")


Text Replacement with Patterns

Simple Replacement

// file: regex-replace.zbr

// teaches: pattern-based text replacement // chapter: 21

def main() var text = "The cat sat on the mat"

// Replace first occurrence of pattern var pattern = Regex.compile("at") var replaced = pattern.replace(text, "AT") println(replaced) // "The cAT sat on the mat"

// Replace all occurrences var all_replaced = pattern.replaceAll(text, "AT") println(all_replaced) // "The cAT sAT on the mAT"

// Case-insensitive replacement (if supported) var case_insensitive = text.lower().replace("cat", "dog") // Note: this loses original case

Data Transformation

// file: regex-transform.zbr

// teaches: using regex for data transformation // chapter: 21

def main() // Convert dates from MM/DD/YYYY to YYYY-MM-DD var date = "03/15/2025"

var parts = date.split("/") if parts.count() == 3 var month = parts.at(0) var day = parts.at(1) var year = parts.at(2)

var iso_date = "${year}-${month}-${day}" println(iso_date) // 2025-03-15

// Escape special characters def escape_html(text as str) as str var escaped = text.replace("&", "&amp;") escaped = escaped.replace("<", "&lt;") escaped = escaped.replace(">", "&gt;") escaped = escaped.replace("\"", "&quot;") escaped = escaped.replace("'", "&#39;") return escaped

var html_unsafe = "<script>alert('XSS')</script>" println(escape_html(html_unsafe))


Common Pitfalls

Greedy vs. Non-Greedy

// file: regex-greedy.zbr

// teaches: understanding greedy matching // chapter: 21

def main() // Greedy: matches as much as possible var text = "<name>John</name> and <name>Jane</name>"

// This is too greedy! var greedy = Regex.compile("<name>.*</name>") // Matches: <name>John</name> and <name>Jane</name> (TOO MUCH!)

// Better: be more specific var specific = Regex.compile("<name>[^<]+</name>") // Matches: <name>John</name> or <name>Jane</name> (correctly)

// For non-greedy, many regex engines use .*? (with ?) // Check Zebra's specific syntax for your version

Special Characters Need Escaping

// file: regex-escaping.zbr

// teaches: escaping special characters // chapter: 21

def main() // These characters have special meaning: // . ^ $ * + ? { } [ ] \ | ( )

// To match a literal dot var file_extension = Regex.compile("\\.txt$")

if file_extension.matches("document.txt") println("Matches text file")

// To match a literal dollar sign var price_pattern = Regex.compile("\\$[0-9]+")

if price_pattern.matches("$50") println("Matches price")

// To match a literal backslash var path_pattern = Regex.compile("C:\\\\Users") // Note: double backslash

if path_pattern.matches("C:\\Users") println("Matches Windows path")

Know Your Regex Dialect

Different tools support different features. Zebra uses Thompson NFA, which: - ✅ Supports basic patterns well - ✅ Has predictable performance (no catastrophic backtracking) - ⚠️ May not support all advanced features like lookahead

Check documentation for your version.


Practical Application: Log Analysis

// file: regex-log-analysis.zbr

// teaches: using regex for real log analysis // chapter: 21

def analyze_logs(filename as str) var result = File.read(filename) if result.isErr() println("Error: ${result.error()}") return

var content = result.value() var lines = content.split("\n")

var error_count = 0 var warning_count = 0 var error_lines = List(str)()

for line in lines if line.contains("[ERROR]") error_count = error_count + 1 error_lines.add(line) elif line.contains("[WARN]") warning_count = warning_count + 1

println("Log Analysis:") println(" Errors: ${error_count}") println(" Warnings: ${warning_count}")

if error_count > 0 println("\nErrors:") for error_line in error_lines println(" ${error_line}")

def main() analyze_logs("app.log")


Key Takeaways

1. Regex for Patterns — Use for validation, searching, and pattern-based extraction.

2. Not for Parsing — Use a real parser for JSON, XML, structured formats.

3. Be Specific — Avoid greedy patterns. Use character classes to narrow matches.

4. Test Thoroughly — Regex bugs are subtle. Test edge cases.

5. Document Your Patterns — Future you will thank you.

6. Simple First — Is .contains() sufficient? Use it instead of regex overhead.


Exercises

1. URL Extractor — Find all URLs in text matching http(s):// 2. Log Severity Counter — Count [ERROR], [WARN], [INFO] lines in a log file 3. Email List Validator — Read CSV, validate email column, report invalid entries 4. Phone Formatter — Read list of numbers in various formats, output consistent format 5. JSON Key Extractor — Extract all JSON key names from a file


What's Next

Chapter 22 covers FFI (Foreign Function Interface)—calling code written in other languages. Regexes are often used to parse data from external systems, making them a natural precursor.