Apache Beam: Regex

Overview

Do you have a custom DoFn that applies a regular expression (regex) pattern to each input element? Well, did you know there is a built-in Apache Beam transform called Regex, which can simplify your code?

When You Should Use the Regex Transform

You should use the Regex transform if you want to do any of the following for every element in a PCollection of strings:

  1. Keep only the strings that match a certain pattern

  2. Replace the parts of a string that match a pattern with another string

  3. Split a string around a delimiter pattern

How to Use the Regex Transform

Just apply the built-in transform to a PCollection of strings. Use Regex.matches to keep only the input elements that match a regex pattern. Use Regex.replaceAll to replace the matching substrings in a string. Use Regex.split to split a string into multiple strings around a delimiter pattern.
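To make the replace and split behavior concrete, here is a minimal sketch in plain Java using java.util.regex. It is not Beam code: the class and method names are hypothetical, and it assumes the Beam transforms apply the pattern to each element independently in the same way. Dropping empty pieces after a split is a choice made in this sketch; check the Beam API documentation for what Regex.split does by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; these names are not part of Apache Beam.
class RegexPerElementSketch {

  // Replace every occurrence of the pattern in each element (one output per input).
  static List<String> replaceAll(List<String> input, String regex, String replacement) {
    Pattern pattern = Pattern.compile(regex);
    List<String> output = new ArrayList<>();
    for (String element : input) {
      output.add(pattern.matcher(element).replaceAll(replacement));
    }
    return output;
  }

  // Split each element around the delimiter pattern and flatten the pieces
  // into one list (zero or more outputs per input). Empty pieces are skipped.
  static List<String> split(List<String> input, String regex) {
    Pattern pattern = Pattern.compile(regex);
    List<String> output = new ArrayList<>();
    for (String element : input) {
      for (String piece : pattern.split(element)) {
        if (!piece.isEmpty()) {
          output.add(piece);
        }
      }
    }
    return output;
  }

  public static void main(String[] args) {
    // prints [2024/01/15, 2024/02/01]
    System.out.println(replaceAll(List.of("2024-01-15", "2024-02-01"), "-", "/"));
    // prints [a, b, c, d, e]
    System.out.println(split(List.of("a,b,c", "d,,e"), ","));
  }
}
```

Note the difference in shape: replacing yields exactly one output string per input string, while splitting can turn one input string into several output strings.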

Example: Filter Strings with Email Regex Pattern

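// Imports needed by this snippet
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Regex;
import org.apache.beam.sdk.values.PCollection;

// Create a collection of strings (`pipeline` is an existing Pipeline object)
PCollection<String> emails =
    pipeline.apply(
        Create.of(
            "johndoe@gmail.com",
            "sarahsmith@yahoo.com",
            "mikebrown@outlook.com",
            "amandajohnson",
            "davidlee",
            "emilyrodriguez"));

// Keep only the strings that match the email regex
PCollection<String> result =
    emails.apply(Regex.matches("([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})"));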
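If you want to sanity-check the email pattern before running a pipeline, a quick local test with java.util.regex can help. The sketch below is plain Java, not Beam; the class and method names are hypothetical, and it assumes that Regex.matches keeps an element only when the whole string matches the pattern, as Matcher.matches() does.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical local check for illustration; not part of Apache Beam.
class EmailPatternCheck {

  // Same pattern as in the pipeline example above.
  static final Pattern EMAIL =
      Pattern.compile("([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})");

  // Keep only the elements whose entire text matches the pattern.
  static List<String> keepFullMatches(List<String> input) {
    return input.stream()
        .filter(element -> EMAIL.matcher(element).matches())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> emails =
        List.of(
            "johndoe@gmail.com",
            "sarahsmith@yahoo.com",
            "mikebrown@outlook.com",
            "amandajohnson",
            "davidlee",
            "emilyrodriguez");
    // prints [johndoe@gmail.com, sarahsmith@yahoo.com, mikebrown@outlook.com]
    System.out.println(keepFullMatches(emails));
  }
}
```

Under the full-match assumption, a string such as "contact: johndoe@gmail.com" would be dropped, because the text around the address is not covered by the pattern.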

Conclusion

Check out the other useful transforms in the official Apache Beam documentation.