Apache Beam: Regex
Overview
Do you have a custom DoFn that applies a regular expression (regex) pattern to each input element? Well, did you know there is a built-in Apache Beam transform called Regex that can simplify your code?
When You Should Use the Regex Transform
You should use the Regex transform if you want to do any of the following for every element in a PCollection of strings:
Filter for strings matching a certain pattern
Replace every substring that matches a pattern with another string
Split a string into pieces on a delimiter pattern
How to Use the Regex Transform
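Just apply the built-in transform to a PCollection of strings. In the Java SDK, each operation is created through a static factory method on the Regex class. Use Regex.matches to keep only the input elements that match a regex pattern; the whole element has to match, not just part of it. Use Regex.replaceAll to replace every substring that matches the pattern with a replacement string. Use Regex.split to split each string into multiple strings using a delimiter pattern.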
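Before wiring a pattern into a pipeline, it can help to check what the replace and split operations described above would do to a single element. The sketch below uses plain java.util.regex rather than Beam itself, on the assumption that Beam's Java Regex transforms apply the same pattern semantics to each element (they accept java.util.regex.Pattern objects). The class name, method names, and sample inputs are hypothetical, and dropping empty pieces after a split is an assumption about the default behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for trying out patterns locally; not part of Apache Beam.
public class RegexPerElementSketch {

  // Replaces every match of the regex in one element with the replacement.
  public static String replaceAll(String element, String regex, String replacement) {
    return Pattern.compile(regex).matcher(element).replaceAll(replacement);
  }

  // Splits one element on the delimiter regex.
  // Empty pieces are dropped here (assumed default, see the lead-in).
  public static List<String> split(String element, String regex) {
    List<String> pieces = new ArrayList<>();
    for (String piece : Pattern.compile(regex).split(element)) {
      if (!piece.isEmpty()) {
        pieces.add(piece);
      }
    }
    return pieces;
  }

  public static void main(String[] args) {
    System.out.println(replaceAll("2024-01-15", "-", "/")); // prints 2024/01/15
    System.out.println(split("a, b,,c", ",\\s*"));          // prints [a, b, c]
  }
}
```

In a pipeline, one input element can produce several output elements after a split, which is why the sketch returns a list for that case and a single string for the replacement.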
Example: Filter Strings with Email Regex Pattern
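// Imports needed by this snippet
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Regex;
import org.apache.beam.sdk.values.PCollection;

// Assumes `pipeline` is an existing org.apache.beam.sdk.Pipeline.
// Create a collection of strings
PCollection<String> emails =
    pipeline.apply(
        Create.of(
            "johndoe@gmail.com",
            "sarahsmith@yahoo.com",
            "mikebrown@outlook.com",
            "amandajohnson",
            "davidlee",
            "emilyrodriguez"));

// Keep only the strings that match the email regex in full
PCollection<String> result =
    emails.apply(Regex.matches("([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})"));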
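To see which of the sample strings should survive the filter, the same pattern can be checked outside a pipeline. This is a minimal sketch using plain java.util.regex with whole-string matching, which is assumed to mirror how Regex.matches treats each element; the class and method names are hypothetical. Unlike a PCollection, the list here has a fixed order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical local check of the email pattern; not part of Apache Beam.
public class EmailFilterSketch {

  // Same pattern as in the pipeline example.
  static final Pattern EMAIL =
      Pattern.compile("([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})");

  // Keeps only the inputs that match the pattern from start to end.
  public static List<String> keepMatches(List<String> inputs) {
    return inputs.stream()
        .filter(s -> EMAIL.matcher(s).matches())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> inputs = Arrays.asList(
        "johndoe@gmail.com",
        "sarahsmith@yahoo.com",
        "mikebrown@outlook.com",
        "amandajohnson",
        "davidlee",
        "emilyrodriguez");
    // prints [johndoe@gmail.com, sarahsmith@yahoo.com, mikebrown@outlook.com]
    System.out.println(keepMatches(inputs));
  }
}
```

Because the match covers the whole string, an input such as "contact: johndoe@gmail.com" would be dropped even though it contains a valid address.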
Conclusion
Check out the other useful built-in transforms in the official Apache Beam documentation.