Apache Beam: Filter
Use Apache Beam's built-in Filter Transform to Simplify your Pipelines
Overview
Have you ever wanted to filter data from a PCollection?
With the Filter
transform, you can!
Given a function that takes an input element and returns True or False, the Filter transform will only allow elements that return True to proceed through the pipeline and to the next transform. Also, if the elements of the input PCollection are Java Objects that are Comparable
, you can use built-in Filters which will filter out elements based on the elements' natural ordering (see below for examples).
When You Should Use the Filter Transform
You should use the Filter
transform whenever you need to clean a PCollection based on certain criteria.
How to Use the Filter Transform
Using this transform is quite easy. All you need is a function which returns a boolean (which includes lambda functions) for whether you want to keep the input element (True) or discard the element (False). It will then allow all elements which returned True to go to the next transform in the pipeline.
Example: Filtering with a function
PCollection<String> allStrings = Create.of("Hello", "world", "hi");
PCollection<String> longStrings = allStrings
.apply(Filter.by(new SerializableFunction<String, Boolean>() {
@Override
public Boolean apply(String input) {
return input.length() > 3;
}
}));
Example: Filtering with a Builtin Filter
PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.greaterThan(3));
PCollection<Long> smallNumbers = numbers.apply(Filter.lessThanEq(3));
Example: Filtering with a Lambda
PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.by(number -> number > 3));
Conclusion
Instead of writing a custom Do Function to filter elements in a PCollection, you should use the Filter
Transform.
Check out other useful transforms from the official Apache Beam documentation.