Apache Beam: Filter

Use Apache Beam's built-in Filter Transform to Simplify your Pipelines

·

2 min read

Overview

Have you ever wanted to filter data from a PCollection?

With the Filter transform, you can!

Given a function that takes an input element and returns True or False, the Filter transform will only allow elements that return True to proceed through the pipeline and to the next transform. Also, if the elements of the input PCollection are Java Objects that are Comparable, you can use built-in Filters which will filter out elements based on the elements' natural ordering (see below for examples).

When You Should Use the Filter Transform

You should use the Filter transform whenever you need to clean a PCollection based on certain criteria.

How to Use the Filter Transform

Using this transform is quite easy. All you need is a function which returns a boolean (which includes lambda functions) for whether you want to keep the input element (True) or discard the element (False). It will then allow all elements which returned True to go to the next transform in the pipeline.

Example: Filtering with a function

PCollection<String> allStrings = Create.of("Hello", "world", "hi");
PCollection<String> longStrings = allStrings
    .apply(Filter.by(new SerializableFunction<String, Boolean>() {
      @Override
      public Boolean apply(String input) {
        return input.length() > 3;
      }
    }));

Example: Filtering with a Builtin Filter

PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.greaterThan(3));
PCollection<Long> smallNumbers = numbers.apply(Filter.lessThanEq(3));

Example: Filtering with a Lambda

PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.by(number -> number > 3));

Conclusion

Instead of writing a custom Do Function to filter elements in a PCollection, you should use the Filter Transform.

Check out other useful transforms from the official Apache Beam documentation.