Skip to main content

Command Palette

Search for a command to run...

Apache Beam: Filter

Use Apache Beam's built-in Filter Transform to Simplify your Pipelines

Updated
2 min read

Overview

Have you ever wanted to filter data from a PCollection?

With the Filter transform, you can!

Given a function that takes an input element and returns True or False, the Filter transform will only allow elements that return True to proceed through the pipeline and to the next transform. Also, if the elements of the input PCollection are Java Objects that are Comparable, you can use built-in Filters which will filter out elements based on the elements' natural ordering (see below for examples).

When You Should Use the Filter Transform

You should use the Filter transform whenever you need to clean a PCollection based on certain criteria.

How to Use the Filter Transform

Using this transform is quite easy. All you need is a function which returns a boolean (which includes lambda functions) for whether you want to keep the input element (True) or discard the element (False). It will then allow all elements which returned True to go to the next transform in the pipeline.

Example: Filtering with a function

PCollection<String> allStrings = Create.of("Hello", "world", "hi");
PCollection<String> longStrings = allStrings
    .apply(Filter.by(new SerializableFunction<String, Boolean>() {
      @Override
      public Boolean apply(String input) {
        return input.length() > 3;
      }
    }));

Example: Filtering with a Builtin Filter

PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.greaterThan(3));
PCollection<Long> smallNumbers = numbers.apply(Filter.lessThanEq(3));

Example: Filtering with a Lambda

PCollection<Long> numbers = Create.of(2L, 3L, 4L, 5L);
PCollection<Long> bigNumbers = numbers.apply(Filter.by(number -> number > 3));

Conclusion

Instead of writing a custom Do Function to filter elements in a PCollection, you should use the Filter Transform.

Check out other useful transforms from the official Apache Beam documentation.

Apache Beam and Google Cloud Dataflow

Part 11 of 12

Dive into the world of scalable data processing with our comprehensive series on Apache Beam and Google Cloud Dataflow.

Up next

Apache Beam: MapElements and FlatMapElements Transforms

Use Built-in Apache Beam functions to simplify your pipelines

More from this blog

Nikhil Rao's Blog

18 posts

For the love of Data