Implementing simplify for starts_with in Apache DataFusion

blog link

Apache DataFusion is a Rust-native query engine with a powerful optimizer. One key component of its optimizer is expression simplification, often referred to as the ,simplify function. In this post, we'll walk through how we implemented a simplify rule for the starts_with string function in DataFusion, turning it into an equivalent LIKE pattern. We'll cover a high-level overview of DataFusion's simplify mechanism, why this change was needed, how we implemented it step by step (including basic regex handling and tests), and the benefits it brings to both the DataFusion project and Rust developers.

Note: Portions of this article were drafted with the help of an AI writing assistant, then heavily edited by me for clarity and accuracy.

What is DataFusion's simplify Function?

In DataFusion, simplify is an optimization mechanism that rewrites expressions into simpler or more efficient forms during query planning. This includes things like constant folding (e.g. replacing 1 + 2 with 3) and algebraic rewrites. Each built-in function or expression can provide its own simplify logic. Thanks to DataFusion's well-structured design, these simplification rules are modular – typically implemented as a method on the function's definition – making it easy to add new rules without touching the rest of the codebase. For example, the starts_with function in DataFusion has a fn simplify(...) method where we can plug in custom rewrite logic​. The optimizer automatically applies these rules, so if a function's arguments allow a simplification, DataFusion will use the simpler form in the query plan.

Why Simplify starts_with?

The starts_with(string, prefix) function returns true if string begins with the given prefix. Without simplification, starts_with would be treated as a black-box scalar function in the query plan. By introducing a simplification rule, we can rewrite starts_with(col, 'prefix') into a standard SQL pattern match col LIKE 'prefix%'.

This has big benefits:

SQL engines (and DataFusion) know how to optimize LIKE 'prefix%' patterns for string columns. In particular, converting to LIKE enables predicate pruning based on prefix filters​. In other words, DataFusion can use file metadata (like min/max column values) to skip reading data that can't match the prefix, significantly improving performance for queries that filter on string prefixes.

Thanks to the PR that implements predicate pruning: https://github.com/apache/datafusion/pull/12978

Another benefit is that the logical plan becomes more transparent. A LIKE 'prefix%' is a recognizable operation for developers and other optimizations, whereas a custom function call might be harder to leverage. Overall, simplifying starts_with makes the query plan more efficient and easier to reason about.

Implementing the Simplification for starts_with

Thanks to DataFusion's modular architecture, adding this optimization was straightforward. All the logic for starts_with is encapsulated in its own module, so we only needed to modify that and add tests. Here are the steps we took to implement the simplify rule for starts_with:

  1. Introducing the simplify logic: We added a new simplify method in the starts_with function implementation. This method inspects the function's arguments. If the second argument (the prefix) is a literal string, we can simplify. DataFusion provides an ExprSimplifyResult type to indicate whether an expression was simplified or left unchanged. In our case, when we detect a literal prefix, we'll return a Simplified result with a new expression.

  2. Converting to a LIKE expression (basic regex handling): If the prefix is literal, we construct a new pattern for a SQL LIKE. Essentially, we take the prefix and append a % wildcard to match any suffix. For example, starts_with(name, "Ja") becomes name LIKE "Ja%". We also had to handle any special characters in the prefix. In SQL LIKE patterns, the % and _ characters have special meaning (any sequence of characters and any single character, respectively). We made sure to escape these in the prefix before appending % so that, for instance, a literal prefix "ja%" is treated as "ja\%" in the pattern (meaning the string "ja%") and then becomes "ja\%%" after adding the wildcard​. This way, the semantics of starts_with (which treats the prefix as literal text) are preserved in the LIKE expression.

The core snippet of the implementation looked like this:

if let Expr::Literal(ScalarValue::Utf8(Some(prefix))) = &args[1] {
    // Found constant prefix, convert to LIKE pattern
    let escaped = prefix.replace("%", "\\%").replace("_", "\\_");
    let pattern = format!("{}%", escaped);
    let like_expr = Expr::Literal(ScalarValue::Utf8(Some(pattern)));
    return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
        negated: false,
        expr: Box::new(args[0].clone()),    // the string column
        pattern: Box::new(like_expr),       // the 'prefix%' pattern
        escape_char: None,
        case_insensitive: false,
    }))));
}
// If not a literal prefix, no change
return Ok(ExprSimplifyResult::Original(args));

In practice, we wrote similar logic for all relevant string types (Utf8, LargeUtf8, and the newer Utf8View) since DataFusion supports multiple string array types. The idea is the same: detect literal prefix, build the 'prefix%' pattern, wrap it in a LIKE expression node, and mark the expression as simplified.

  1. Updating tests: With the implementation in place, we updated and added tests to ensure everything works as expected. We wrote unit tests for starts_with to confirm that the function still returns correct boolean results for various cases (e.g. "alphabet", "alph" -> true, "alphabet", "bet" -> false, etc.)​. We paid special attention to edge cases like an empty prefix (which should always yield true as long as the string is not NULL, since every string starts with ""). We also added an integration test using DataFusion's SQL engine to verify the optimization kicks in. For example, using an EXPLAIN query on a filter with starts_with, we should see the plan has been rewritten to use LIKE. After our change, an EXPLAIN SELECT * FROM my_table WHERE starts_with(col, 'f') showed a filter of col LIKE 'f%' in the logical plan, and the physical plan included a pruning predicate based on the prefix​. This confirmed that predicate pushdown was working: DataFusion was able to determine, for instance, that only values between "f" and "g" (non-inclusive of "g") could match f%, and use that to skip irrelevant data. All new tests passed, demonstrating that the simplify rule behaves correctly.

Why DataFusion's Design Made This Easy

Implementing this feature highlighted how well-structured DataFusion is. The codebase cleanly separates concerns: the logic for each function (including execution and optimization rules) lives in one place. We didn't have to tinker with the core optimizer loop or planner; we just implemented the simplify method for starts_with and the existing optimization framework took care of invoking it. DataFusion already had patterns for similar rewrites (for instance, simplifying certain regex patterns to equality checks in the past), so we were able to follow an established approach. This modular design meant less risk of breaking unrelated parts of the system and made the code review process smoother. It's a testament to DataFusion's extensibility that a contributor could add such an optimization in a relatively small, focused PR.

Benefits and Impact

This enhancement brings several benefits:

Conclusion

Implementing the simplify rule for starts_with was a satisfying journey that improved DataFusion's optimizer and demonstrated the strength of its design. In just a few lines of code, we unlocked an optimization that can benefit many real-world queries. This change makes DataFusion smarter about string predicates, allowing it to skip unnecessary work and respond faster. For DataFusion users, it means you can write queries with starts_with and get performance similar to writing a LIKE by hand – the engine does it for you. For Rust developers, it's an example of how contributing to an open-source project like DataFusion can be straightforward thanks to clear abstraction boundaries. We hope this encourages more contributions and explorations into DataFusion's optimization capabilities. After all, as we've seen, a simple change (pun intended) can go a long way in making your queries run simply faster!

© Jagdish Parihar.RSS