A short summary of how I prepared to contribute to Apache DataFusion.
Why Open Source Contribution?
I always wanted to contribute to a widely used open source project. I tried and failed many times while I was in college, I think because I lacked confidence. Over time I have gained confidence through hard work and experience. There is a longing to prove to myself that I can understand a huge piece of software, reason about its various parts, understand why it was made, and maybe envision how it should evolve.
Why Apache DataFusion?
My everyday work uses Apache Spark and Trino. I noticed Big Data tooling moving from Java to Rust, mainly through my Twitter feed [1]. I wanted to be a part of an Apache project too because it's cool. Also, I was really inspired by Andrew Lamb’s YouTube video on DataFusion [2]. So I decided this is it.
Prerequisites
These were the things I needed to know before I started 😶:
- Rust
  - Basic syntax, borrowing, multi-threading.
  - Writing good Rust code.
- Understanding of the Apache DataFusion codebase
Meeting Prerequisites
I started studying Rust on June 22, 2024 (yes, I know the exact date!). I followed the book The Rust Programming Language [3] and read it end to end, writing code for the examples and the projects. That gave me a very basic understanding of Rust.
I tried to go through the codebase of DataFusion [4] to check whether I was ready. Surprisingly, I was not. Totally unfair.
Then I came across a book called How Query Engines Work by Andy Grove [5]. I really liked the book. It's more practical than theoretical. In short, it describes how he implemented a query engine in Kotlin from scratch and which modules he built, progressing step by step. The full code for the book is also available on GitHub for reference.
The game changed for me when I started implementing the chapters of the book on my own in Rust to build a basic query engine. I think it pushed me to understand more Rust patterns and nuances. I built it up to the execution of an aggregate expression. While implementing the query engine I peeked at the code of Apache DataFusion to see how they handled the things I was having a hard time with. This helped me get familiar with the DataFusion code.
I spent way too much time implementing the query engine in Rust: most of August 2024 went into just that. Then I thought maybe I was ready to make a contribution to DataFusion, so I tried again. This time I was able to understand the pieces of code in DataFusion. Everything felt less intimidating. I searched the GitHub issues for one that I could help solve. I found this issue to add a new array function that returns the first non-null value from an array:
select array_any([NULL, 1, 2, 3]);
----
1
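To make the expected behaviour concrete, here is a minimal plain-Rust sketch of the same semantics. It is not DataFusion code: `first_non_null` is a hypothetical name, and `Option` stands in for SQL NULL.

```rust
/// Returns the first non-null element of a list, or `None` when the list
/// is empty or contains only nulls. `None` stands in for SQL NULL here.
/// (Hypothetical helper for illustration, not part of DataFusion.)
fn first_non_null<T: Copy>(list: &[Option<T>]) -> Option<T> {
    // `find_map` stops at the first element for which the closure yields `Some`.
    list.iter().find_map(|v| *v)
}
```

Calling `first_non_null(&[None, Some(1), Some(2), Some(3)])` gives `Some(1)`, matching the query above.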
Actually Solving the GitHub Issue
There were enough templates of array functions lying around in the DataFusion codebase that I thought this would be a piece of cake (poor me). I read the other array function implementations. I was able to set up the new function, but in the end I was stuck on implementing the core logic. These are the things that blocked me:
- What on earth is a GenericListArray?
- What is a MutableArrayData?
- What is Arrow and why is it so famous? (I only had a one-line understanding of Arrow at this point)
I felt that my lack of understanding of Arrow was what was blocking me, so I spent some time on understanding Arrow. The easiest way to understand Arrow is to read the Arrow Columnar Format specification [6]. Reading it, I understood that nulls are first-class citizens in the Arrow format, and the issue I needed to solve was also about nulls. (Coincidence? I think not.)
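What "first-class nulls" means in practice: the columnar format keeps a validity bitmap next to the values, one bit per slot, least-significant bit first, where a set bit marks a non-null slot. The sketch below models that in plain Rust without the arrow crate; `is_valid` and `pack_validity` are hypothetical helper names chosen for illustration.

```rust
/// Returns true if slot `i` is valid (non-null) according to an
/// Arrow-style validity bitmap: one bit per slot, least-significant
/// bit first, 1 = valid, 0 = null.
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// Packs a slice of flags (true = valid) into a validity bitmap.
/// (Hypothetical helper, only here to build test data.)
fn pack_validity(valid: &[bool]) -> Vec<u8> {
    // Round up to whole bytes; unused trailing bits stay 0.
    let mut bitmap = vec![0u8; (valid.len() + 7) / 8];
    for (i, &v) in valid.iter().enumerate() {
        if v {
            bitmap[i / 8] |= 1u8 << (i % 8);
        }
    }
    bitmap
}
```

For the array `[NULL, 1, 2, 3]`, `pack_validity(&[false, true, true, true])` produces the single byte `0b0000_1110`, so checking for a null is a bit test and never touches the values.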
Next, I spent an ample amount of time understanding how different array functions like array_element and array_length were implemented in DataFusion end-to-end. This time I was able to understand. I understood what GenericListArray is and why it is important in the context of DataFusion’s array functions. Turns out GenericListArray represents a column of a DataFusion table or dataframe in which every row holds a list. One needs to be careful while traversing the GenericListArray so that no data copying is performed. I also came to understand that MutableArrayData is an efficient way to build the Array that the function I was implementing returns.
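The list layout described in the Arrow specification can be sketched in a few lines: all child values live in one flat buffer, and an offsets buffer with one more entry than there are rows marks where each row's list starts and ends. The `ListColumn` type below is a simplified stand-in written for this post, not the real GenericListArray (which uses 32-bit or 64-bit integer offsets and also carries validity information).

```rust
/// A simplified stand-in for an Arrow list column of i64 values.
/// (Hypothetical type for illustration, not the arrow crate's API.)
struct ListColumn {
    /// `offsets.len() == number_of_rows + 1`; row `i` spans
    /// `values[offsets[i]..offsets[i + 1]]`.
    offsets: Vec<usize>,
    /// All child values of all rows, stored back to back.
    values: Vec<i64>,
}

impl ListColumn {
    /// Number of rows (lists) in the column.
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Borrows the list stored in row `i` as a slice of the shared
    /// values buffer, so nothing is copied.
    fn row(&self, i: usize) -> &[i64] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

The rows `[[1, 2, 3], [], [4, 5]]` would be stored as `offsets = [0, 3, 3, 5]` and `values = [1, 2, 3, 4, 5]`. Because `row` returns a borrowed slice, walking the column this way performs no data copying.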
I felt like I could use the null buffers of Arrow to speed up this function, and this time I was able to implement the logic. Everything aligned. I used the nulls buffer to find non-null entries quickly and the offsets buffer to traverse the GenericListArray faster.
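A rough sketch of how the two buffers can work together, in plain Rust. This is not the code from the PR and does not use MutableArrayData: the function name and signature are hypothetical, offsets are `usize` for simplicity, and null list rows are ignored.

```rust
/// For each list row, returns the first non-null child value, or `None`
/// if the row is empty or holds only nulls.
///
/// Assumptions (simplified, not the arrow crate's API):
/// - `offsets` has one more entry than there are rows,
/// - `values` is the flat child buffer,
/// - `validity` has one bit per child slot, LSB first, 1 = non-null.
fn first_non_null_per_row(
    offsets: &[usize],
    values: &[i64],
    validity: &[u8],
) -> Vec<Option<i64>> {
    offsets
        .windows(2) // each window is (start, end) of one row
        .map(|w| {
            (w[0]..w[1])
                // Test only the validity bit; values are read once, on a hit.
                .find(|&i| (validity[i / 8] & (1u8 << (i % 8))) != 0)
                .map(|i| values[i])
        })
        .collect()
}
```

For the rows `[[NULL, 1, 2, 3], [NULL, NULL], [], [7]]`, the buffers are `offsets = [0, 4, 6, 6, 7]`, `values = [0, 1, 2, 3, 0, 0, 7]` (null slots hold arbitrary placeholders) and `validity = [0b0100_1110]`, and the function returns `[Some(1), None, None, Some(7)]`.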
Once I felt the code was ready, I wrote test cases using sqllogictest [7] to cover all the cases. This was easier because I could follow the pattern of the existing test cases in the DataFusion codebase.
I also had to update the docs and user guide for the new array function. With that, the development work for this issue was complete. I filed a PR with my changes and got it approved.
Auxiliaries
During my journey, I set up nvim for Rust using NvChad and rustaceanvim. I used to be a VS Code guy. This really made me feel like a cool programmer. (I wish I could say this improved my coding efficiency, but the speed at which I was coding was never limited by the text editor I was using 🥲).
Conclusion
Anyone can contribute to a project like Apache DataFusion by working consistently to tackle the missing prerequisites.
Future
Going to keep up my contribution game. Let's see where it goes.
I have a ton of things to learn as well. Here is a small subset:
- There is a new kid on the block: StringView.
- Have not checked out Tokio in all its glory.
- Complete my own personal query engine in Rust, which I have paused.
References
1. Andy Grove’s take on Rust and Big Data – https://andygrove.io/2018/01/rust-is-for-big-data/
2. SIGMOD 2024 Practice: Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine – https://www.youtube.com/watch?v=-DpKcPfnNms
3. The Rust Programming Language – https://doc.rust-lang.org/book/
4. Apache DataFusion – https://github.com/apache/datafusion
5. How Query Engines Work – https://howqueryengineswork.com/
6. Arrow Columnar Format – https://arrow.apache.org/docs/format/Columnar.html
7. sqllogictest for testing SQL queries against a database – https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki