Exploring the Apache ecosystem for data analysis

By Anais Dotis-Georgiou

The Apache Software Foundation develops and maintains open source software projects that significantly impact various domains of computing, from web servers and databases to big data and machine learning. As the volume and velocity of time series data continue to grow, thanks to IoT devices, AI, financial systems, and monitoring tools, more and more companies will rely on the Apache ecosystem to manage and analyze this kind of data.

This article provides a brief tour of the Apache ecosystem for time series data processing and analysis. It will focus on the FDAP stack—Flight, DataFusion, Arrow, and Parquet—as these projects particularly affect the transport, storage, and processing of large volumes of data.

How the FDAP stack enhances data processing

The FDAP stack brings enhanced data processing capabilities to large volumes of data. Apache Arrow acts as a cross-language development platform for in-memory data, facilitating efficient data interchange and processing. Its columnar memory format is optimized for modern CPUs and GPUs, enabling high-speed data access and manipulation. That matters for time series data, where queries typically scan and aggregate a few columns across many rows.

Apache Parquet, on the other hand, is a columnar storage file format that offers efficient data compression and encoding schemes. Its design is optimized for complex nested data structures and is ideal for batch processing of time series data, where storage efficiency and cost-effectiveness are critical.

Apache DataFusion builds on both Apache Arrow and Apache Parquet, providing a query engine that can execute complex SQL queries over data held in memory (Arrow) or stored in Parquet files. This integration allows for efficient analysis of time series data, combining the real-time capabilities of a time series database such as InfluxDB with the batch processing strengths of Parquet and the high-speed data processing capabilities of Arrow.

Specific advantages of using columnar storage for time series data include:

A seamlessly integrated data ecosystem

Leveraging the FDAP stack alongside InfluxDB facilitates seamless integration with other tools and systems in the data ecosystem. For instance, using Apache Arrow as a bridge enables easy data interchange with other analytics and machine learning frameworks, enhancing the analytical capabilities available for time series data. This interoperability helps build flexible and powerful data pipelines that can adapt to evolving data processing needs.

For example, many database systems and data tools have started supporting Apache Arrow to leverage its performance benefits and become part of the community. Some notable databases and tools in this camp include:

All of the databases that leverage Arrow Flight allow programmers to use the same boilerplate to query multiple sources. Pair this with the power of Pandas and Polars, and developers can easily unify data from multiple data stores and perform cross-platform data analytics and transformations. Take a look at the following blog posts to learn more: "Query a Database with Arrow Flight" and "Reading Table MetaData with Flight SQL."

Apache Parquet’s efficient columnar storage format makes it an excellent choice for AI and machine learning workflows, particularly those that involve large and complex data sets. Its popularity has led to support across various tools and platforms within the AI and machine learning ecosystem. Here are some examples:

Strong community support and innovation

The Apache ecosystem extends far beyond the FDAP stack. Being a part of the Apache ecosystem and contributing to upstream projects offers companies many technical and business advantages. These advantages include:

The Apache ecosystem has made notable contributions to the time series space. By offering a standardized, efficient, and hardware-optimized format for in-memory data, Apache Arrow enhances the performance and interoperability of existing database systems and sets the stage for the next wave of analytical data processing technologies. Apache Parquet provides an efficient, durable file format, easing the transport of data sets between analytics tools. And DataFusion provides a unified way to query disparate systems.

As the Apache ecosystem evolves and improves further, its influence on database technologies will continue to expand, enriching the tool set available to data professionals working not only with time series data but data of all kinds.

Anais Dotis-Georgiou is lead developer advocate at InfluxData.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.