
I'm working on a data lake solution for an IoT framework that performs 44 kHz data acquisition for a few dozen sensors (~990,000 measurements/second).

I would like suggestions on how to build an efficient data-ingestion solution using Java 11+, Apache Arrow, and Apache Parquet.

For data ingestion I am currently using the AvroParquetWriter implementation from https://github.com/apache/parquet-mr, and I would like to partition the dataset by two fields: timestamp and sensor name.
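
For context, my current (unpartitioned) writer looks roughly like this — a minimal sketch, assuming a hypothetical `Measure` Avro schema with `timestamp` (epoch millis), `sensor_name`, and `value` fields; the output path is a placeholder:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class Ingest {
    // Hypothetical measurement schema: one record per sample.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Measure\",\"fields\":["
      + "{\"name\":\"timestamp\",\"type\":\"long\"},"
      + "{\"name\":\"sensor_name\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"double\"}]}");

    public static void main(String[] args) throws Exception {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("/data/lake/measures.parquet"))
                     .withSchema(SCHEMA)
                     .withConf(new Configuration())
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            GenericRecord r = new GenericData.Record(SCHEMA);
            r.put("timestamp", System.currentTimeMillis());
            r.put("sensor_name", "sensor-01");
            r.put("value", 0.42);
            writer.write(r);
        }
    }
}
```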

I'm not finding examples of creating partitioned datasets in this API.

I am open to switching away from AvroParquetWriter. Furthermore, the solution does not need to support distributed, clustered processing; writing the partitions into separate directories on the local filesystem is enough.
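
Absent a built-in partitioned-dataset writer in this API, the workaround I am considering is hand-rolled Hive-style partitioning: derive a partition key from each record and keep one open `ParquetWriter` per partition directory. Below is a minimal sketch, assuming the hypothetical `Measure` schema above and a `sensor_name=.../date=...` path layout:

```java
import java.io.IOException;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

/** Keeps one ParquetWriter per Hive-style partition directory. Sketch, not production code. */
public class PartitionedWriter implements AutoCloseable {

    private final String rootDir;
    private final Map<String, ParquetWriter<GenericRecord>> writers = new HashMap<>();

    public PartitionedWriter(String rootDir) {
        this.rootDir = rootDir;
    }

    public void write(GenericRecord record) throws IOException {
        // Derive the partition key: sensor name plus the UTC day of the epoch-millis timestamp.
        String sensor = record.get("sensor_name").toString();
        LocalDate day = Instant.ofEpochMilli((Long) record.get("timestamp"))
                               .atZone(ZoneOffset.UTC)
                               .toLocalDate();
        String partition = "sensor_name=" + sensor + "/date=" + day;

        // Lazily open one writer per partition directory (computeIfAbsent is awkward
        // here because the builder throws a checked IOException).
        ParquetWriter<GenericRecord> writer = writers.get(partition);
        if (writer == null) {
            writer = AvroParquetWriter.<GenericRecord>builder(
                            new Path(rootDir + "/" + partition + "/part-0.parquet"))
                    .withSchema(record.getSchema())
                    .withConf(new Configuration())
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build();
            writers.put(partition, writer);
        }
        writer.write(record);
    }

    @Override
    public void close() throws IOException {
        for (ParquetWriter<GenericRecord> writer : writers.values()) {
            writer.close();
        }
    }
}
```

As far as I understand, DataFusion can expose such path segments as partition columns when the table is registered accordingly; note also that the Hive-style convention is usually to omit the partition columns from the files themselves and recover them from the path, so whether the file schema needs adjusting may depend on the query engine.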

By the way, I currently use DataFusion to query the datasets written by AvroParquetWriter. Data ingestion performance is already satisfactory; my interest in partitioning the data serves the purpose of improving query performance.

Regards

João Paraná
  • related question: https://stackoverflow.com/questions/72233354/how-to-read-parquet-files-into-tables-in-java-using-apache-arrow – cpchung Jun 19 '22 at 02:13
  • 1
    It sounds like you have a partial? solution at this point. Do you have any data on the performance that you see now? – L. Blanc Jun 20 '22 at 15:24
  • Hi @L.Blanc, data ingestion performance is satisfactory. My interest in partitioning the data serves the purpose of improving query performance. – João Paraná Jun 20 '22 at 17:12
