At Uber, we use HDFS as the long-term storage for all data. Most of this data comes from Kafka in Avro format and is persisted in HDFS as raw logs.

These logs are then merged into the long-term Parquet data format through a compaction process and made available via standard processing engines such as Hive, Presto, or Spark. These datasets constitute the source of truth for all analytical data.
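The core idea of compaction, merging append-only raw log records into a deduplicated snapshot keyed by record, can be shown with a toy sketch. This is plain Python over a hypothetical record shape (`key`, `offset`, `value`); the real pipeline compacts Avro logs into Parquet files using engines like Spark, not in-memory dicts:

```python
def compact(raw_logs):
    """Merge raw log records (dicts with 'key', 'offset', 'value'),
    keeping only the record with the highest offset per key."""
    latest = {}
    for record in raw_logs:
        key = record["key"]
        # A later offset supersedes earlier versions of the same key.
        if key not in latest or record["offset"] > latest[key]["offset"]:
            latest[key] = record
    # Emit in deterministic key order, as a sorted columnar file would.
    return [latest[k] for k in sorted(latest)]

raw = [
    {"key": "trip-1", "offset": 0, "value": "requested"},
    {"key": "trip-2", "offset": 1, "value": "requested"},
    {"key": "trip-1", "offset": 2, "value": "completed"},
]
print(compact(raw))
# [{'key': 'trip-1', 'offset': 2, 'value': 'completed'},
#  {'key': 'trip-2', 'offset': 1, 'value': 'requested'}]
```

Because the compacted output keeps only the latest version of each key, downstream readers see a consistent snapshot rather than the full raw change log.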

This source of truth is used to backfill data in Kafka, Pinot, and even some OLTP or key-value store data sinks. In addition, HDFS is used by other platforms for managing their own persistent storage.

For instance, Apache Flink uses HDFS to maintain job checkpoints, which consist of all input stream offsets as well as snapshots of each Flink job's internal state per container. Furthermore, Apache Pinot uses HDFS for long-term segment archival, which is crucial for repairing failed replicas and during server bootstrap.
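Pointing Flink's checkpoints at HDFS is a configuration concern. A minimal sketch of the relevant `flink-conf.yaml` entries, with an illustrative path and interval (not Uber's actual settings):

```yaml
# Checkpoints (stream offsets plus per-container state snapshots)
# are written to HDFS so jobs can recover after failures.
state.checkpoints.dir: hdfs:///flink/checkpoints   # illustrative path
execution.checkpointing.interval: 60s              # illustrative interval
```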

https://arxiv.org/pdf/2104.00087.pdf

