Hybrid Cloud Data Replication at Uber

Uber’s engineering team has reworked its data replication platform to move petabytes of data daily between cloud and on-premises data lakes, addressing scaling problems caused by rapidly growing workloads.

Scaling Challenges and Solutions

The platform, HiveSync, is built on Apache Hadoop’s open-source DistCp (distributed copy) framework. It now handles over one petabyte of daily replication across hundreds of thousands of jobs, with improved speed, reliability, and observability.

Key optimizations included moving preparation tasks to the Application Master, parallelizing listing and commit processes, and improving efficiency for small transfers. These changes enabled the platform to support cloud migration and Uber’s active-passive data lake model.

Architecture Enhancements

The HiveSync team enhanced DistCp by reducing HDFS client contention and cutting job submission latency by up to 90 percent. The Copy Listing and Copy Committer tasks were parallelized so that multiple files are processed concurrently while block order is preserved.
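The idea of listing many directories concurrently while keeping a deterministic order can be sketched as follows. This is a minimal illustration, not DistCp's or HiveSync's code: it uses the local filesystem as a stand-in for HDFS, and the names list_files, parallel_copy_listing, and build_demo_tree are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_files(directory):
    """Return sorted (path, size) pairs for regular files directly under `directory`."""
    with os.scandir(directory) as entries:
        return sorted(
            (entry.path, entry.stat().st_size)
            for entry in entries
            if entry.is_file()
        )


def parallel_copy_listing(directories, max_workers=8):
    """List several directories concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so the combined
        # listing is deterministic even though the work runs concurrently.
        per_directory = list(pool.map(list_files, directories))
    return [item for listing in per_directory for item in listing]


def build_demo_tree(root, layout):
    """Create `layout` ({subdir: {filename: size}}) under `root` for the demo."""
    directories = []
    for subdir, files in layout.items():
        path = os.path.join(root, subdir)
        os.makedirs(path)
        for name, size in files.items():
            with open(os.path.join(path, name), "wb") as handle:
                handle.write(b"x" * size)
        directories.append(path)
    return directories


with tempfile.TemporaryDirectory() as root:
    dirs = build_demo_tree(root, {"a": {"1.bin": 3, "0.bin": 1}, "b": {"2.bin": 5}})
    listing = parallel_copy_listing(dirs)
    relative = [(os.path.relpath(path, root), size) for path, size in listing]

print(relative)
```

Threads suit this kind of work because listing is dominated by waiting on the filesystem (or, in the real system, on NameNode calls) rather than by computation.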

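For smaller jobs, the team enabled Hadoop’s uber job feature (a Hadoop term unrelated to the company name), which runs Copy Mapper tasks directly in the Application Master’s JVM instead of in separate containers. This eliminated roughly 268,000 container launches daily and improved YARN efficiency.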
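Whether a job is small enough to run in uber mode is governed in Hadoop by the mapreduce.job.ubertask.* settings (enable, maxmaps, maxreduces, maxbytes). The sketch below shows a simplified version of that size check. It is not Hadoop's implementation, which also considers memory and CPU requirements, and the names UberLimits and runs_as_uber_job are hypothetical. The 9-map and 1-reduce limits follow the documented Hadoop defaults; the 128 MiB byte limit assumes a typical HDFS block size, and Uber's actual thresholds are not given in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UberLimits:
    """Illustrative stand-in for the mapreduce.job.ubertask.* settings."""
    enabled: bool = True
    max_maps: int = 9                   # mapreduce.job.ubertask.maxmaps
    max_reduces: int = 1                # mapreduce.job.ubertask.maxreduces
    max_bytes: int = 128 * 1024 * 1024  # assumed block size for maxbytes


def runs_as_uber_job(num_maps, num_reduces, input_bytes, limits=UberLimits()):
    """Return True when the job is small enough to run inside the Application Master."""
    return (
        limits.enabled
        and num_maps <= limits.max_maps
        and num_reduces <= limits.max_reduces
        and input_bytes <= limits.max_bytes
    )


# A small copy job stays in the Application Master's JVM.
small_job = runs_as_uber_job(num_maps=2, num_reduces=0, input_bytes=10 * 1024 * 1024)
# A large copy job still gets its own containers.
large_job = runs_as_uber_job(num_maps=200, num_reduces=0, input_bytes=50 * 1024**3)

print(small_job, large_job)
```

Skipping the container launch matters most for jobs whose copy time is short compared with the time YARN needs to allocate and start a container.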

Results and Future Plans

These optimizations increased incremental replication capacity fivefold, enabling HiveSync to replicate over 300 PB during Uber’s on-premises-to-cloud migration without incidents. The team is now focusing on further parallelization, better resource management, and network efficiency.

Planned enhancements include parallelizing file permission setting and input splitting, moving compute-intensive commit tasks to the Reduce phase, and implementing a dynamic bandwidth throttler. Uber plans to contribute these improvements as an open-source patch, extending the broader community’s ability to manage extreme-scale hybrid cloud replication.
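The source does not describe how the planned dynamic bandwidth throttler would work. One common way to build such a component is a token bucket whose refill rate can be changed at runtime, sketched below purely as an assumption. The class name TokenBucketThrottler and its methods are hypothetical, and the clock is passed in explicitly so the example is deterministic.

```python
class TokenBucketThrottler:
    """Token bucket with an adjustable rate (hypothetical design, not Uber's)."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, now=0.0):
        self.rate = float(rate_bytes_per_sec)
        self.capacity = float(capacity_bytes)
        self.tokens = float(capacity_bytes)  # start with a full bucket
        self.last = float(now)

    def _refill(self, now):
        # Add the tokens earned since the last update, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def set_rate(self, rate_bytes_per_sec, now):
        # Settle tokens at the old rate first, then switch to the new rate.
        self._refill(now)
        self.rate = float(rate_bytes_per_sec)

    def delay_for(self, num_bytes, now):
        """Reserve `num_bytes`; return how many seconds the caller should wait."""
        self._refill(now)
        self.tokens -= num_bytes
        if self.tokens >= 0:
            return 0.0
        # A negative balance is debt that is repaid at the current rate.
        return -self.tokens / self.rate


bucket = TokenBucketThrottler(rate_bytes_per_sec=100, capacity_bytes=200, now=0.0)
first_delay = bucket.delay_for(200, now=0.0)   # burst fits in the bucket
second_delay = bucket.delay_for(100, now=0.0)  # must wait for 100 bytes at 100 B/s
bucket.set_rate(50, now=1.0)                   # e.g. the network is congested
third_delay = bucket.delay_for(100, now=1.0)   # same size now waits twice as long

print(first_delay, second_delay, third_delay)
```

In a real deployment the rate passed to set_rate would come from some measure of available network capacity. That feedback signal is outside the scope of this sketch.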