How Uber used GraphQL and Kafka to build a scalable real-time chat system and improve revenue

Uber is one of the leading ride-sharing apps, with more than 149 million monthly active customers and 7.1 million drivers across 10,000+ cities. With great scale comes great responsibility, and a ton of customer queries!

Both customers and drivers contact Uber support through email, phone calls, and chat, from the app or the website, to get their problems resolved.

Email and voice channels are neither cost-efficient nor good for the user experience. Users have to wait a long time without knowing what is going on in the background, and nothing is worse than making your customers wait for a resolution.

On the other hand, with a real-time chat system, a single agent can respond to multiple users in parallel, reducing wait time as well as the cost to serve each customer. According to mirrorfly.com, 79% of customers prefer real-time chat options over emails or phone calls.

Now the questions:

  • How did Uber build such a high-impact feature to handle tons of user queries?
  • What problems did Uber face with its original chat system?
  • How did Uber optimize it to make it more efficient and save on expenses?
  • What key takeaways can we learn from Uber to build our own real-time chat system?

Intro

Before we look at Uber, you need to know the four things that make a chat system great:

  • Scalability: Your system needs to handle growing user volumes without compromising speed or efficiency. It should also scale down seamlessly when needed.
  • Reliability: Your solution needs to be reliable. A robust fault-tolerance mechanism and solid error handling are essential.
  • Low Latency: The message delivery time must be as low as possible. A good target for real-time chat is a latency under 150 ms.
  • Cost Effective: The cost to serve each customer should be kept as low as possible.

Coming back to Uber: its original chat system used the WAMP protocol for message delivery and WebSockets for real-time communication with agents and customers.

This system had two significant issues:

  • When traffic spiked to just 1.5 times its usual volume, a staggering 46% of messages to agents failed!
  • The outdated WAMP library couldn’t handle horizontal scaling, and upgrading it would have been a time-consuming nightmare.

Together, these two issues led to higher latency and higher expenses.

To solve all these issues, Uber decided to upgrade its chat system to make it more reliable and scalable.


After careful consideration, Uber’s dev team made two key technical decisions:

  • Using GraphQL subscriptions over WebSockets.
  • Using Apache Kafka as the messaging service on the backend.

Why Kafka?

  • Kafka has reliable and fast broadcasting (pub/sub) capabilities, which is why Uber chose it.
  • Many people think of Kafka as a message broker, but it is also a kind of database that stores messages in order and lets consumers read them back in that order.
  • It routes messages efficiently.
  • It copes well with nodes crashing or coming online.
  • For scaling, Kafka appends messages sequentially to a log and batches writes instead of committing each message as its own transaction. Conventional databases, by contrast, are usually limited by the number of input/output operations they can perform per second (IOPS) because of transaction boundaries, which can be really slow. The difference is roughly hundreds of messages per disk for a conventional database versus millions in Kafka.
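
To make the "ordered store plus pub/sub" idea above concrete, here is a minimal Python sketch of an append-only log with per-consumer offsets. It is a toy, in-memory stand-in and not Kafka's client API: the class TopicLog, its methods, and the consumer names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one ordered message log (NOT Kafka's API)."""

    def __init__(self):
        self._messages = []                 # append-only, ordered storage
        self._committed = defaultdict(int)  # consumer name -> next offset to read

    def append(self, message):
        """Store a message at the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return up to max_messages (offset, message) pairs, in log order."""
        start = self._committed[consumer]
        batch = self._messages[start:start + max_messages]
        return [(start + i, m) for i, m in enumerate(batch)]

    def commit(self, consumer, offset):
        """Record that the consumer has processed everything up to offset."""
        self._committed[consumer] = offset + 1


log = TopicLog()
for text in ["hi", "where is my driver?", "thanks"]:
    log.append(text)

# Two independent consumers each see the full stream (broadcast / pub-sub).
agent_view = log.poll("agent-ui")
audit_view = log.poll("audit")

# The agent UI commits only the first message, then "crashes" and restarts.
log.commit("agent-ui", agent_view[0][0])
after_restart = log.poll("agent-ui")
print(after_restart)
```

Because each consumer tracks its own offset, both readers see every message, and a reader that restarts resumes from its last committed position instead of losing what it had not yet processed.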

Why GraphQL?

Coming to GraphQL: it allowed for long-lived, data-emitting connections with agents. A GraphQL subscription is a long-lived read operation that emits data whenever a server-side event occurs.

  • The client sends subscription requests to the server, and the server pushes messages back to the agent machines. It works on a pub/sub model.
  • They also chose the graphql-ws library because it had 2.3 million weekly downloads, was recommended by Apollo, and had zero open issues.
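
The subscription pattern described above can be sketched in plain Python with asyncio. This is only an illustration of the pub/sub shape (a long-lived stream that yields data when the server publishes an event); it does not use graphql-ws or any GraphQL library, and the PubSub class, the agent helper, and the topic name "contact:42" are all hypothetical.

```python
import asyncio

_CLOSED = object()  # sentinel that tells a stream the server has closed it


class PubSub:
    """Minimal in-process pub/sub; a real system would sit on top of a broker."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of asyncio.Queue

    def subscribe(self, topic):
        """Register immediately and return an async stream of events."""
        queue = asyncio.Queue()
        self._subscribers.setdefault(topic, []).append(queue)
        return self._stream(topic, queue)

    async def _stream(self, topic, queue):
        try:
            while True:
                event = await queue.get()
                if event is _CLOSED:
                    return
                yield event  # emit data when a server-side event occurs
        finally:
            self._subscribers[topic].remove(queue)

    def publish(self, topic, event):
        """Fan the event out to every subscriber of the topic."""
        for queue in self._subscribers.get(topic, []):
            queue.put_nowait(event)

    def close(self, topic):
        self.publish(topic, _CLOSED)


async def agent(name, stream, inbox):
    """Stand-in for an agent client consuming its subscription."""
    async for event in stream:
        inbox.append((name, event))


async def main():
    pubsub = PubSub()
    inbox = []
    tasks = [
        asyncio.create_task(agent("agent-1", pubsub.subscribe("contact:42"), inbox)),
        asyncio.create_task(agent("agent-2", pubsub.subscribe("contact:42"), inbox)),
    ]
    pubsub.publish("contact:42", {"from": "rider", "text": "Where is my driver?"})
    pubsub.close("contact:42")
    await asyncio.gather(*tasks)
    return inbox


received = asyncio.run(main())
print(received)
```

In a GraphQL server the async stream would typically be returned by a subscription resolver and carried to the client over a WebSocket; here the queue plays the role of that connection.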

Impact of New System

This new system with Kafka and GraphQL brought significant improvements:

  • Increased User Volume: Now, a massive 36% of Uber’s user queries are routed to agents through chat!
  • Improved Reliability: Reliability improved drastically. Because of Kafka, the error rate dropped from 46% to a mere 0.45%, roughly a hundredfold reduction.
  • Simplified Architecture: Interestingly, the new architecture is also simpler, with fewer services and protocols to maintain.
  • Metrics: The new system provides valuable metrics on message delivery, system delays, and overall latency.
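
As a rough illustration of the kind of metrics listed above, the sketch below computes an error rate, a 95th-percentile delivery latency, and the share of messages delivered within the 150 ms target. The Delivery record, the summarize function, and the sample timestamps are hypothetical; they are not taken from Uber's system.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delivery:
    sent_ms: int             # when the server emitted the message
    acked_ms: Optional[int]  # when the client acknowledged it; None if it never arrived


def summarize(deliveries, target_ms=150):
    """Summarize delivery health for a batch of messages."""
    latencies = sorted(
        d.acked_ms - d.sent_ms for d in deliveries if d.acked_ms is not None
    )
    failed = sum(1 for d in deliveries if d.acked_ms is None)
    if latencies:
        # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
        rank = math.ceil(0.95 * len(latencies))
        p95 = latencies[rank - 1]
        within = sum(1 for latency in latencies if latency <= target_ms) / len(latencies)
    else:
        p95, within = None, 0.0
    return {
        "error_rate": failed / len(deliveries) if deliveries else 0.0,
        "p95_latency_ms": p95,
        "within_target": within,
    }


sample = [
    Delivery(0, 80),
    Delivery(10, 130),
    Delivery(20, 140),
    Delivery(30, 230),
    Delivery(40, None),  # never acknowledged, counted as a failure
]
summary = summarize(sample)
print(summary)
```

Tracking failures separately from latency matters here: a message that never arrives has no latency at all, so it would otherwise vanish from the latency numbers.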

Key Takeaways

  • Stay Up-to-Date: Always keep your project dependencies up to date. In Uber’s case, a key issue was the outdated WAMP library.
  • Prioritize Scalability and Reliability: Build systems with these from the ground up.
  • Keep Latency Low: Aim for less than 150 ms of latency for a smooth user experience.
  • Tech Matters: The right technological choices can drastically impact user experience and business costs.
  • Add to Your Portfolio: Building a real-time chat system can showcase your skills and knowledge. It is a good project to have in your portfolio.

I hope this article has given you some insights and inspiration for your projects or business solutions.
