ShareChat hit a billion features per second, then it had to make it 10x cheaper

“Great system…now please make it 10 times cheaper.”

That’s not exactly what the ShareChat team wanted to hear after completing a major engineering feat: scaling a real-time feature store 1000X without scaling their database (ScyllaDB). To scale from supporting 1 million features per second to 1 billion features per second, the team had already:

  • Redesigned their database schema to store all features together as protocol buffers and optimized tile configurations (reduced required rows from 2B to 73M/sec)
  • Switched their database compaction strategy from incremental to leveled (doubled database capacity)
  • Forked caching and gRPC libraries to eliminate mutex contention and connection bottlenecks
  • Applied object pooling and garbage collector tuning to reduce memory allocation overhead

You can read about those performance optimizations in “Scaling an ML Feature Store From 1M to 1B Features per Second.”

But ShareChat — an Indian leader in a globally competitive social media market — is always looking to optimize. After reaching this scalability milestone, the team received a follow-up challenge: reducing the feature store’s costs by 10X (without compromising performance, of course).

David Malinge, a Sr. Staff Software Engineer at ShareChat, and Ivan Burmistrov, then Principal Software Engineer at ShareChat, shared how they approached this new challenge in a keynote at Monster Scale Summit 2025. Watch the complete talk, or read the highlights below.

About Monster Scale Summit: It is a free, virtual conference on extreme-scale engineering, with a focus on data-intensive applications. Learn from leaders like antirez, creator of Redis; Camille Fournier, author of The Manager’s Path and Platform Engineering; Martin Kleppmann, author of Designing Data-Intensive Applications and more than 50 others, including engineers from Discord, Disney, Pinterest, Rivian, Datadog, LinkedIn, and UberEats. ShareChat will be delivering two talks this year. Register and join us March 11-12 for some lively chats.

Background

ShareChat is one of the largest Indian media networks, with more than 300 million monthly active users. The app’s popularity stems from its powerful recommendation system, which is powered by its feature store.

As Malinge explained at Monster Scale Summit, “[ShareChat processes] more than two billion events daily to compute our features and serve close to a billion features per second at peak, with P99 latency under 20 milliseconds. We also read more than 30 billion rows per day in ScyllaDB, so we’re very happy that we’re not charged by the transaction.”

Cloud cost optimization: Where do you even start?

When it came time to make this system 10 times cheaper, the team very quickly realized how complicated and confusing cloud cost optimization could be.

For example, Burmistrov noted these challenges during the Monster Scale Summit:

  • Cloud billing is cloudy. AWS and GCP each have around 40,000 different SKUs, which makes it really hard to figure out where your money’s actually going. And cloud providers aren’t exactly motivated to make this easier for you to understand and debug.
  • It’s easy to lose track of little things that add up. Maybe you forgot about an instance that’s still running. There may be an old deployment nobody uses anymore. Or perhaps you have storage buckets filled with data that should have expired by now (e.g., via Time to Live [TTL]). It all accumulates over time.
  • Your cost intuition from on-prem doesn’t necessarily translate to the cloud. Something that was cost-efficient in your own data center might be expensive in the cloud, and what works well on one cloud provider might not be cost-effective on another.
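On the TTL point, object expiry can often be enforced at the storage layer rather than remembered by humans. As an illustrative sketch (not ShareChat’s configuration), a GCS bucket lifecycle rule that deletes objects older than 90 days looks like this:

```json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"age": 90}
      }
    ]
  }
}
```

Applied with, for example, `gsutil lifecycle set lifecycle.json gs://my-bucket`, this removes one class of “forgotten data” costs without relying on anyone remembering to clean up.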

An inverted triangle diagram titled "Cost savings funnel" from the Monster Scale Summit, illustrating a three-tiered approach to reducing cloud expenses.

For the first optimization step, the team became sticklers for hygiene: clearing out anything they didn’t really need. This involved a deep cleaning for forgotten instances and deployments, unused buckets, data without proper TTLs, and overprovisioning. Having proper attribution was critical here.

As Burmistrov explained, “Every workload, every instance, and every resource must be attributed to a specific service and a specific team. Without attribution, cost optimization is a global problem. With attribution, it becomes a local problem.

Once each team can clearly see the costs associated with their own services, optimization becomes much easier. The team that owns the service has both the context and the incentive to improve its cost profile. They can see which components are expensive and decide what to do about them.”

Cloud trap 1: Getting sucked into unsustainable costs

Next, they shifted focus to challenges unique to running applications in the cloud. To get the required cost savings, they had to navigate around a few cloud traps.

The first trap was the allure and ease of using the cloud provider’s own products – particularly databases. “They’re great to start with, but at scale they become painful in terms of cost,” explained Burmistrov. Scaling can be particularly costly for solutions that use a metered pricing model (e.g., based on traffic).

“We used several GCP databases, including Bigtable and Spanner, but they became expensive and, more importantly, those costs were largely out of our control; we had little leverage over cost,” continued Burmistrov.

“So we migrated many workloads to ScyllaDB. Today, more than 30 ScyllaDB clusters are deployed across the organization, including for our feature store use case. Besides stability and low latency, ScyllaDB gives us strong cost control. We can choose instance types, merge use cases into a single cluster, and run at high utilization. ScyllaDB works well at 80%, 90%, even close to 100% utilization, which gives us a very strong cost profile.”

Cloud trap 2: Kubernetes wastage

Next, the team tackled Kubernetes wastage. In Kubernetes, you deploy containers into pods, but you pay for nodes, which are actual VMs. You typically choose node types upfront – for example, nodes with 4 CPUs and 16 GB of memory. However, over time, workloads change, teams change, and these nodes are no longer optimal. Pods cannot fully occupy the nodes due to CPU, memory, or scheduling constraints. At that point, you’re paying for resources that aren’t being used.
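The wastage described above is easy to quantify. Here is a toy Go sketch (the node shape and pod requests are made-up numbers, not ShareChat’s) that packs pods onto fixed-size nodes first-fit and reports the stranded capacity:

```go
package main

import "fmt"

type resources struct{ cpu, memGB float64 }

// firstFit packs pods onto as few fixed-size nodes as needed and
// returns the node count plus the total unused CPU and memory.
func firstFit(node resources, pods []resources) (nodes int, wasted resources) {
	var free []resources // remaining capacity per allocated node
	for _, p := range pods {
		placed := false
		for i := range free {
			if free[i].cpu >= p.cpu && free[i].memGB >= p.memGB {
				free[i].cpu -= p.cpu
				free[i].memGB -= p.memGB
				placed = true
				break
			}
		}
		if !placed { // no existing node fits: pay for a new one
			free = append(free, resources{node.cpu - p.cpu, node.memGB - p.memGB})
		}
	}
	for _, f := range free {
		wasted.cpu += f.cpu
		wasted.memGB += f.memGB
	}
	return len(free), wasted
}

func main() {
	node := resources{cpu: 4, memGB: 16}
	// Memory-hungry pods strand CPU: only one pod fits per node.
	pods := []resources{{1, 12}, {1, 12}, {1, 12}}
	n, w := firstFit(node, pods)
	fmt.Printf("nodes=%d wastedCPU=%.0f wastedMemGB=%.0f\n", n, w.cpu, w.memGB)
	// prints nodes=3 wastedCPU=9 wastedMemGB=12
}
```

Three pods requesting 3 CPUs total end up paying for 12 CPUs across three nodes – 75% of the CPU is stranded by the memory constraint, which is exactly the mismatch dynamic node allocation is meant to fix.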


“You may decide, ‘Okay, let’s isolate workloads into separate node pools,’” explained Burmistrov. “But it’s not a solution because, as shown in the image below, we have one node completely wasted.” Simple isolation just moves waste around.

This image illustrates the final part of the Kubernetes (k8s) wastage "cloud trap," specifically showing how isolating workloads can actually create more waste.

The solution: dynamic node allocation based on workloads. Options include open-source solutions like Karpenter, cloud-specific ones like GKE Autopilot, and commercial multi-cloud solutions like Cast AI.

A technical diagram titled "Solution: Advanced k8s scheduler" illustrating how dynamic node allocation solves the problem of cloud resource wastage.

ShareChat likes Cast AI because it lets them fine-tune the configuration for long-term commitments, spot instances, and other pricing factors.

Cloud trap 3: Network costs (“The cloud tax”)

And then there’s network costs, which the ShareChat team calls “the cloud tax.”

Massive apps like ShareChat deploy across multiple zones for high availability. However, multiple zones need to communicate with one another, and traffic between zones incurs its own cost: Network Inter-Zone Egress (NIZE). That’s the outbound network traffic between availability zones within the same region. It’s billed separately by many cloud providers.

For example, if you deploy ScyllaDB across three GCP zones, with two nodes per zone, and use a non-ScyllaDB driver (not recommended), reads may cross zones both at the client level and within the cluster. For traffic of volume T, you pay for T in NIZE.

A technical diagram titled "Naive reads: no settings in the client driver" explaining the "cloud tax" associated with multi-zone deployments.

ScyllaDB drivers help here. As Burmistrov explained, “Using ScyllaDB’s token-aware routing removes the extra hop inside the cluster, reducing inter-zone traffic. Using token-aware plus zone-aware routing ensures reads stay within the same zone, reducing inter-zone traffic to zero for reads.” Removing the extra hop cuts NIZE down to about 2/3 of T.

A technical diagram titled "Smartest reads: + zone-aware routing" demonstrating how to eliminate cloud network costs.

On the write path, it’s a little different because replication is required. Burmistrov continued, “With three zones, writes generate 2 times T of inter-zone traffic. You can trade availability for cost by using two zones instead of three, reducing this to 1.5 times T.

A technical diagram titled "Smartest+ writes: 2 zones" illustrating a cost-optimization strategy for database writes in the cloud.

ScyllaDB also lets teams model each zone as a separate data center. In that case, for traffic T, you pay for T of inter-zone traffic. You generally can’t go lower unless you deploy in a single zone.”

A technical diagram titled "Smartest++ writes: 2 datacenters" showcasing a strategic architecture to reduce cloud network egress fees.

Tip: Oracle Cloud and Azure don’t charge for inter-zone traffic. If you use one of these cloud providers, you can deploy across three zones with zero inter-zone cost.
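The per-strategy multipliers above can be collected in one place. A small Go sketch that reproduces the inter-zone traffic figures from the talk (the multipliers are the article’s; the function and label names are mine):

```go
package main

import "fmt"

// interZoneTraffic returns the expected inter-zone bytes generated by
// moving t bytes under each routing strategy discussed in the talk.
func interZoneTraffic(t float64) map[string]float64 {
	return map[string]float64{
		"naive read (no driver settings)": t,            // client hop + cluster hop
		"token-aware read":                t * 2.0 / 3,  // cluster hop removed
		"token+zone-aware read":           0,            // reads stay in-zone
		"write, 3 zones":                  t * 2,        // 2 of 3 replicas are remote
		"write, 2 zones":                  t * 1.5,      // availability traded for cost
	}
}

func main() {
	for strategy, bytes := range interZoneTraffic(1.0) {
		fmt.Printf("%-34s %.2f x T\n", strategy, bytes)
	}
}
```

Multiplying these factors by your per-GB inter-zone egress rate gives a quick estimate of what each routing change is worth before you touch any driver configuration.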

Database layer optimization

Another challenge: ShareChat’s generally stable read latencies would spike above their SLAs during write-heavy workloads (e.g., a backfill job).

A performance monitoring line graph titled "Feature Service Latency vs. Time" showing the impact of resource contention on system stability.

The usual option — just scale up the database — would be simple, but the team wanted to reduce costs. Isolating reads and writes into separate data centers could technically improve performance, but that would be even worse from a cost perspective. That approach would double the infrastructure and probably lead to underutilization.

Ultimately, they discovered and applied ScyllaDB’s workload prioritization. This lets them control how their various workloads compete for system resources.

A technical slide titled "#2 Different workloads" explaining the implementation of Workload Prioritization (WP).

Malinge explained, “Under the hood, this relies on ScyllaDB’s thread-per-core architecture. Each core runs an async scheduler with multiple queues, each with its own priority — workload prioritization maps directly to these priorities. Looking at the image, we have a very high compute share for serving because we have strong SLAs on the read path, which is user-facing.

Most of the rest goes to less latency-sensitive workloads from asynchronous jobs, such as writing features to the database. We also keep a small share for manual queries (for example, debugging). This enables us to have exactly what we want – different latencies for reads and writes – and that allowed us to have the best possible serving for our features.”
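In ScyllaDB, workload prioritization is configured through service levels, whose `shares` map onto the scheduler priorities Malinge describes. A hedged sketch of the CQL involved (the share values and role names are illustrative, not ShareChat’s actual configuration):

```sql
-- High priority for the user-facing read path (strong SLAs).
CREATE SERVICE LEVEL serving WITH shares = 1000;

-- Lower priority for asynchronous feature-writing jobs.
CREATE SERVICE LEVEL ingestion WITH shares = 200;

-- Small slice reserved for ad-hoc debugging queries.
CREATE SERVICE LEVEL manual WITH shares = 50;

ATTACH SERVICE LEVEL serving   TO serving_role;
ATTACH SERVICE LEVEL ingestion TO ingestion_role;
ATTACH SERVICE LEVEL manual    TO debug_role;
```

Each role’s queries then compete for CPU and I/O in proportion to its shares, which is how reads keep their latency SLA even while a backfill job hammers the write path.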

Feature serving layer optimization

That serving layer is a distributed gRPC (Google Remote Procedure Call) service that handles and caches consumer requests. The team made some cost optimizations here, too.

One of the most impactful was a clever shortcut: instead of fully deserializing complex Protobuf messages, they treated repeated messages as serialized byte blobs. Since Protobuf encodes embedded messages using the same wire format as bytes, they could simply append these serialized records to merge data.

With that, they could skip the heavy lift of fully unpacking and rebuilding the messages from scratch. That optimization alone could be the subject of an entire article; please see the video (starting at 18:15) for a detailed walkthrough of their approach.

Malinge’s top lesson learned from optimizing this layer:

“The first advice is the Captain Obvious one: don’t go blindly. We set up continuous profiling very early on. Nowadays, there are tons of tools available. Integration is super straightforward. You might recognize the vanilla GCP profiler in the image below. It’s just four lines of code to add to Go programs, and it’s really helped us on our quest to reduce costs. We’ve actually unlocked more than a 50% reduction in compute by doing optimizations guided by continuous profiling.”

A technical slide titled "Feature store costs: serving features" from the Monster Scale Summit, outlining a strategy for reducing compute expenses in a Kubernetes environment.
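For reference, the GCP profiler integration Malinge mentions really is only a few lines in Go. This is a wiring sketch rather than a runnable demo – the service name is a placeholder, and it requires the cloud.google.com/go/profiler module plus GCP credentials:

```go
package main

import (
	"log"

	"cloud.google.com/go/profiler"
)

func main() {
	// Start continuous CPU/heap profiling; samples stream to Cloud Profiler.
	if err := profiler.Start(profiler.Config{
		Service:        "feature-store", // placeholder service name
		ServiceVersion: "1.0.0",
	}); err != nil {
		log.Fatalf("profiler.Start: %v", err)
	}
	// ... serve traffic as usual ...
}
```

Once running, flame graphs in the Cloud Profiler UI show exactly which functions burn CPU in production, which is what guided the optimizations below.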

Another lesson learned: if you’re using protobuf at high RPS, serde is probably burning CPU. Even with ~95% cache hit rates, the hot path was still deserializing protos, merging them, and serializing them again just to stitch responses together. Most of that work was pointless.

The team ended up leveraging protobuf’s wire format, where repeated fields are just sequences of records and embedded messages can be treated as raw bytes (no deserialization). They switched to this “lazy” deserialization and merged cached protos without ever touching individual fields. For a practical example, see the david-sharechat/lazy-proto repository.

A technical flow diagram titled "Proto serde trick: back to cache" illustrating a highly efficient feature-serving architecture designed to minimize CPU usage during data serialization.

And one parting tip: the wins compound fast. Benchmarking showed lazy merging was about 6 times faster, with about a third of the allocations. In production, that meant lower compute bills and better tail latency.

Cost optimization actually fostered innovation

The team’s smart work here underscores that cost optimizations don’t have to compromise the product. They can actually act as a catalyst for better system design.

Malinge left us with this: “Our cost savings initiatives actually led to a lot of innovations. As I tried to represent in the left quadrant, there is a very unhealthy way to look at cost savings – one that focuses on shortcuts that negatively impact your product. The key is to stay on the right side of this quadrant, to aim for the sweet spot of reducing costs while improving the product.”

A four-quadrant matrix titled "Learnings" from the Monster Scale Summit, used to evaluate the long-term value of engineering initiatives based on Cost (y-axis) and Product Impact (x-axis).




Source: thenewstack.io…
