

🚗 Building blazingly fast pre-owned car platform with Valkey - Part 1


Introduction

This post covers how to leverage Valkey to serve web pages at blazingly fast speeds. It assumes some familiarity with backend technology.

What is Valkey? Valkey is an open-source, high-performance key/value datastore licensed under BSD. It supports a variety of workloads such as caching, message queues, and can function as a primary database. Valkey can operate as a standalone daemon or in a cluster with options for replication and high availability.
It originated as a fork of Redis after Redis changed its license terms. More details on the forking issue can be found at arstechnica.com.

What? #

Our platform for pre-owned cars allows buyers to access detailed reports, schedule test drives, and make purchase decisions. It combines a marketplace and inventory model.

Marketplace #

We provide a platform for buyers and sellers to communicate directly. Revenue is generated by facilitating buyer-seller interactions and charging for contact details.

Inventory #

We own the cars and provide test drives at various locations. Revenue comes from higher margins on refurbished cars.

User flow #

The typical user journey involves

  1. Viewing ads
  2. Generating a lead by entering a mobile number
  3. Booking a visit on the platform
  4. Test driving at a store
  5. Making a purchase decision

Our goal is to maximize lead generation by providing a fast and seamless user experience.

Why? #

  1. Dedicated Page - Complete details of a single car, featuring over 10 photos and additional information.
  2. Listing Page - A catalog of cars with basic information and about four photos that can be scrolled horizontally.

Amazon found that every 100ms of latency cost them 1% in sales. In 2006, Google found an extra 0.5 seconds in search page generation time dropped traffic by 20%. - Marissa Mayer

Specifications #

Existing design #

Frontend #

PWAs load quickly after the first visit, but the initial load can be slow, especially on low-end devices. However, once loaded, they offer smooth navigation and offline access. We aim to make pages load fast right from the first visit, even on low-end devices.

Backend #

The request-response lifecycle involves multiple steps, with significant time spent on model serialization and database (MySQL) calls.

Dedicated Page #

GET https://api.car.com/listing/14006824/

Fetches car data based on the ID in the URI and returns JSON.

Dynamic Entities

Listing Page #

1. GET https://api.car.com/listing/?city_id=1&model_id=123&owner=1
2. GET https://api.car.com/listing/?slug=used-tata-nexon-cars-in-mumbai
3. GET https://api.car.com/listing/?slug=used-tata-nexon-cars-in-mumbai&buyer_id=123
4. GET https://api.car.com/listing/?slug=used-tata-cars-in-mumbai&seen_car_ids=10,11,12

The data science model can use either buyer_id or seen_car_ids to order the list of cars from most to least relevant. If both are None, it falls back to a generic recommendation.

from typing import List, Optional

def get_recommendation(active_car_ids: List[int], buyer_id: Optional[int], seen_car_ids: Optional[List[int]]):
    return model.get_order(active_car_ids, buyer_id=buyer_id, seen_car_ids=seen_car_ids)

Filters #

Filter By - Car colour, Make, Model, Accessory, Rating, Ownership etc.

Filters are implemented with django-filter; complex queries with multiple joins can strain the database.
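
For illustration, a minimal django-filter setup might look like the sketch below (the Car model and field names are assumptions, not our actual schema):

# filters.py - illustrative sketch; the Car model and field names are assumptions
import django_filters
from cars.models import Car

class CarFilter(django_filters.FilterSet):
    color = django_filters.NumberFilter(field_name="color_id")
    model_id = django_filters.NumberFilter(field_name="variant__model_id")  # join through variant
    owner = django_filters.NumberFilter(field_name="owner")

    class Meta:
        model = Car
        fields = ["color", "model_id", "owner"]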

Pagination #

Returning all cars in Mumbai city in a single request would strain all components involved. For instance, if there are 1,000 cars:

  1. Database - Must fetch details for 1,000 cars.
  2. Backend Server - Converts the database response into JSON, causing high CPU and memory usage.
  3. Client - Generates HTML for 1,000 cars, and the browser must paint the screen.

To avoid these issues, we return only a subset of cars per request. Among cursor, limit-offset, and page-no pagination options, we use page-no pagination.

Page No Pagination #

Request Query Parameters

  1. page_size - The number of cars to display, adjustable based on the client (e.g., more cars for desktop users).
  2. page_no - The current page number, indicating the user’s position.

Response JSON Object

  1. count - The total number of pages available. If the requested page exceeds this count, the client shows “No more results.”
  2. next_page - URL to fetch the next page.
  3. prev_page - URL to fetch the previous page.
  4. results - A list of JSON objects containing basic car information.
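
For example, a request for the second page might return something like this (values are illustrative):

GET https://api.car.com/listing/?city_id=1&page_no=2&page_size=10

{
  "count": 14,
  "next_page": "https://api.car.com/listing/?city_id=1&page_no=3&page_size=10",
  "prev_page": "https://api.car.com/listing/?city_id=1&page_no=1&page_size=10",
  "results": [
    {"id": 2323, "model": "Tata Nexon", "price": 700000, "images": ["url1", "url2", "url3", "url4"]},
    {"id": 2324, "model": "Tata Tiago", "price": 450000, "images": ["url1", "url2", "url3", "url4"]}
  ]
}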

Car Ordering #

Clients can specify the order of cars using the order_by query parameter. The backend uses this to sort the results.

Order Options

  1. reco - Most relevant to least (default order, based on a data science model).
  2. newest - Recently added cars first.
  3. price_asc - Cars sorted by ascending price.
  4. price_desc - Cars sorted by descending price.
GET https://api.car.com/listing/?city_id=1&order_by=reco&page_no=2&page_size=10

For recommendations, the order of cars is generated with each request.
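
A rough sketch of how the backend might translate order_by into an ordering (field and function names are assumptions; the reco branch calls the data science model from earlier):

ORDERING = {
    "newest": "-created_at",
    "price_asc": "price",
    "price_desc": "-price",
}

def order_cars(cars_queryset, order_by, buyer_id=None, seen_car_ids=None):
    if order_by == "reco":
        active_car_ids = list(cars_queryset.values_list("id", flat=True))
        # The order comes from the data science model and is recomputed per request
        return get_recommendation(active_car_ids, buyer_id=buyer_id, seen_car_ids=seen_car_ids)
    return cars_queryset.order_by(ORDERING[order_by])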

Achieving the 60ms response time #

Optimize Frontend #

How Does Client-Side Rendering Work? #

Client-side rendering (CSR) involves the frontend receiving a JSON payload and using JavaScript to create the HTML that displays the data. Libraries like React handle this efficiently by checking whether the DOM tree needs updating and modifying only the parts that have changed. However, this process can be slow on low-end devices because React has to generate the HTML, perform a diff, and then update the DOM.

Why Not Send HTML Directly? #

Sending pre-rendered HTML, which browsers are optimized to handle, can be more efficient. This is where Next.js (as of 2019) becomes valuable. By running a Node server on the backend, Next.js can send HTML for the initial page load. After this, user interactions such as navigating between pages or viewing details fall back to client-side rendering. This hybrid approach offers a smooth, app-like experience.

Optimize Backend #

Our north star is to avoid duplicate work and to reduce runtime computation by storing pre-computed results.

Duplicate work #

  1. Creating the JSON for the same car on every request
  2. Computing the recommendation order for anonymous users on every request
  3. Computing the order of cars for each page for a given user or set of seen cars

Pre-compute #

  1. The dynamic part of the JSON that depends only on the car.
  2. Counters and booleans, stored to avoid DB calls.
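
As a sketch (the key names here are hypothetical), such counters and booleans can live in Valkey via the Django cache API:

from django.core.cache import cache

TTL = 12 * 60 * 60  # 12 hours

# Hypothetical pre-computed values, refreshed by background jobs
cache.set("car:2323:lead_count", 42, timeout=TTL)    # counter
cache.set("car:2323:is_booked", True, timeout=TTL)   # boolean flag

# Read them at request time instead of hitting MySQL
lead_count = cache.get("car:2323:lead_count", 0)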

With these points in mind, let’s optimize each page, starting with the dedicated page.

Optimize Dedicated Page #

Anonymous users

  1. Store pre-computed JSON response of car
  2. Fetch JSON by key and return response
key_pattern = "dp:{car_id}"
expiry = 12 hours
value = gzipped JSON payload
"dp:2323" : '{"id":2323, "images": ["url1", "url2", "url3"], "owner": "2nd", "inspection_report": ...}'

Compress the JSON payload with gzip before storing it, using the django-redis configuration below.

import os

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.getenv("CACHE_LOCATION"),
        "OPTIONS": {
            "PARSER_CLASS": "redis.connection.HiredisParser",
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # GzipCompressor transparently compresses values on write and decompresses on read
            "COMPRESSOR": "django_redis.compressors.gzip.GzipCompressor",
        },
    },
}
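
A minimal sketch of how the dedicated-page view might serve from this cache (build_car_json is a hypothetical helper that runs the MySQL queries and serializes to a JSON string; django-redis gzips and un-gzips values transparently):

from django.core.cache import cache
from django.http import HttpResponse

CACHE_TTL = 12 * 60 * 60  # 12 hours

def dedicated_page(request, car_id):
    key = f"dp:{car_id}"
    payload = cache.get(key)                 # decompressed transparently by django-redis
    if payload is None:
        payload = build_car_json(car_id)     # hypothetical helper: MySQL queries + serialization
        cache.set(key, payload, timeout=CACHE_TTL)
    return HttpResponse(payload, content_type="application/json")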

Logged in users

For logged-in users, we compute results at runtime, since the response contains many dynamic entities. Because most users come from marketing campaigns (and are anonymous), we wanted to optimize their experience first.

Are we done?

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

Not yet.

Cache Invalidation #

We must invalidate the cache whenever a change occurs in the database that impacts the JSON response. Otherwise, users may see stale results, leading to undefined behavior and customer dissatisfaction.

Let’s say that to compute the JSON response for a given car, we use information from the tables below:

  1. cars (id, variant_id, color_id, owner, status) - main car model
  2. car_images (car_id, image_url)
  3. car_features (car_id, feature_id)

How do we get notified each time a change occurs in the above tables?

We can utilise:

  1. The DB model abstraction provided by MVC frameworks - add logic in the model’s save method or use signals.
  2. An architectural pattern like Domain-Driven Design - add cache-invalidation logic at the domain layer.
  3. The way a database replica works - attach a program that acts like a replica and receives changes.

We went with the third approach: there were many sources that could update the database, and adding cache-invalidation logic to each of them would have been error prone and would have polluted other business logic.

We can listen to database changes on these three tables to invalidate the result stored in the cache against the given car_id.

In MySQL, we can listen to the binlog¹ to get notifications about all changes to the database, including both DDL (Data Definition Language) and DML (Data Manipulation Language) operations.

# Data Definition Language

- CREATE | ALTER | DROP TABLE

# Data Manipulation Language

- INSERT | UPDATE | DELETE | REPLACE

Using the mysql-data-stream-kafka Python library, we attach a program to listen to the MySQL change log and emit changes to Kafka topics. A Kafka topic functions similarly to a log.
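
I won’t reproduce the exact mysql-data-stream-kafka API here; the sketch below shows the same idea using the python-mysql-replication and kafka-python packages (connection details and the database name are assumptions):

import json
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v, default=str).encode())

stream = BinLogStreamReader(
    connection_settings={"host": "mysql", "port": 3306, "user": "repl", "passwd": "secret"},
    server_id=100,                        # must be unique, like any replica
    only_tables=["cars", "car_images", "car_features"],
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        message = {
            "op": type(event).__name__,          # WriteRowsEvent / UpdateRowsEvent / DeleteRowsEvent
            "before": row.get("before_values"),  # only present for updates
            "after": row.get("after_values", row.get("values")),
            "log_file": stream.log_file,         # binlog position, shipped with every message
            "log_pos": stream.log_pos,
        }
        producer.send(f"{event.schema}.{event.table}", message)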

What Is a Log? A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. - Jay Kreps, LinkedIn

# Topic Name

{database_name}.{table_name}

# Messages

op: insert | delete | update
before: JSON payload - column values before the change (key = column name, value = column value)
after: JSON payload - column values after the change
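
For instance, marking a car as sold might produce a message like this on the cars_db.cars topic (values are illustrative):

{
  "op": "update",
  "before": {"id": 2323, "variant_id": 12, "color_id": 3, "owner": 2, "status": "active"},
  "after":  {"id": 2323, "variant_id": 12, "color_id": 3, "owner": 2, "status": "sold"}
}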

A program listens to the Kafka topics that can affect the dedicated-page response and rebuilds or invalidates the cache.
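
A rough sketch of that consumer using kafka-python and the Django cache (topic and group names are assumptions; committing only after the cache work ties in with the at-least-once discussion below):

import json
from kafka import KafkaConsumer
from django.core.cache import cache

consumer = KafkaConsumer(
    "cars_db.cars", "cars_db.car_images", "cars_db.car_features",
    bootstrap_servers="kafka:9092",
    group_id="dp-cache-invalidator",
    enable_auto_commit=False,                    # commit manually, only after the cache is updated
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    change = message.value
    row = change.get("after") or change.get("before")
    car_id = row.get("car_id", row.get("id"))    # "id" on cars, "car_id" on the child tables
    cache.delete(f"dp:{car_id}")                 # invalidate; the next request (or a worker) rebuilds it
    consumer.commit()                            # at-least-once: commit after processing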

Infrastructure #

We introduced the components below.

  1. Consumer node - listens to changes and updates the cache
  2. Producer node - listens to the MySQL binlog and pushes changes to Kafka
  3. Valkey server - stores cache data in-memory
  4. Kafka cluster - stores the database change log
  5. MySQL database - existing component

Now that we have a distributed system, the fallacies of distributed computing apply.

  1. The network is reliable; - Wikipedia

Failure of Consumer Node #

We listen to a Kafka topic and commit only after processing changes to build or invalidate the cache, ensuring at-least-once processing.

Failure of Producer Node #

A worker reads from the MySQL change log and pushes changes into Kafka. If a network partition between MySQL, Kafka, and the producer node causes a push to fail, we avoid data loss because we commit the binlog position with each message to the Kafka topic. Upon failure, we resume reading the binlog from the last committed position.
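
A sketch of that resume logic with python-mysql-replication (MYSQL_SETTINGS and last_committed are placeholders; in practice the position is read back from the latest message on the topic):

# Resume the binlog reader from the last position committed to Kafka
stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,
    log_file=last_committed["log_file"],
    log_pos=last_committed["log_pos"],
    resume_stream=True,
)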

Failure of Valkey Server #

Using AWS-managed ElastiCache, or, for a self-hosted approach, running Valkey in cluster mode with auto-failover, ensures resilience. If the consumer node cannot invalidate a payload because the server is down, it will clear the backlog once the server is available again.

For 10,000 active cars, each with a ~20 KB payload, only about 200 MB of memory is needed. To prevent failures from excessive memory use, set maxmemory-policy to volatile-lru and put a TTL on keys. This manages memory by evicting the least recently used keys among those with a TTL.
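
For a self-hosted setup, the relevant Valkey settings might look like this (the sizing is an assumption; tune it to your own payloads):

# valkey.conf (sketch)
maxmemory 1gb                    # comfortable headroom over the ~200 MB estimate above
maxmemory-policy volatile-lru    # evict least-recently-used keys that have a TTL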

Failure of Kafka-cluster #

Kafka’s resilience comes from having at least 3 brokers and a topic replication factor of at least 2. Higher replication factors increase disk space and network bandwidth usage.

Failure of MySQL #

MySQL’s bin-log is disk-stored, so a server restart or connection failure won’t cause data loss if the bin-log isn’t cleared. The producer will resume from the last bin-log position.

To avoid disk space issues during a prolonged producer failure, set up alarms and ensure enough disk space to handle a few hours of bin-log growth.

In case of bin-log loss, a Django management command can rebuild the cache for all active cars.
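
A sketch of such a command (the Car model and build_car_json helper are hypothetical, mirroring the view sketch earlier):

# management/commands/rebuild_dp_cache.py - illustrative sketch
from django.core.cache import cache
from django.core.management.base import BaseCommand

from cars.models import Car  # hypothetical app/model

class Command(BaseCommand):
    help = "Rebuild the dedicated-page cache for all active cars"

    def handle(self, *args, **options):
        for car_id in Car.objects.filter(status="active").values_list("id", flat=True):
            payload = build_car_json(car_id)  # hypothetical helper: MySQL queries + serialization
            cache.set(f"dp:{car_id}", payload, timeout=12 * 60 * 60)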

Optimize Listing Page? #

Stay tuned for Part 2, where we will explore practical techniques for achieving pagination with Valkey, including strategies for using sorted sets, lists, and other data structures.

Just like a good joke, a cache should never be stale. Keep it fresh, keep it fun. Happy coding! 😄

Thank you, Vaibhav, for reading drafts and for your suggestions.


  1. Database systems utilize various methods to track and manage changes to data. Two common approaches are the binary log in MySQL and Write-Ahead Logging (WAL) in PostgreSQL.

    In MySQL, the binary log records a series of “events” that describe database changes, such as table creation or modifications to table data. It also logs statements that could potentially alter data, even if no actual changes occurred (for example, a DELETE statement that matched no rows). Additionally, the binary log includes information about how long each statement took that updated data.

    PostgreSQL employs Write-Ahead Logging (WAL) to ensure data integrity. The central concept of WAL is to log changes to data files (where tables and indexes reside) before applying those changes to the actual data files. This means that data pages do not need to be flushed to disk on every transaction commit. In the event of a crash, the database can be recovered using the WAL records, which allows for roll-forward recovery, also known as REDO.

    Both MySQL and PostgreSQL, along with other database platforms, offer mechanisms to read and capture changes. One such mechanism is Change Data Capture (CDC). CDC refers to the tracking of all changes in a data source—whether it’s a database, data warehouse, or other system—so that these changes can be captured and propagated to destination systems. This enables organizations to maintain data integrity and consistency across various systems and environments. ↩︎
