Reducing Under Delivery and Over Delivery of Ads. in Zomato

My first team at Zomato was the AdTech team. Zomato gives merchants the option to run ad campaigns on Zomato's app.

So, first things first: merchants here are restaurant owners, and chains are groups of merchants under the same name (e.g. Domino's, McDonald's).
These merchants want to promote their restaurants in various verticals:

  • Online Ordering
  • Dine Out
  • Search drop-down and results page

They buy the number of clicks and impressions they want within a given period of time, and this is called a campaign.

An impression is counted when a user merely sees an advertisement; a click is counted when the user actually clicks on it.

Ex: Dominos Connaught Place runs a campaign of 30k clicks and 100k impressions, and x other restaurants run similar campaigns. An amount y is charged to the merchant based on the number of impressions, clicks, the time period, the location and the time of the year. If the campaign isn't fulfilled, a proportional amount of money, let's call it z, is returned to the merchant.
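As a purely illustrative relation (the exact pricing formula isn't described here): if a campaign that was charged y delivers only d of its p promised clicks, the refund for that portion would be roughly z ≈ y × (1 − d / p).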

Now these impressions and clicks need to be positioned and shown to users in such a way that all these cumulative targets are achieved while minimising z (the amount that has to be returned to the merchant).

What’s the Problem that we were facing ?

The existing pipeline for reporting aggregated clicks and impressions per campaign per time window was a batch-processing one, using Apache Spark as the distributed processing system. The job ran every 15 minutes, but during peak load a single run could take anywhere from 1.5 to 2 hours.

This means the data being fed to the visibility and positioning algorithms is up to two hours stale, and for those hours a campaign stays live even though its clicks/impressions quota may already be fulfilled. For example: the campaign run by McDonald's was for 30k clicks. Say that by 7:00 p.m. it had 27k clicks; since the quota was yet to be fulfilled, it was still shown at a boosted position. The algorithm that decides the campaign's position got no fresh data until 9:00 p.m. and so could take no action, and in the meantime the campaign collected 35k clicks, 5k over the asked limit.

So some campaigns ended up with over-delivery of ads while others ended up with under-delivery, and for the under-delivered ads Zomato had to return money in proportion.

For every user action, an event with various user and action details is sent to a dedicated service:

{
  "events": [
    {
      "data": { "campaign_id": 9857323 },
      "event_type": "impression"
    }
  ],
  "device_info": { "sec-ch-ua-platform": "ios" },
  "source_request_id": "7fa67be4-f83a-429f-9d73-38b660c50825",
  "user_attributes": { "id": "mn3f6f67ff7tfdx6f" },
  "deleted_user_attributes": [],
  "user_identities": {},
  "application_info": {},
  "schema_version": 2,
  "environment": "production",
  "context": {},
  "mpid": 7346244611012968789,
  "ip": "172.217.12.142",
  ...
}

On a daily basis, ~100 to ~150 million such events stream through our pipeline. These events are published to a dedicated Kafka topic to keep the pipeline asynchronous.
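As a rough sketch (not the actual service code), publishing one of these events to the topic could look like the following in Java; the broker address, the topic name "ad-events" and the inline JSON payload are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AdEventPublisher {
    public static void main(String[] args) {
        // Hypothetical broker address and topic name, for illustration only
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Trimmed-down version of the event payload shown above
        String adEventJson = "{\"events\":[{\"data\":{\"campaign_id\":9857323},"
                + "\"event_type\":\"impression\"}],\"environment\":\"production\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by campaign_id keeps events of one campaign on one partition
            producer.send(new ProducerRecord<>("ad-events", "9857323", adEventJson));
        }
    }
}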

From there, a worker dumps the data into an Apache Hive table, and Presto is used on top of it as a distributed query engine for the OLAP load.

The aggregation logic was written in Apache Spark, and the algorithm for the computation goes like this:

  • The entire day is divided into 8 windows of 3 hours each, in sync with meal times
  • Every click/impression in a window is unique, i.e. irrespective of how many clicks a user makes in a particular window, it is counted as a single click
  • Clicks per campaign are stored as key-value pairs in two formats: clicks in a particular 3-hour window (12 am to 3 am, 3 am to 6 am, …) and clicks for the day so far (see the key-building sketch after this list), ex:
    campaign_1553454_3_030422 = 798 (campaign 1553454 has got 798 clicks so far in the 3rd window (6 am to 9 am) of 3rd April 2022), campaign_1553454_030422 = 9670 (campaign 1553454 has got 9670 clicks so far on 3rd April 2022)
  • This data is published as key-value pairs to Redis; the algorithm deciding the status, position and boost intervals of a campaign consumes the data from here
  • As soon as the promised quota is fulfilled, the campaign is moved to the back of the rail to give other campaigns visibility
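Here is a minimal sketch of how such Redis keys could be built and written, assuming the Jedis client and a single-increment write purely for illustration (the real job publishes aggregated, de-duplicated counts):

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import redis.clients.jedis.Jedis;

public class CampaignCounterSketch {

    // Builds the per-window key, e.g. campaign_1553454_3_030422
    static String windowKey(long campaignId, LocalDateTime eventTime) {
        int window = eventTime.getHour() / 3 + 1; // 1..8, one slot per 3 hours starting at midnight
        String day = eventTime.format(DateTimeFormatter.ofPattern("ddMMyy"));
        return "campaign_" + campaignId + "_" + window + "_" + day;
    }

    // Builds the per-day key, e.g. campaign_1553454_030422
    static String dayKey(long campaignId, LocalDateTime eventTime) {
        String day = eventTime.format(DateTimeFormatter.ofPattern("ddMMyy"));
        return "campaign_" + campaignId + "_" + day;
    }

    public static void main(String[] args) {
        // Hypothetical Redis host; real counts come out of the streaming job
        try (Jedis redis = new Jedis("localhost", 6379)) {
            LocalDateTime click = LocalDateTime.of(2022, 4, 3, 7, 45);
            redis.incr(windowKey(1553454L, click)); // campaign_1553454_3_030422
            redis.incr(dayKey(1553454L, click));    // campaign_1553454_030422
        }
    }
}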

How was the event drop rate brought down to 0?

Consider this scenario: a user sees a rail of multiple ads, clicks a few, then closes the app and switches off the network. Say this happens at 4:37 p.m., and the user opens the app again at 7:02 a.m. the next day. The app sends the event to the server only then, but since the actual click happened the previous day, it should have been counted in that earlier window. In this case the events were being dropped.

There are three different notions of time in streaming programs:

  • Processing time: the system time of the machine that is executing the respective operation
  • Event time: the time at which each individual event occurred on its producing device
  • Ingestion time: the time at which events enter Flink

So what we did was perform all aggregations on event time, in contrast to processing time as was done in the Apache Spark job.
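A hedged sketch of what event-time windowing looks like with Flink's DataStream API; the AdEvent type, the 30-minute lateness bound and the count-style aggregation are illustrative assumptions, not the production job:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeAggregationSketch {

    // Simplified event with only the fields this sketch needs
    public static class AdEvent {
        public long campaignId;
        public long eventTimeMillis; // when the click/impression happened on the device
        public long count = 1;       // each raw event contributes one click/impression
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real pipeline this stream is read from the Kafka topic;
        // fromElements is used only to keep the sketch self-contained
        DataStream<AdEvent> events = env.fromElements(new AdEvent());

        events
            // Aggregate on the device-side timestamp (event time), tolerating late arrivals
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<AdEvent>forBoundedOutOfOrderness(Duration.ofMinutes(30))
                    .withTimestampAssigner((event, ts) -> event.eventTimeMillis))
            .keyBy(event -> event.campaignId)
            // 3-hour tumbling windows matching the meal-time slots described above
            .window(TumblingEventTimeWindows.of(Time.hours(3)))
            .sum("count")
            .print();

        env.execute("event-time aggregation sketch");
    }
}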

How was the pipeline made entirely fault tolerant?

One word: checkpointing.

The fault-tolerance mechanism consistently recovers the state of the data-streaming application. It ensures that even in the presence of failures, the program's state will eventually reflect every record from the data stream exactly once. To do this, it continuously draws snapshots of the distributed streaming dataflow.

In case of a program failure (due to machine-, network-, or software failure), Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot.

We used RocksDB as the state backend for checkpointing; RocksDB is an embedded key-value store with a local instance in each task manager.
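For reference, enabling checkpointing with the RocksDB state backend in Flink looks roughly like this; the interval, the exactly-once mode and the checkpoint storage path are assumptions rather than our production values:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the streaming state every 60 seconds with exactly-once guarantees
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Keep operator state in RocksDB, local to each task manager
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Durable location for the snapshots; the path is a placeholder
        env.getCheckpointConfig().setCheckpointStorage("s3://example-bucket/flink-checkpoints");
    }
}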