Design Proximity Service
Designing a scalable system for finding nearby shops
A proximity service finds the K nearest objects relative to a target location. It is useful for finding places or people near you, or within a certain radius.
Many companies have built a business model around this problem:
1. Yelp: finds nearby businesses (hotels, restaurants, etc.)
2. Google Maps "Near me": finds nearby places/businesses
3. Tinder: locates nearby people for dating
Now that you have an overview of what a proximity service is, let's look at it through a technical lens and understand the core problem it solves.
Problem
The problem solved by a proximity service is threefold:
- The service has to manage the data of registered businesses/people. Management includes the ability to add, fetch, update and delete business details.
- It has to serve millions of requests for nearby places/people with low latency and accurate information.
- It should support additional features like filtering by radius, type of business, etc.
Assumptions
- We will design the service for searching nearby businesses.
- Newly registered businesses will be reflected the next day.
- There are 100 million active users daily.
- Total business count is 200 million.
- Each user searches 5 times a day on average.
- The service receives 1,000 business add/update calls per day.
- Each read request returns at most 50 businesses due to pagination.
- Each business record is approx. 20 KB.
Let's break this problem into functional and non-functional requirements.
Functional Requirements
- Return nearby businesses within a defined radius (min: 500 meters, max: 20 km).
- Add/update/delete a business.
- View details of a business.
Non-Functional Requirements
- Responses should have low latency.
- The service should be scalable to peak traffic.
- Data security for both the users querying the system and the businesses.
Scale Estimation
Since we have only 1,000 write requests per day versus millions of read requests per day, we can ignore writes per second and focus on reads per second.
This also tells us that our system is read-heavy.
Let's calculate the read requests per second: we have 100 million daily active users and each user searches 5 times a day.
DAU = 100 million = 10^8
Total requests per day = 10^8 x 5 = 5 x 10^8
Total requests per second (RPS) = 5 x 10^8 / 10^5 ≈ 5000 (a day has roughly 10^5 seconds)
So our read RPS (throughput) is about 5,000. Considering future scale over 5 years with 20% growth per year (5000 x 1.2^5 ≈ 12,400), we have to build our system to scale to well over 10k RPS.
Let’s calculate the persistent storage for the next 5 years.
Total business = 200 million = 2 x 10^8
with each business taking 20kb space
Total space needed for 200 million business = 20 X 10^3 X 2 X 10^8
= 4 X 10^12 = 4 TB
for the next 5 years with 20% growth = approx. 10 TB
So a single database server would need about 10 TB of storage to last 5 years at 20% yearly growth.
In case of replication, we multiply this number by the replication factor.
For example, with 3 replica servers and 1 master server we need 4 storage servers in total, and thus about 40 TB of storage.
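As a quick sanity check, here is a small back-of-the-envelope script that reproduces the numbers above (values and rounding follow the assumptions stated earlier; a day is rounded to 10^5 seconds).

# Back-of-the-envelope check of the estimates above (illustrative only).
DAU = 100_000_000                 # daily active users
SEARCHES_PER_USER = 5
SECONDS_PER_DAY = 100_000         # ~86,400, rounded for easier math

read_rps = DAU * SEARCHES_PER_USER / SECONDS_PER_DAY    # 5,000
read_rps_5y = read_rps * 1.2 ** 5                       # ~12,400 with 20% yearly growth

BUSINESSES = 200_000_000
BYTES_PER_BUSINESS = 20_000       # ~20 KB per business record
storage_tb = BUSINESSES * BYTES_PER_BUSINESS / 1e12     # 4 TB
storage_tb_5y = storage_tb * 1.2 ** 5                   # ~10 TB
REPLICATION_FACTOR = 4            # 1 master + 3 replicas
total_storage_tb = storage_tb_5y * REPLICATION_FACTOR   # ~40 TB

print(round(read_rps), round(read_rps_5y), round(storage_tb_5y, 1), round(total_storage_tb))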
Data Modelling
As we saw in the capacity estimation, the system is read-heavy with very few write requests per second. For read-heavy systems, relational databases like MySQL/PostgreSQL are a good fit.
In our database schema, we will maintain two tables:
Business Table
The Business table contains information about each business, as sketched below.
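The original diagram is not reproduced here; the sketch below shows one plausible shape for a row of this table (the field names are assumptions).

# One plausible shape for a Business row (field names are assumptions).
from dataclasses import dataclass

@dataclass
class Business:
    business_id: int      # primary key
    name: str
    latitude: float
    longitude: float
    address: str
    city: str
    country: str
    business_type: str    # e.g. restaurant, hotel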
Geo Index Table
The Geo Index table contains the geohash of each business and is used to query nearby businesses. It is discussed in more detail in the algorithms section.
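Again as a sketch (the exact columns are an assumption), each row maps a geohash cell to a business located inside it, with an index on the geohash column.

# One plausible shape for a Geo Index row (columns are assumptions).
from dataclasses import dataclass

@dataclass
class GeoIndexEntry:
    geohash: str          # geohash of the business location (precision 4-6)
    business_id: int      # foreign key into the Business table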
API Design
We will use REST APIs for communication and design a simplified version.
Search API
This is used to query nearby businesses. It is a GET call with query parameters for the user's location, radius, page number and business type (for additional filtering).
GET /api/v1/search?latitude=&longitude=&radius=&page=&type=
// API Response
{
  "total": 1000,
  "data": [{business_object}]
}
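A hypothetical client call against this endpoint (the host and parameter values are made up) could look like this:

# Hypothetical search call; host and parameter values are made up.
import requests

resp = requests.get(
    "https://api.example.com/api/v1/search",
    params={"latitude": 37.422, "longitude": -122.084,
            "radius": 500, "page": 1, "type": "restaurant"},
)
body = resp.json()
print(body["total"], len(body["data"]))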
Business API
These APIs manage the business data. There are 4 endpoints performing the required CRUD operations.
GET /api/v1/business/<:id>
POST /api/v1/business
PUT /api/v1/business/<:id>
DELETE /api/v1/business/<:id>
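For example, registering a business could look like the sketch below (the payload fields follow the assumed Business table shape above; the host is made up).

# Hypothetical call to register a business; fields follow the assumed schema.
import requests

payload = {
    "name": "Joe's Diner",
    "latitude": 37.422,
    "longitude": -122.084,
    "address": "123 Main St",
    "city": "Mountain View",
    "country": "USA",
    "business_type": "restaurant",
}
resp = requests.post("https://api.example.com/api/v1/business", json=payload)
print(resp.status_code)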
Algorithms
Let’s discuss various algorithms we can use to solve this problem with efficiency.
2-d Search
The simplest way is to directly query the SQL database with the user's latitude and longitude and return the results within the radius.
SELECT business_id, latitude, longitude
FROM businesses
WHERE (latitude BETWEEN {user_lat} - radius AND {user_lat} + radius)
  AND (longitude BETWEEN {user_long} - radius AND {user_long} + radius)
/* user_lat and user_long come from the API request;
   radius here must be expressed in degrees of latitude/longitude */
This query is not efficient because it scans a table with around 200 million rows. Even if we index the latitude and longitude columns, a regular index can efficiently narrow the result by only one column at a time.
That means we effectively have to run two lookups, one on latitude and one on longitude, and then intersect their results. This approach is simple but not very efficient on a large dataset.
Evenly Divided Grid
One approach is to divide the world into evenly sized grids, so that every business falls into exactly one grid cell. Each grid is assigned a hash id, which helps in building indexes for fast searching.
This solves the problem to some extent, but there are major issues with the approach: some grids will contain a huge number of businesses (big cities) while others will have few to zero businesses (deserts, forests, etc.).
We still have to manage all the mostly empty grid data, and response time will vary from grid to grid.
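To make the idea concrete, here is a tiny sketch of how a fixed-size grid id could be derived from a coordinate; the 0.1 degree cell size is an assumption for illustration.

# Fixed-size grid sketch: a cell id derived from (lat, lon).
CELL_SIZE_DEG = 0.1   # assumed cell size (~11 km of latitude)

def grid_id(lat, lon):
    # All points inside the same cell share the same (row, column) id.
    return (int(lat // CELL_SIZE_DEG), int(lon // CELL_SIZE_DEG))

print(grid_id(37.422, -122.084))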
Geo Hash
Geohashing divides the world into grids in a more controlled way and tackles the problems of the evenly divided grid.
It first divides the world into 4 quadrants, each represented by a 2-bit id: 00, 01, 10 and 11.
Each quadrant can then be divided into 4 further quadrants for more accuracy.
This way, every quadrant gets a binary id; the longer the id, the smaller (and more precise) the grid cell.
The geohash of the Google headquarters is:
01001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)
We are only interested in geohash lengths between 4 and 6, because more than 6 characters means the grid cell is too small and fewer than 4 means it is too big.
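Here is a minimal, self-contained geohash encoder as a sketch (a production system would normally use an existing geohash library). It interleaves longitude and latitude bits and maps every 5 bits to a base32 character, reproducing the Google HQ example above.

# Minimal geohash encoder (sketch): interleave lon/lat bits, emit base32 chars.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, hash_chars = 0, 0, []
    even_bit = True                      # even bits encode longitude

    while len(hash_chars) < precision:
        rng, value = (lon_range, lon) if even_bit else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:                 # point is in the upper half: bit = 1
            bits |= 1
            rng[0] = mid
        else:                            # point is in the lower half: bit = 0
            rng[1] = mid
        even_bit = not even_bit
        bit_count += 1
        if bit_count == 5:               # every 5 bits -> one base32 character
            hash_chars.append(BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(hash_chars)

# Google HQ (approx. 37.422, -122.084) -> "9q9hvu" at precision 6
print(geohash_encode(37.422, -122.084, precision=6))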
This approach works well and is used by Bing Maps, Lyft, Redis and Elasticsearch.
Despite being a good approach, it has some issues: boundary problems (two nearby points can fall into different cells) and cells that do not contain enough businesses.
Quad Tree
Another good approach is to use a tree-like data structure. Here too we divide the 2D space into quadrants, recursively, until each quadrant meets a criterion.
A leaf node is split into further leaf nodes if it holds more businesses than allowed. Once all leaf nodes meet the requirement, no leaf/quadrant contains more than a certain number of businesses.
This tackles the issue of quadrants having too few or too many businesses.
This approach is used by Yext and Elasticsearch.
The only problem with this approach is that we have to build the tree and keep it in memory in order to access the node with the required businesses, and with every business add/update/delete the in-memory tree has to be updated. A simplified sketch follows.
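The sketch below illustrates the core idea: each leaf holds at most a fixed number of businesses, and an overflowing leaf is split into four child quadrants. The threshold value and field names are assumptions for illustration.

# Simplified in-memory quadtree (sketch).
MAX_PER_NODE = 100    # assumed threshold; real systems tune this

class QuadTreeNode:
    def __init__(self, min_lat, min_lon, max_lat, max_lon):
        self.bounds = (min_lat, min_lon, max_lat, max_lon)
        self.businesses = []     # (lat, lon, business_id) tuples
        self.children = None     # four QuadTreeNode objects after a split

    def insert(self, lat, lon, business_id):
        if self.children is not None:
            self._child_for(lat, lon).insert(lat, lon, business_id)
            return
        self.businesses.append((lat, lon, business_id))
        if len(self.businesses) > MAX_PER_NODE:
            self._split()

    def _split(self):
        min_lat, min_lon, max_lat, max_lon = self.bounds
        mid_lat, mid_lon = (min_lat + max_lat) / 2, (min_lon + max_lon) / 2
        self.children = [
            QuadTreeNode(min_lat, min_lon, mid_lat, mid_lon),  # SW
            QuadTreeNode(min_lat, mid_lon, mid_lat, max_lon),  # SE
            QuadTreeNode(mid_lat, min_lon, max_lat, mid_lon),  # NW
            QuadTreeNode(mid_lat, mid_lon, max_lat, max_lon),  # NE
        ]
        for lat, lon, business_id in self.businesses:
            self._child_for(lat, lon).insert(lat, lon, business_id)
        self.businesses = []

    def _child_for(self, lat, lon):
        min_lat, min_lon, max_lat, max_lon = self.bounds
        mid_lat, mid_lon = (min_lat + max_lat) / 2, (min_lon + max_lon) / 2
        index = (2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)
        return self.children[index]

# Build a world-level tree and insert one business (hypothetical id).
root = QuadTreeNode(-90.0, -180.0, 90.0, 180.0)
root.insert(37.422, -122.084, business_id=42)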
Google S2
Similar to the quad tree, it also stores the data in memory. It maps the sphere onto a 1D index based on a Hilbert curve. The concept of a Hilbert curve is a little complex and is not covered in this article, but it is worth reading about separately.
Google S2 is great for geofencing, as it can cover arbitrary areas with cells of varying sizes.
Google Maps and Tinder use Google S2.
High Level Design
Searching nearby businesses
- The client trying to find restaurants within 500 m sends a request that includes latitude, longitude and radius.
- The load balancer forwards the request to the NearBy Search (NBS) service.
- Based on the user's location and the requested radius, the NBS service generates the geohash to match against the precomputed geohashes.
- It then also adds the neighbouring geohashes to the list of geohashes, to make sure the user gets enough restaurants to choose from.
list_of_geohashes = [user_geohash, neighbour_1_geohash, neighbour_2_geohash, neighbour_3_geohash, …]
- For each geohash in list_of_geohashes, NBS calls the `Geohash` Redis server to fetch the business ids. These calls can be sent in parallel to reduce latency.
- Using the business ids received from the Redis servers, NBS calls the `Business Info` Redis server to fetch the business metadata. The results are then ranked by distance from the user, as sketched after this list.
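Below is a rough sketch of that read path. The Redis key layout (`geohash:{hash}` holding a set of business ids, `business:{id}` holding a JSON blob of business details) is an assumption for illustration, not the article's actual implementation; the list of geohash cells is assumed to be computed upstream.

# Sketch of the NBS read path. Assumed key layout:
#   geohash:{hash}  -> Redis set of business ids
#   business:{id}   -> JSON string with the business details
import json
import math
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6_371_000 * 2 * math.asin(math.sqrt(a))

def search_nearby(user_lat, user_lon, radius_m, list_of_geohashes):
    # 1. Collect candidate business ids from the user's cell and its neighbours.
    business_ids = set()
    for cell in list_of_geohashes:              # could be pipelined / parallelised
        business_ids |= r.smembers(f"geohash:{cell}")

    # 2. Fetch metadata for each candidate and filter by exact distance.
    results = []
    for business_id in business_ids:
        raw = r.get(f"business:{business_id}")  # cache-miss handling omitted here
        if raw is None:
            continue
        info = json.loads(raw)
        dist = haversine_m(user_lat, user_lon, info["latitude"], info["longitude"])
        if dist <= radius_m:
            results.append((dist, info))

    # 3. Rank by distance from the user and paginate (max 50 per response).
    results.sort(key=lambda pair: pair[0])
    return [info for _, info in results[:50]]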
View/Add/Update/Delete Business
- Whether a business owner tries to add/update/delete a business or a user tries to view the details of a business, the call lands on the load balancer and is forwarded to the Business service.
- If the API call is for viewing business details, the service queries the `Business Info` Redis server to fetch the data from the caching layer.
- On a cache miss, the data is fetched from the relational database and stored in the Redis server before the response is returned (a cache-aside read; see the sketch after this list).
- All other API calls (add/update/delete) are handled by the primary database, which then propagates the changes to the replica servers.
- Since any update to the business info can take up to a day to be reflected, we can run cron jobs at night (when traffic is low) to sync the data to the Redis servers.
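A minimal cache-aside sketch for the view path, assuming the same `business:{id}` JSON layout as above and a hypothetical `fetch_business_from_db()` helper that reads the relational database:

# Cache-aside read for business details (sketch; fetch_business_from_db is hypothetical).
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 60 * 60    # business data may be up to a day stale

def get_business(business_id):
    key = f"business:{business_id}"
    cached = r.get(key)
    if cached is not None:                           # cache hit
        return json.loads(cached)
    record = fetch_business_from_db(business_id)     # hypothetical DB read on a miss
    r.set(key, json.dumps(record), ex=CACHE_TTL_SECONDS)
    return record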
Additional Points
- If we are using cloud services like AWS, we can use regions and availability zones to our advantage. This also helps with complying with government policies in different regions; for example, Europe has different data privacy policies than India.
- We can provide various filters so that the user can select only certain types of businesses from the nearby search results.
Conclusion
There are still many points we haven't discussed in this article; perhaps I will cover them in future articles.
For this article, I have taken inspiration from Alex Xu & Sahn Lam's book `System Design Interview Vol. 2`.
If you found this Medium story helpful, kindly let me know in the comments. I would love to hear your opinion.