There are several business that are built on top of this concept of proximity.
- Finding places like Google places, Facebook places, Yelp places, Swiggy, Grocery E-Commerce Shopping
- Finding people like Tinder, Uber
The core pattern that follows in all these business areas is getNearbyX() where X can be replaced by corresponding core business entity. (restaurants in Swiggy, people in Tinder, drivers in Uber, places in other scenarios)
In this article we try to start at the fundamentals we should be knowing to approach such pattern of systems. Note that based on requirements and business needs the solution and our design choices can vary a lot.
In the article we can start with one requirement and can build upon in different dimensions on how we go about further.
- getNearbyPlaces(lat, long):
- getNearbyPlaces(lat, long, filters): the filters can be in the form like search filter, particular category of places etc. This we can cover in next article adding here to have a thought on how these filters can play certain significant role in query patterns that are important in particular business.
Let’s first establish the definition of nearby: There are three possibilities i can think of. Feel free to leave a comment on any other definitions.
- A nearby place is a place that is within a fixed distance of ‘X’ from either latitude or longitude.
- A nearby place is a place that is within variable distance within range ‘[X,Y]’ based on the provided input(lat, long)
- A nearby place is something that the client determines when querying for places. That translates to the requirement getNearby(lat, long, nearbyDistance)
Let’s proceed with the first definition.
At this stage, we can think of high level data flow as below. Just to get a sense of high level data flow and components involved.
Let’s put some numbers to have some direction of solving the problem right: 500M places, 100K read requests/sec, latency: < 100ms
How can we store the information in database?
Option 1: Mysql table with three columns
Option 1.1: To serve one read request we can loop through all the places in the table that won’t meet the requirement of serving 100k requests/sec. Even to serve few requests the latency will shoot up more than 100ms. This is equivalent to O(P) search. where P refers to number of places. Let’s address the common pattern of this classification of problems. ‘The problem of reducing search space’
Option1.2: Index on latitude and longitude so that the work translates to getNearby(lat, long) => intersectionOf(getPlacesInRange(lat-X, lat+X), getPlacesInRange(long-X, long+X)).Keeping a place id will be performant for our intersection query. Better than previous choice but still doesn’t meet our requirement. This is equivalent to O(LogP) = M(places matching latitude), O(LogP) = N(places matching longitude): max(M,N). Lets further continue on ‘The problem of reducing search space’
Option 2: Since we are owning the system we can take certain freedom in defining how data/places should be stored while entering into the system. This is a way we group places into so called grids. For the problem of reducing the search space, this is a general pattern to introduce another layer: similar to BST, OS page level hierarchy are of similar patterns.
For better visualisation refer to below diagram on how each grid is represented. A grid can be identified by 4 coordinates topLeftX, topLeftY, bottomRightX, bottomRightY, and you can see places P1-P4 are present inside this grid. The size of the grid is fixed which we define internally and is static.
Assume that each grid is of area 10 sq.km. Total Earth surface area ~500 million sq.km we are with 50 million squares.
So now the algorithm translates to
- getGrid(lat, long): mainGrid in time of O(numOfGrids) or we can add indexes on coordinates as well to improve a bit of performance.
- Get neighbouring grids from mainGrid in O(1): One neighbouring grid each in one of the eight directions. Leaving to think about the this choice. Is the premise of “One” neighbouring grid always right? Feel free to leave a comment when this assumption can go wrong.
- Get places in each of these grids with a filter of latitude and longitude withIn distance of X. in O(NumOfPlacesInGrid)
In the next article, We will further continue our designing in other dimensions.