Unacademy’s core features and functionality were part of a monolith from the start. Over time, we realised the need for a microservices architecture and started bifurcating our core domains. One of these domains is the order management service (OMS), which includes modules like orders, subscriptions, payments, referrals, invoices and offers. OMS can be broken down further into separate microservices in the future, based on need. In this blog, we will go through the process of separating OMS from the monolith. You will see various references to Django, as our monolith is written in Django.
Why was a new microservice for OMS necessary?
- Improving availability: OMS functionality was affected by downtime in the monolith. The goal was to improve the availability of OMS functionality from the monolith’s 99.95% to 99.995%.
- Removing deployment coupling: It wasn’t possible to develop and deploy OMS changes independently, as a release branch might contain changes from other teams working on the monolith.
- Removing SPOF and reducing blast radius: Blast radius is the area of potential impact an issue or outage can cause. Whenever the monolith went down, e.g. due to database issues, various core functionalities including OMS were impacted. This high blast radius made the monolith a single point of failure (SPOF).
- Scalability: OMS had different read/write patterns and scaling needs compared to the rest of the monolith.
- Faster development: A separate OMS service simplifies updates while minimising the impact on other features in the monolith codebase.
- Shared infrastructure: The DB, cache, queues and other infra were shared, which caused a noisy neighbour problem with respect to performance: one functionality of the system could place enough load on shared resources to adversely affect, and ultimately degrade the experience of, other features.
- Heavy intercoupling due to thousands of interleaved usages: OMS modules were tightly coupled with other monolith feature modules like test series, batches etc. One reason for this heavy intercoupling was that both pre-purchase and post-purchase experiences needed to know a user’s subscription status to handle feature discovery, user experience and authorisation. To avoid being bogged down by lakhs (hundreds of thousands) of lines of code, we made changes that still met our goals while minimising code changes.
- Huge testing scope: Most stages required changes in all core flows, which was challenging because our test suite was still evolving at the time.
- Achieving this with minimal or no downtime: This was required because most OMS flows affect revenue, along with several critical P0 features like the accessibility of live classes.
- Huge codebase: The monolith codebase has hundreds of files and over a lakh (100,000) lines of code. To remove the dependency of OMS on other monolith features, we also had to go through the large codebases of those features, understand them and coordinate with various teams.
- Lack of separation of concerns in existing code: Separation between the database layer and business logic was lacking in the monolith, resulting in Django ORM querysets of non-OMS models being used directly in business logic. Thus changes in database layer code impacted business logic code.
- Daily deployments: In a fast-paced startup like Unacademy, projects develop quickly. We therefore had to separate the microservice while a significant amount of new code was being pushed daily, and ensure that developers followed our guidelines for separation of concerns between OMS and non-OMS code in new pushes.
Monolithic architecture (Simplified version)
New architecture (Simplified version)
The API gateway pattern is suitable for designing and developing large microservices-based applications with multiple client applications. This matches our use case, as OMS is client facing and there is a plan to introduce other client-facing microservices in the future.
We put OMS behind our self-hosted API gateway, which handles user identification, authorisation and routing of client requests to backend services. In the future, it will also provide generic technical concerns like rate limiting and caching.
We also explored the BFF pattern, but for our scenario we didn’t need it, as checkout flows were mostly driven by OMS alone and didn’t involve data aggregation across multiple services.
OMS to other services interaction
Sync: Done via REST API calls over HTTP(S). We will explore gRPC in the future.
Async: Async communication is done via SQS and the publish-subscribe pattern (using SNS + SQS/Lambda/HTTPS).
The publish-subscribe pattern is used where there are multiple receivers and messages must be broadcast. For example, multiple services need to be made aware when a user purchases a subscription.
Plain SQS is used for point-to-point cases where OMS needs to send an event to a single service and broadcasting isn’t required. For example, if a user wins a scholarship, only the monolith needs to know, so it can perform a few actions like emailing the user the details. No other service needs to know about this.
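As a rough sketch of this point-to-point flow, the event OMS emits could look like the snippet below. The event name, field names and queue URL are illustrative assumptions, not our actual schema; the real send would go through the SQS client (e.g. boto3’s `send_message`):

```python
import json
from datetime import datetime, timezone

def build_scholarship_event(user_id: int, scholarship_id: int) -> str:
    """Serialise a point-to-point event destined for the monolith's SQS queue.

    Event type and payload fields are hypothetical, for illustration only.
    """
    event = {
        "event_type": "SCHOLARSHIP_WON",
        "payload": {"user_id": user_id, "scholarship_id": scholarship_id},
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# The actual send would be roughly:
#   boto3.client("sqs").send_message(QueueUrl=MONOLITH_QUEUE_URL,
#                                    MessageBody=build_scholarship_event(42, 7))
```

For the broadcast case, the same envelope would instead be published to an SNS topic that fans out to each subscribed service’s queue.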
Handling scheduled and inter-coupled jobs
Communication between OMS and the monolith for the above scenario is done asynchronously through events. We applied orchestration or choreography patterns based on the use case. More details about these use cases will be shared in the next part of this blog series.
You can read more about the above patterns here.
There are some core entities used by all services, e.g. user details. We wanted to streamline fetching such entities, so a custom package called “Core Pkg” was created in different programming languages. This decoupled the transport layer from the entities.
It will be used by the upcoming microservices to get details about core monolith models, so the same code doesn’t have to be rewritten in each microservice.
Approach for separating out OMS from monolith
There are multiple patterns for separating a microservice out of a monolith. As OMS sits higher in the call stack, we went with the Strangler pattern, which offered us an incremental, reliable process for refactoring code. You can read more about the Strangler pattern here.
In our case, the Strangler pattern’s proxy was our public load balancer. We created a new service in production with the same codebase and repo, and started diverting traffic towards it as and when we became confident about a particular feature.
The monolith DB was becoming a major bottleneck for us at that point, so we decided to separate the DB first and the rest afterwards.
As the monolith is written in Django, we decided to keep OMS in Django to reduce the scope of the project.
A brief idea of steps involved:
1) Removing foreign key constraints among OMS and non-OMS models
We had to drop foreign key constraints so the models could be mapped to different underlying databases. To retain the same query performance, we ensured that indexes were present on the join columns.
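A sketch of the two-step schema change, with hypothetical MySQL-style table and constraint names (our actual DDL may have differed): drop the cross-domain FK constraint, then keep an explicit index on the join column so lookup performance is unchanged. At the Django level, a related option is setting `db_constraint=False` on the `ForeignKey` field, which keeps the ORM relation but omits the database-level constraint.

```python
# Hypothetical DDL for one OMS table referencing the non-OMS user model.
# Dropping an FK in MySQL can also drop its implicit index, so we recreate
# the index on the join column explicitly.
DROP_CONSTRAINT = "ALTER TABLE oms_order DROP FOREIGN KEY oms_order_user_id_fk;"
ADD_INDEX = "CREATE INDEX idx_oms_order_user_id ON oms_order (user_id);"

MIGRATION_SQL = DROP_CONSTRAINT + "\n" + ADD_INDEX
```

In Django such statements would typically be applied through a `migrations.RunSQL` operation rather than run by hand.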
Changes were also made at the code level to handle the above DB changes. We went through the codebase to find existing queries that mixed non-OMS and OMS models, and separated them so that the two sets of models could reside in different databases.
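The shape of that query separation can be sketched as follows, with plain dicts standing in for ORM rows and hypothetical model/field names: what used to be a single ORM join across OMS and non-OMS tables becomes two independent queries, one per database, stitched together in application code.

```python
# Two in-memory "databases" standing in for the separated OMS and monolith DBs.
OMS_DB = {"subscriptions": [{"id": 1, "user_id": 10}, {"id": 2, "user_id": 11}]}
MONOLITH_DB = {"users": [{"id": 10, "name": "Asha"}, {"id": 11, "name": "Ravi"}]}

def active_subscriptions_with_users():
    """Before: one joined ORM query, e.g.
        Subscription.objects.select_related("user")
    After: query the OMS DB, collect user_ids, then query the monolith DB."""
    subs = OMS_DB["subscriptions"]                      # query 1: OMS DB
    user_ids = {s["user_id"] for s in subs}
    users = {u["id"]: u                                 # query 2: monolith DB
             for u in MONOLITH_DB["users"] if u["id"] in user_ids}
    # Stitch the two result sets together in application code.
    return [{**s, "user": users[s["user_id"]]} for s in subs]
```

The same stitching pattern applies whether the second query hits another database or, later, a monolith API.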
2) Migration to a new database
During the DB migration, we made a conscious call to take ~4 minutes of downtime for write queries, scheduled in the lowest-traffic hour, when we used to receive very few payments. The below diagram depicts our process:
3) Separating out service and deployment pipelines
We created a new service for OMS and built a new deployment pipeline that deploys the same monolith repository to it. We then shifted OMS traffic to this new service by adding path-based routing rules in the load balancer.
4) Replacing usages of other models in the OMS codebase
There were thousands of usages of non-OMS models in the OMS codebase.
Why was it challenging?
Django-ORM-specific code was scattered throughout the codebase, and due to the lack of separation of concerns, any change in ORM code also required changes in the business logic that followed it. Resolving every occurrence manually would have roughly tripled the scope.
Following are major use cases we had to solve:
Use Case 1: Direct DB queries on non-OMS models, with various fields and foreign keys of those models then accessed in subsequent code.
At this point, the repository was the same for both the monolith and OMS. To decouple from non-OMS models, we divided the problem into two parts:
- non-OMS models heavily used in code
- non-OMS models with fewer usages.
For the 2nd part, where usages were limited, we exposed APIs on the monolith side and integrated them directly into the OMS codebase.
For the 1st part, we had to find a way to replace a Django ORM query without requiring any changes in the subsequent business logic, to minimise effort. We therefore created packages and interfaces whose main components are:
1) Core Pkg: Replacing all queries with plain APIs was tricky, as Django queries come in many variations: custom managers, eager loading of foreign keys to any nested level, different filters per query, etc. To handle all such Django queries, Core Pkg was created, introducing support for a QueryBuilder. It internally calls monolith APIs to get data about core non-OMS models from different services.
2) QueryBuilder: QueryBuilder helps us form the payload of an API request by providing an interface similar to a Django ORM queryset. It supports specifying which fields are required, which foreign key details are needed, which filters to apply, etc. We used method chaining within QueryBuilder, similar to the Django ORM, to give developers an easy and familiar interface. A DB query is thus first converted to a QueryBuilder. Example:
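A minimal sketch of such a method-chaining QueryBuilder (method and payload names are illustrative, not the real Core Pkg interface): each chained call records intent, and `build()` emits the API request payload.

```python
class QueryBuilder:
    """Turns a Django-ORM-like chained expression into an API request payload."""

    def __init__(self, model: str):
        self._payload = {"model": model, "filters": {},
                         "fields": [], "select_related": []}

    def filter(self, **kwargs):
        self._payload["filters"].update(kwargs)
        return self  # return self to allow chaining, like a Django queryset

    def values(self, *fields):
        self._payload["fields"].extend(fields)
        return self

    def select_related(self, *relations):
        self._payload["select_related"].extend(relations)
        return self

    def build(self) -> dict:
        return self._payload

# e.g. User.objects.filter(is_active=True).values("id", "name") becomes:
payload = (QueryBuilder("user")
           .filter(is_active=True)
           .values("id", "name")
           .build())
```

In production, the payload would be handed to Core Pkg, which issues the corresponding monolith API call.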
3) APIs to expose non-OMS model details: APIs were created on the monolith with support for dynamic filters, foreign keys etc to handle the various use cases of non-OMS models. These are called by Core Pkg to fetch the required data.
4) Data Fetcher Layer: We created a separate layer through which all communication for models supported by Core Pkg happens, to ensure separation of concerns. This helps us avoid issues like the ones we faced earlier due to the data layer not being separated. All usages of QueryBuilder lie within this layer, so if we ever need to change the data source of non-OMS models in the future, changes will be needed only at this layer instead of scanning the huge codebase again.
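A small sketch of such a fetcher layer (all class and method names are hypothetical): the transport is injected, so business logic never touches the wire protocol, and swapping the data source later means changing only the transport.

```python
class UserFetcher:
    """Data Fetcher Layer for the non-OMS user model: the only place that
    knows how user data is actually obtained."""

    def __init__(self, transport):
        # `transport` is anything with .execute(payload) -> list[dict];
        # in production this would be the Core Pkg API client.
        self._transport = transport

    def get_active_users(self, *fields):
        payload = {"model": "user", "filters": {"is_active": True},
                   "fields": list(fields)}
        return self._transport.execute(payload)

class FakeTransport:
    """Stand-in for the Core Pkg client, for illustration; it applies the
    filters to canned rows and skips field projection for brevity."""
    def execute(self, payload):
        rows = [{"id": 1, "is_active": True}, {"id": 2, "is_active": False}]
        return [r for r in rows
                if all(r.get(k) == v for k, v in payload["filters"].items())]
```

Business logic then calls `UserFetcher(...).get_active_users("id")` with no knowledge of whether the rows came from a DB or an API.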
5) Django ORM queryset simulator: QuerySetSimulator provides an interface through which we can continue using major Django queryset functionality like .first(), .last(), .values() etc. This let us avoid modifying subsequent code that depends on the queryset: the response is now fetched internally from an API instead of the DB and converted into a QuerySetSimulator, which supports functions similar to those of an actual Django queryset.
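A stripped-down sketch of the idea (the real simulator supports far more of the queryset surface): wrap the API-fetched rows and expose the handful of queryset methods downstream code relies on.

```python
class QuerySetSimulator:
    """Wraps rows fetched from a monolith API and mimics the Django queryset
    methods that downstream business logic depends on."""

    def __init__(self, rows):
        self._rows = list(rows)  # rows come from the API, not the DB

    def first(self):
        return self._rows[0] if self._rows else None

    def last(self):
        return self._rows[-1] if self._rows else None

    def values(self, *fields):
        if not fields:
            return [dict(r) for r in self._rows]
        return [{f: r[f] for f in fields} for r in self._rows]

    def count(self):
        return len(self._rows)

    def __iter__(self):
        return iter(self._rows)

qs = QuerySetSimulator([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
```

Code that previously did `queryset.first()` or `queryset.values("id")` keeps working unchanged against the simulator.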
6) Serializer simulator: It provides an interface for defining serializers similar to Django’s, but instead of getting data from the DB, it internally calls monolith APIs using Core Pkg. We kept the response structure the same as the Django serializer’s, so that subsequent code depending on the serializer doesn’t have to change.
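A deliberately minimal sketch of the simulator’s interface (field handling, nesting and custom fields are omitted; names are hypothetical): declare fields as in a Django serializer, but feed it API-fetched dicts, and `.data` keeps the familiar shape.

```python
class SerializerSimulator:
    """Exposes a `.data` property shaped like a Django serializer's output,
    sourced from API-fetched dicts instead of model instances."""

    fields = ()  # subclasses declare which fields to expose

    def __init__(self, instance, many=False):
        self._instances = instance if many else [instance]
        self._many = many

    @property
    def data(self):
        out = [{f: row.get(f) for f in self.fields} for row in self._instances]
        return out if self._many else out[0]

class UserSerializer(SerializerSimulator):
    fields = ("id", "name")

user = {"id": 1, "name": "Asha", "secret": "x"}  # e.g. a row from an API call
```

Downstream code reading `UserSerializer(user).data` sees the same structure it did with the original Django serializer, with undeclared fields dropped.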
Block diagram for solution using the above components:
Simplified class diagram (some methods and attributes excluded for simplicity):
Use Case 2: Standard Django serializers getting data either directly from the database or through custom implementations of some fields.
To handle this scenario, we replaced the standard Django serializers with the serializer simulators described above.
That covers the first few steps in detail. Stay tuned to learn about the rest of the steps in further parts of this series!