PallyCon DR (Disaster Recovery) System
The stability of online services built on cloud platforms became a major concern after the large-scale outage of the AWS Seoul region in November last year. Traditionally, high-availability (HA) systems add redundancy within a single region so that the failure of some components does not interrupt the service. However, HA systems cannot cope with a failure that affects an entire region, so a multi-region DR system is needed. This article introduces the multi-region DR system applied to the PallyCon cloud service, which automatically responds to large-scale cloud platform failures and minimizes the damage.
Introducing PallyCon DR System
The PallyCon DR system uses the AWS Seoul region as the main system under normal conditions. When the health check for the Seoul region detects a failure of the main system, the service automatically switches to the backup system in the Tokyo region.
|Cycle|30 seconds (minimum 10 seconds possible)|
|Method|Checks whether the region's database connection is normal by calling a specific API, such as the DRM license request URL|
|Failover condition|If a service failure is detected continuously for 3 minutes, the service switches to the Tokyo region. Once the Seoul region recovers and its normal state is detected continuously for 3 minutes, the service returns to the Seoul region.|
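The failover condition in the table can be sketched as a small state machine. This is only an illustration, not PallyCon's actual implementation: the class name `FailoverController`, the `observe` method, and the region labels are hypothetical, and it assumes that "3 minutes" at a 30-second cycle corresponds to six consecutive health check results.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 30   # health check cycle from the table
THRESHOLD_S = 180       # 3 minutes of continuous failure or recovery
# Assumption: 3 minutes at a 30-second cycle is treated as 6 consecutive checks.
CHECKS_NEEDED = THRESHOLD_S // CHECK_INTERVAL_S


@dataclass
class FailoverController:
    """Hypothetical controller that tracks which region serves traffic."""
    active_region: str = "seoul"
    streak: int = 0  # consecutive results that contradict the current routing

    def observe(self, seoul_healthy: bool) -> str:
        """Record one Seoul health check result and return the active region."""
        # A result contradicts the routing when Seoul is failing while active,
        # or Seoul is healthy while Tokyo is serving traffic.
        contradicts = (self.active_region == "seoul") != seoul_healthy
        self.streak = self.streak + 1 if contradicts else 0

        if self.streak >= CHECKS_NEEDED:
            # Fail over to Tokyo, or fail back to Seoul, then restart counting.
            self.active_region = "tokyo" if self.active_region == "seoul" else "seoul"
            self.streak = 0
        return self.active_region
```

Because the streak resets on any result that agrees with the current routing, a single successful check during an unstable period delays failover, which matches the "continuously detected" wording in the table.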
DR Server Architecture and Restrictions
The database used by the PallyCon service is replicated in real time to a cross-region replica. When the service is running in the Tokyo region because of a failure, existing information can be queried and licenses can be issued in a 'Read Only' state. This backup system minimizes the impact of a regional failure on PallyCon's customers.
However, new data such as content packaging information cannot be written during the failure, because multi-master processing across regional databases is not supported.
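The read-only restriction can be pictured as a guard in front of request handling. The sketch below is an assumption for illustration only: the operation names, the `handle` function, and the `ReadOnlyError` exception are hypothetical and do not come from the PallyCon API.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the backup region is serving."""


# Hypothetical operation names used only for this example.
READ_OPERATIONS = {"issue_license", "query_content_info"}
WRITE_OPERATIONS = {"register_packaging_info"}


def handle(operation: str, active_region: str) -> str:
    """Allow reads in any region; reject writes while Tokyo is active."""
    if operation not in READ_OPERATIONS | WRITE_OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    if active_region == "tokyo" and operation in WRITE_OPERATIONS:
        # The Tokyo database is a cross-region replica, so writes are refused.
        raise ReadOnlyError(f"{operation} is unavailable during failover")
    return f"{operation} served from {active_region}"
```

With this guard, `handle("issue_license", "tokyo")` succeeds during a failover, while `handle("register_packaging_info", "tokyo")` raises `ReadOnlyError` until the service returns to the Seoul region.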
The backup system in the Tokyo region normally runs one instance of each major server, but it can scale out automatically through auto-scaling depending on traffic.
Daniel is a DRM specialist and has been associated with this industry for over 10 years. Other than this, he is addicted to reading and writing.