PallyCon DR(Disaster Recovery) System


The stability of online services built on cloud platforms became a major issue after the large-scale failure of the AWS Seoul region in November last year. Traditionally, high-availability (HA) designs add redundant systems within a single region so that the failure of some components does not interrupt the service. However, HA within one region cannot cope with a failure that affects the whole region, so a multi-region DR system is needed. This article describes how a multi-region DR system has been applied to the PallyCon cloud service to respond automatically to large-scale cloud platform failures and minimize the damage.

 

Introducing PallyCon DR System

 

The PallyCon DR system uses the AWS Seoul region as the main system under normal conditions. When the health check on the Seoul region detects a failure of the main system, the service is switched automatically to the backup system in the Tokyo region.

Amazon Route 53 periodically checks the service status of the AWS Seoul region and, in the event of a failure, switches the service's DNS to the Tokyo region.
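The DNS switch described above can be sketched with Route 53 failover records. The example below only builds the ChangeBatch structure accepted by the Route 53 ChangeResourceRecordSets API and does not call AWS. The host names, the health check ID, and the TTL are hypothetical placeholders, and this is an illustration under those assumptions rather than PallyCon's actual configuration.

```python
def failover_change_batch(name, primary_target, secondary_target,
                          primary_health_check_id, ttl=60):
    """Build a ChangeBatch with one PRIMARY and one SECONDARY failover record.

    Route 53 answers with the PRIMARY record while its health check passes
    and falls back to the SECONDARY record when the check fails.
    """
    def record(role, target, health_check_id=None):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-{target}",
            "Failover": role,              # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                    # assumed value; short TTLs speed up the switch
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id is not None:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Seoul primary, Tokyo secondary (illustrative)",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }


# Hypothetical names, for illustration only.
batch = failover_change_batch(
    "license.example.com.",
    "seoul.example.com",
    "tokyo.example.com",
    "hc-seoul-0001",
)
```

With boto3 installed and credentials configured, a batch like this could be passed to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`.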

 

Health Check

PallyCon DR System health check function

Cycle: 30 seconds (can be shortened to a minimum of 10 seconds)
Region: Seoul, Tokyo
Method: Checks whether the region's database connection is in a normal state by calling a specific API, such as the DRM license request URL
Failover condition: If a service failure is detected continuously for 3 minutes, the service is switched to the Tokyo region. Once the Seoul region has recovered and a normal service state is detected continuously for 3 minutes, the service returns to the Seoul region.
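The failover condition can be read as a small state machine. With a 30-second cycle, 3 minutes of continuous detection corresponds to 180 / 30 = 6 consecutive check results. The sketch below simulates that rule in plain Python. The class and method names are hypothetical, and the real switch is performed by Route 53 rather than by application code like this.

```python
CHECK_INTERVAL_S = 30       # health check cycle from the table above
SWITCH_AFTER_S = 180        # 3 minutes of continuous detection
THRESHOLD = SWITCH_AFTER_S // CHECK_INTERVAL_S   # 6 consecutive results


class FailoverController:
    """Tracks which region serves traffic, given Seoul health check results."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.active = "Seoul"
        self._streak = 0    # consecutive results that argue for a switch

    def observe(self, seoul_healthy):
        """Record one health check result and return the active region."""
        wants_switch = (
            (self.active == "Seoul" and not seoul_healthy)
            or (self.active == "Tokyo" and seoul_healthy)
        )
        # Any contrary result resets the streak, so detection must be continuous.
        self._streak = self._streak + 1 if wants_switch else 0
        if self._streak >= self.threshold:
            self.active = "Tokyo" if self.active == "Seoul" else "Seoul"
            self._streak = 0
        return self.active


controller = FailoverController()
for _ in range(6):                      # six failed checks in a row (3 minutes)
    region = controller.observe(seoul_healthy=False)
print(region)                           # Tokyo
```

A single healthy result in the middle of a failure streak resets the count, which is what "continuously detected" implies in this reading.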
 

 

Alarming

Health check results are stored in Amazon CloudWatch. A CloudWatch alarm linked to Amazon SNS notifies administrators when disaster recovery processing takes place.
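As an illustration of the alarm wiring, the sketch below builds the parameters for a CloudWatch alarm on the Route 53 `HealthCheckStatus` metric, which reports 1 for healthy and 0 for unhealthy, and sends state changes to an SNS topic. The alarm name, the health check ID, the topic ARN, and the 3-minute evaluation window are assumptions chosen to mirror the failover condition. They are not taken from PallyCon's setup.

```python
def health_alarm_params(health_check_id, sns_topic_arn, minutes=3):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"dr-health-{health_check_id}",   # hypothetical naming scheme
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": 60,                      # seconds per evaluation period
        "EvaluationPeriods": minutes,      # alarm after N unhealthy minutes in a row
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],   # notify when the alarm fires
        "OKActions": [sns_topic_arn],      # notify again on recovery
    }


# Hypothetical identifiers, for illustration only.
params = health_alarm_params(
    "hc-seoul-0001",
    "arn:aws:sns:us-east-1:123456789012:dr-alerts",
)
```

With boto3, these parameters could be applied with `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. Route 53 health check metrics are published in the US East (N. Virginia) region.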

 

DR Server Architecture and Restrictions

The database used by the PallyCon service is replicated in real time to a cross-region replica. When the service is running in the Tokyo region because of a failure, existing information can still be queried and licenses can still be issued, with the database in a 'Read Only' state. This backup system minimizes the impact of a regional failure on PallyCon's customers.

 

However, new data such as content packaging information cannot be written during the failure, because multi-master operation across regions is not supported for the database.
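The read-only restriction can be pictured as a guard in the service layer: reads such as license issuance keep working against the replica, while writes are rejected. The following is a minimal sketch with hypothetical class and method names and an in-memory dictionary standing in for the replicated database. It does not describe PallyCon's actual server code.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while running on the read-only replica."""


class LicenseService:
    def __init__(self, store, read_only):
        self._store = store          # stand-in for the (replicated) database
        self._read_only = read_only  # True when serving from the backup region

    def issue_license(self, content_id):
        """Read path: only looks up existing packaging info, so it works on a replica."""
        key = self._store[content_id]
        return {"content_id": content_id, "key": key}

    def register_packaging_info(self, content_id, key):
        """Write path: refused while the service runs in read-only mode."""
        if self._read_only:
            raise ReadOnlyError("packaging info cannot be written during failover")
        self._store[content_id] = key


# The backup region serves existing content but refuses new packaging data.
tokyo = LicenseService({"movie-1": "key-abc"}, read_only=True)
print(tokyo.issue_license("movie-1")["key"])   # key-abc
try:
    tokyo.register_packaging_info("movie-2", "key-def")
except ReadOnlyError as exc:
    print("rejected:", exc)
```

Failing fast with an explicit error lets callers retry the write after the service has returned to the main region.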

 

By default, the backup system in the Tokyo region runs one instance of each major server, but auto-scaling can add instances automatically depending on the traffic.
