Deploy an AWS Route 53 Failover Lambda

Building block for loadbalancer resilience

After a Primary cluster has failed over to a Backup cluster due to a health check failure, the Primary must only serve requests again after the SRE team has synchronized the two sites first as outlined in the Switch back to the primary site guide.

If the Primary site would be marked as healthy by the Route 53 Health Check before the sites are synchronized, the Primary Site would start serving requests with outdated session and realm data.

This guide shows how an automatic fallback to a not-yet synchronized Primary site can be prevented with the help of AWS CloudWatch, SNS, and Lambda.

Architecture

In the event of a Primary cluster failure, an AWS CloudWatch alarm sends a message to an AWS SNS topic, which then triggers an AWS Lambda function. The Lambda function updates the Route53 health check of the Primary cluster so that it points to a non-existent path /lb-check-failed-over, thus ensuring that it is impossible for the Primary to be marked as healthy until the path is manually changed back to /lb-check.

Prerequisites

Procedure

  1. Create an SNS topic to trigger a Lambda.

    Command:
    PRIMARY_HEALTH_ID=233e180f-f023-45a3-954e-415303f21eab (1)
    ALARM_NAME=${PRIMARY_HEALTH_ID}
    TOPIC_NAME=${PRIMARY_HEALTH_ID}
    FUNCTION_NAME=${PRIMARY_HEALTH_ID}
    TOPIC_ARN=$(aws sns create-topic --name ${TOPIC_NAME} \
      --query "TopicArn" \
      --tags "Key=HealthCheckId,Value=${PRIMARY_HEALTH_ID}" \
      --region us-east-1 \
      --output text
    )
    1 Replace this with the ID of the Health Check associated with your Primary cluster
  2. Create a CloudWatch alarm to a send message to the SNS topic.

    Command:
    aws cloudwatch put-metric-alarm \
      --alarm-actions ${TOPIC_ARN} \
      --actions-enabled \
      --alarm-name ${ALARM_NAME} \
      --dimensions "Name=HealthCheckId,Value=${PRIMARY_HEALTH_ID}" \
      --comparison-operator LessThanThreshold \
      --evaluation-periods 1 \
      --metric-name HealthCheckStatus \
      --namespace AWS/Route53 \
      --period 60 \
      --statistic Minimum \
      --threshold 1.0 \
      --treat-missing-data notBreaching \
      --region us-east-1
  3. Create the Role used to execute the Lambda.

    Command:
    ROLE_ARN=$(aws iam create-role \
      --role-name ${FUNCTION_NAME} \
      --assume-role-policy-document \
      '{
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }' \
      --query 'Role.Arn' \
      --region us-east-1 \
      --output text
    )
  4. Create a policy with the permissions required by the Lambda.

    Command:
    POLICY_ARN=$(aws iam create-policy \
      --policy-name ${FUNCTION_NAME} \
      --policy-document \
      '{
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "route53:UpdateHealthCheck"
                  ],
                  "Resource": "*"
              }
          ]
      }' \
      --query 'Policy.Arn' \
      --region us-east-1 \
      --output text
    )
  5. Attach the custom policy to the Lambda role.

    Command:
    aws iam attach-role-policy \
      --role-name ${FUNCTION_NAME} \
      --policy-arn ${POLICY_ARN} \
      --region us-east-1
  6. Attach the AWSLambdaBasicExecutionRole policy so that the Lambda logs can be written to CloudWatch

    Command:
    aws iam attach-role-policy \
      --role-name ${FUNCTION_NAME} \
      --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \
      --region us-east-1
  7. Create a Lambda ZIP file.

    Command:
    LAMBDA_ZIP=/tmp/lambda.zip
    cat << EOF > /tmp/lambda.py
    import boto3
    import json
    
    
    def handler(event, context):
        print(json.dumps(event, indent=4))
    
        msg = json.loads(event['Records'][0]['Sns']['Message'])
        healthCheckId = msg['Trigger']['Dimensions'][0]['value']
    
        r53Client = boto3.client("route53")
        response = r53Client.update_health_check(
            HealthCheckId=healthCheckId,
            ResourcePath="/lb-check-failed-over"
        )
    
        print(json.dumps(response, indent=4, default=str))
        statusCode = response['ResponseMetadata']['HTTPStatusCode']
        if statusCode != 200:
            raise Exception("Route 53 Unexpected status code %d" + statusCode)
    
    EOF
    zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
  8. Create the Lambda function.

    Command:
    FUNCTION_ARN=$(aws lambda create-function \
      --function-name ${FUNCTION_NAME} \
      --zip-file fileb://${LAMBDA_ZIP} \
      --handler lambda.handler \
      --runtime python3.11 \
      --role ${ROLE_ARN} \
      --query 'FunctionArn' \
      --region eu-west-1 \(1)
      --output text
    )
    1 Replace with the AWS region hosting your ROSA cluster
  9. Allow the SNS to trigger the Lambda.

    Command:
    aws lambda add-permission \
      --function-name ${FUNCTION_NAME} \
      --statement-id function-with-sns \
      --action 'lambda:InvokeFunction' \
      --principal 'sns.amazonaws.com' \
      --source-arn ${TOPIC_ARN} \
      --region eu-west-1 (1)
    1 Replace with the AWS region hosting your ROSA cluster
  10. Invoke the Lambda when the SNS message is received.

    Command:
    aws sns subscribe --protocol lambda \
      --topic-arn ${TOPIC_ARN} \
      --notification-endpoint ${FUNCTION_ARN} \
      --region us-east-1

Verify

To test the Lambda is triggered as expected, log in to the Primary cluster and scale the Keycloak deployment to zero Pods. Scaling will cause the Primary’s health checks to fail and the following should occur:

  • Route53 should start routing traffic to the Keycloak Pods on the Backup cluster.

  • The Route53 health check for the Primary cluster should have ResourcePath=/lb-check-failed-over

To direct traffic back to the Primary site, scale up the Keycloak deployment and manually revert the changes to the Route53 health check the Lambda has performed.

For more information, see the Switch back to the primary site guide.

On this page