AWS Route 53 as load balancer for ROSA

This guide describes how to achieve automatic client failover for a Keycloak deployment when a given site fails.

Route 53 for Client Failover

To provide client failover, we can leverage AWS Route 53's DNS failover capabilities to automatically re-route traffic when the primary site is down. A health check on AWS checks every 30 seconds if a site is responding, and the DNS as seen by the client will update accordingly.

The setup script ./provision/aws/route53/route53_create.sh generates a subdomain name with three host entries under our root domain keycloak-benchmark.com.

primary.<generated-subdomain>.keycloak-benchmark.com

Subdomain for Keycloak site 1

backup.<generated-subdomain>.keycloak-benchmark.com

Subdomain for Keycloak site 2

client.<generated-subdomain>.keycloak-benchmark.com

Subdomain used by Keycloak clients, which automatically fails over from site 1 to site 2 in the event of a failure.
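As an illustration of the naming scheme above, the following minimal sketch prints the three host names for a given generated subdomain. The function name site_hosts is hypothetical and not part of the repository; the actual names are produced by the setup script itself.

```shell
# Print the three host names that belong to one generated subdomain.
# $1: generated subdomain (a placeholder value is fine for illustration)
# $2: root domain, defaults to keycloak-benchmark.com
site_hosts() {
  sh_sub="${1:-}"
  sh_root="${2:-keycloak-benchmark.com}"
  if [ -z "$sh_sub" ]; then
    echo "usage: site_hosts <generated-subdomain> [root-domain]" >&2
    return 1
  fi
  # One host per role: site 1, site 2, and the failover name used by clients.
  for sh_role in primary backup client; do
    printf '%s.%s.%s\n' "$sh_role" "$sh_sub" "$sh_root"
  done
}
```

For example, `site_hosts abc123` lists primary.abc123.keycloak-benchmark.com, backup.abc123.keycloak-benchmark.com and client.abc123.keycloak-benchmark.com, where abc123 stands in for the generated part.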

Those DNS entries are registered with the OpenShift clusters so that they respond to requests to those host names. After the setup, the Keycloak deployment is updated to use the new host names.

See below for the newly created elements (green) and the updated elements (yellow).

[Diagram: Route 53 configuration (route 53 configuration.dio)]

To ensure that a failed Primary site cannot be marked healthy without user input, e.g. in the event of an automated ROSA cluster restart, we create an AWS SNS topic which fires an event whenever the Primary health check fails. This topic is used to trigger an AWS Lambda function which updates the health check to point to a non-existent endpoint /lb-check-failed-over. To fail back from the Backup to the Primary cluster, the health check must be manually updated back to /lb-check.

Setup new Route 53 failover

Prerequisites

A hosted zone already exists for keycloak-benchmark.com

Procedure

  1. Create two ROSA clusters

  2. Create subdomain records and health checks

    PRIMARY_CLUSTER=<name-of-primary-rosa-cluster> \
    BACKUP_CLUSTER=<name-of-backup-rosa-cluster> \
    ./provision/aws/route53/route53_create.sh

    Note down the domain and URLs generated by the script for the following steps. The generated part of the subdomain name allows for multiple Keycloak instances in the different clusters.

    Domain: <generated-subdomain>.keycloak-benchmark.com
    Client Site URL: client.<generated-subdomain>.keycloak-benchmark.com
    Primary Site URL: primary.<generated-subdomain>.keycloak-benchmark.com
    Backup Site URL: backup.<generated-subdomain>.keycloak-benchmark.com
  3. Deploy Keycloak as normal, but with the following environment variables set.

    1. Primary cluster:

      KC_HOSTNAME_OVERRIDE=client.<generated-subdomain>.keycloak-benchmark.com # Hostname used by clients
      KC_HEALTH_HOSTNAME=primary.<generated-subdomain>.keycloak-benchmark.com # Hostname used by AWS health checks
    2. Backup cluster:

      KC_HOSTNAME_OVERRIDE=client.<generated-subdomain>.keycloak-benchmark.com # Hostname used by clients
      KC_HEALTH_HOSTNAME=backup.<generated-subdomain>.keycloak-benchmark.com # Hostname used by AWS health checks
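The two sites differ only in the health check host name, so the settings can be derived from the site role and the domain printed by the create script. The sketch below is one way to generate them; the function name keycloak_site_env is hypothetical, and how the variables reach the deployment (for example an env file or exported shell variables) is an assumption that depends on your setup.

```shell
# Print the two environment settings for one site.
# $1: site role, either "primary" or "backup"
# $2: domain printed by route53_create.sh,
#     i.e. <generated-subdomain>.keycloak-benchmark.com
keycloak_site_env() {
  ke_role="${1:-}"
  ke_domain="${2:-}"
  case "$ke_role" in
    primary|backup) ;;
    *) echo "role must be 'primary' or 'backup'" >&2; return 1 ;;
  esac
  if [ -z "$ke_domain" ]; then
    echo "usage: keycloak_site_env <primary|backup> <domain>" >&2
    return 1
  fi
  # Clients always use the client. host; only the health host differs per site.
  printf 'KC_HOSTNAME_OVERRIDE=client.%s\n' "$ke_domain"
  printf 'KC_HEALTH_HOSTNAME=%s.%s\n' "$ke_role" "$ke_domain"
}
```

Running `keycloak_site_env backup <generated-subdomain>.keycloak-benchmark.com` prints the same two lines as shown for the backup cluster above.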

Testing Failover

To test failover from primary to the backup site, do the following:

  1. Verify that client.<generated-subdomain>.keycloak-benchmark.com connects to primary.

    ./provision/aws/route53/route53_test_primary_used.sh <generated-subdomain>.keycloak-benchmark.com; echo $?

    The script returns 0 if the client. subdomain points to the same IP as the primary. subdomain.

    This script will fail if the PRIMARY_CLUSTER and BACKUP_CLUSTER are set to the same ROSA cluster.
  2. Log in to the primary ROSA cluster and delete the aws-health-route Route from the keycloak namespace.

  3. Wait about 30 seconds for the health check to determine that primary.<generated-subdomain>.keycloak-benchmark.com is no longer healthy. This can be confirmed by inspecting the health check in the AWS console.

  4. Run the script from the first step again; it should now return an exit code of 1.
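Because the health check and DNS caches take some time to react, steps 3 and 4 can be combined into a polling loop instead of a fixed wait. The following is a minimal sketch with a hypothetical helper name, wait_for_exit_code; the timeout and interval values are arbitrary assumptions and must be whole seconds.

```shell
# Re-run a command until it exits with the expected code or a timeout expires.
# $1: expected exit code
# $2: timeout in whole seconds
# $3: whole seconds to sleep between attempts
# remaining arguments: the command to run
wait_for_exit_code() {
  wf_expected="$1"
  wf_timeout="$2"
  wf_interval="$3"
  shift 3
  wf_elapsed=0
  while :; do
    # Capture the exit code without aborting under "set -e".
    if "$@" >/dev/null 2>&1; then wf_rc=0; else wf_rc=$?; fi
    if [ "$wf_rc" -eq "$wf_expected" ]; then
      return 0
    fi
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      return 1
    fi
    sleep "$wf_interval"
    wf_elapsed=$((wf_elapsed + wf_interval))
  done
}
```

For the failover test, a call could look like `wait_for_exit_code 1 120 10 ./provision/aws/route53/route53_test_primary_used.sh <generated-subdomain>.keycloak-benchmark.com`. The helper returns 0 once the script reports that client. no longer points to primary., and 1 if that does not happen within the timeout.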

Testing Failback

To fail back from the Backup to the Primary cluster, perform the following:

  1. Recreate the aws-health-route on the PRIMARY_CLUSTER

  2. Update the ResourcePath of the Primary cluster's Route 53 health check to /lb-check.

Once both actions have been taken, the health check will eventually pass and the client. record will revert to routing requests to the primary cluster.
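The ResourcePath update can be done in the AWS console or with the AWS CLI command `aws route53 update-health-check`. The sketch below only prints the command so that it can be reviewed before running it; the function name failback_command is hypothetical, and the health check ID has to be looked up first, for example in the AWS console or with `aws route53 list-health-checks`.

```shell
# Print (do not run) the AWS CLI call that points a health check back at /lb-check.
# $1: ID of the Primary site's Route 53 health check
# $2: resource path, defaults to /lb-check
failback_command() {
  fb_id="${1:-}"
  fb_path="${2:-/lb-check}"
  if [ -z "$fb_id" ]; then
    echo "usage: failback_command <health-check-id> [resource-path]" >&2
    return 1
  fi
  printf 'aws route53 update-health-check --health-check-id %s --resource-path %s\n' \
    "$fb_id" "$fb_path"
}
```

After checking the printed command, it can be copied into a shell that has AWS credentials for the account hosting the health check.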

Remove Route 53 Failover

To delete the generated subdomain including the health checks, run the following command.

SUBDOMAIN=<generated-subdomain> \
./provision/aws/route53/route53_delete.sh