Keycloak with external Infinispan deployment

This document contains details of the Keycloak metrics that can be used to monitor your deployment’s performance.

The architecture described in this document applies to multi-site deployments, in which Keycloak nodes use an external Infinispan to store the cached data.

If your deployment does not use an external Infinispan, see the Keycloak cluster deployment guide.

Enable Keycloak metrics

Keycloak exposes metrics on the management interface endpoint /metrics. To enable them, use the build-time option --metrics-enabled=true.

On a Kubernetes cluster that uses the Keycloak Operator, metrics can be enabled through additionalOptions in the Keycloak CR, as shown below:

apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  labels:
    app: keycloak
  name: keycloak
spec:
  additionalOptions:
    - name: metrics-enabled
      value: 'true'

Additional information can be found here.
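
Once metrics are enabled, the /metrics endpoint returns samples in the Prometheus text format. The following Python sketch shows one way such output could be turned into a lookup table for ad-hoc checks. The function name parse_metrics and the sample text are illustrative assumptions, and the parser is deliberately simplified: it ignores optional timestamps and does not split the label string into individual labels.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        # Split "name{labels}" into the metric name and the raw label string.
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


# Hypothetical excerpt of a /metrics response, for illustration only.
sample_text = """# TYPE http_server_active_requests gauge
http_server_active_requests 3.0
http_server_requests_seconds_count{outcome="SUCCESS",status="200",uri="/realms/master"} 42.0
"""

samples = parse_metrics(sample_text)
print(samples[("http_server_active_requests", "")])
```

In production, a monitoring system such as Prometheus normally scrapes and stores these samples; a hand-written parser like this is only useful for quick inspection.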

Keycloak HTTP metrics

This section describes metrics for monitoring how Keycloak processes HTTP requests.

Processing time

These metrics expose the processing time, so you can monitor Keycloak performance and how long it takes to process requests.

On a healthy cluster, the average processing time will remain stable. Spikes or increases in the processing time may be an early sign that some node is under load.

Tags

  • outcome: A general outcome category derived from the HTTP status code, such as SUCCESS or CLIENT_ERROR.

  • status: The HTTP status code.

  • uri: The requested URI.

Metric                              Description
http_server_requests_seconds_count  The total number of requests processed.
http_server_requests_seconds_sum    The total duration of all the requests processed.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
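
Because both metrics are cumulative, the average processing time for a period is the change in the sum divided by the change in the count. The Python sketch below illustrates the arithmetic for two consecutive scrapes; the function name and the sample values are assumptions made for illustration.

```python
def average_request_seconds(prev_sum, prev_count, curr_sum, curr_count):
    """Average processing time of the requests completed between two scrapes."""
    delta_count = curr_count - prev_count
    delta_sum = curr_sum - prev_sum
    if delta_count <= 0:
        # No requests completed, or the counters were reset by a restart.
        return None
    return delta_sum / delta_count


# Hypothetical scrapes: 100 requests taking 12.5 s in total, then 150 taking 20.0 s.
print(average_request_seconds(12.5, 100, 20.0, 150))  # 0.15
```

A monitoring system would typically compute the same quantity with a rate-based query over a time window instead of comparing two raw samples.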

Active requests

The current number of active requests is also available.

Metric                       Description
http_server_active_requests  The current number of active requests.

Bandwidth

The metrics below help monitor the bandwidth that Keycloak uses and the traffic consumed by the requests received and the responses sent.

Metric                           Description
http_server_bytes_written_count  The total number of responses sent.
http_server_bytes_written_sum    The total number of bytes sent.
http_server_bytes_read_count     The total number of requests received.
http_server_bytes_read_sum       The total number of bytes received.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
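
As a rough illustration of how the bandwidth metrics combine, the following Python sketch derives the outbound throughput and the average response size from two scrapes of http_server_bytes_written_sum and http_server_bytes_written_count. The dictionary keys, the function name, and the numbers are hypothetical.

```python
def bandwidth_summary(prev, curr, interval_seconds):
    """Summarize outbound traffic between two scrapes.

    prev and curr are dicts holding 'written_sum' and 'written_count' samples.
    """
    sent_bytes = curr["written_sum"] - prev["written_sum"]
    responses = curr["written_count"] - prev["written_count"]
    return {
        "bytes_per_second": sent_bytes / interval_seconds,
        "avg_response_bytes": sent_bytes / responses if responses > 0 else None,
    }


# Hypothetical samples taken 60 seconds apart.
previous = {"written_sum": 1_000_000, "written_count": 500}
current = {"written_sum": 1_600_000, "written_count": 800}
summary = bandwidth_summary(previous, current, 60)
print(summary)
```

The same calculation applies to inbound traffic when the http_server_bytes_read_sum and http_server_bytes_read_count samples are used instead.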

Infinispan Caches

Keycloak caches data in embedded Infinispan caches. The metrics in this section help monitor the caching health.

Global tags

  • cache=<name>: The cache name.

Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.

Sum the unique entries metric across all nodes to get the cluster-wide total number of entries.

Metric                                        Description
vendor_statistics_approximate_entries         The approximate number of entries stored by the node, including backup copies.
vendor_statistics_approximate_entries_unique  The approximate number of entries stored by the node, excluding backup copies.
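
The cluster-wide total can be sketched as a per-cache sum of vendor_statistics_approximate_entries_unique over all nodes. In the Python example below, the node names and entry counts are made up for illustration, and the function name is an assumption.

```python
from collections import defaultdict


def cluster_entry_totals(samples):
    """Sum unique entries per cache.

    samples is an iterable of (node, cache, unique_entries) tuples, one per
    node and cache, taken from vendor_statistics_approximate_entries_unique.
    """
    totals = defaultdict(int)
    for _node, cache, unique_entries in samples:
        totals[cache] += unique_entries
    return dict(totals)


# Hypothetical samples from a two-node cluster.
scraped = [
    ("keycloak-0", "sessions", 120),
    ("keycloak-1", "sessions", 130),
    ("keycloak-0", "users", 40),
]
totals = cluster_entry_totals(scraped)
print(totals)
```

Summing vendor_statistics_approximate_entries instead would count backup copies as well, so it overstates the number of distinct entries.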

Data Access

The following metrics monitor cache accesses, such as reads and writes, and their duration.

Stores

A store operation writes or updates a value in the cache.

Metric                                       Description
vendor_statistics_store_times_seconds_count  The total number of store requests.
vendor_statistics_store_times_seconds_sum    The total duration of all store requests.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Reads

A read operation reads a value from the cache. Reads divide into two groups: a hit if a value is found, and a miss if it is not.

Metric                                      Description
vendor_statistics_hit_times_seconds_count   The total number of read hits.
vendor_statistics_hit_times_seconds_sum     The total duration of all read hits.
vendor_statistics_miss_times_seconds_count  The total number of read misses.
vendor_statistics_miss_times_seconds_sum    The total duration of all read misses.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Removes

A remove operation removes a value from the cache. Removes divide into two groups: a hit if the value exists, and a miss if it does not.

Metric                                             Description
vendor_statistics_remove_hit_times_seconds_count   The total number of remove hits.
vendor_statistics_remove_hit_times_seconds_sum     The total duration of all remove hits.
vendor_statistics_remove_miss_times_seconds_count  The total number of remove misses.
vendor_statistics_remove_miss_times_seconds_sum    The total duration of all remove misses.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

For the users and realms caches, a database invalidation translates into a remove operation. These metrics are a good indicator of how frequently the database entities are modified and therefore removed from the cache.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count/(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count)
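
The same arithmetic can be sketched outside the monitoring system. The Python function below, whose name and sample values are illustrative assumptions, guards against the division by zero that occurs when a cache has not been read yet.

```python
def hit_ratio(hits, misses):
    """Fraction of operations that were hits, or None if nothing was recorded."""
    total = hits + misses
    return hits / total if total > 0 else None


# Hypothetical counters: 900 read hits and 100 read misses.
print(hit_ratio(900, 100))  # 0.9
```

Passing the remove hit and remove miss counters instead yields the hit ratio for remove operations. Applied to raw counters, the result covers the whole lifetime of the node; applying it to the difference between two scrapes gives the ratio for that interval only.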

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count)/(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count+vendor_statistics_remove_hit_times_seconds_count+vendor_statistics_remove_miss_times_seconds_count+vendor_statistics_store_times_seconds_count)
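
For readability, the expression above can be broken into named parts. The following Python sketch mirrors it term by term; the function name and sample values are assumptions made for illustration.

```python
def read_ratio(hits, misses, remove_hits, remove_misses, stores):
    """Share of cache operations that are reads (hits plus misses)."""
    reads = hits + misses
    writes = remove_hits + remove_misses + stores
    total = reads + writes
    return reads / total if total > 0 else None


# Hypothetical counters: 800 reads, 100 removes and 100 stores.
print(read_ratio(700, 100, 50, 50, 100))  # 0.8
```

A value close to 1 indicates a read-heavy cache, while a value close to 0 indicates a cache dominated by stores and removes.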

Eviction

Eviction is the process that limits the cache size: when the cache is full, an entry is removed to make room for a new entry. As Keycloak caches the database entities in the users, realms and authorization caches, a database access is always accompanied by an eviction event once a cache is full.

Metric                       Description
vendor_statistics_evictions  The total number of eviction events.

Eviction rate

A rapid increase in evictions combined with very high database CPU usage means that the users or realms cache is too small for smooth Keycloak operation, as data needs to be reloaded very often from the database, which slows down responses. If enough memory is available, consider increasing the maximum cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count.
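
To make "a rapid increase" measurable, the eviction total can be converted into a per-second rate between two scrapes. The Python sketch below is one possible way to do this; the function name, the sample values, and the treatment of a decreasing total as a reset (for example after a node restart) are assumptions.

```python
def eviction_rate(prev_evictions, curr_evictions, interval_seconds):
    """Evictions per second between two samples of vendor_statistics_evictions."""
    if curr_evictions < prev_evictions:
        # The total went down, so assume it was reset and count from zero.
        delta = curr_evictions
    else:
        delta = curr_evictions - prev_evictions
    return delta / interval_seconds


# Hypothetical samples taken 60 seconds apart.
print(eviction_rate(1000, 1600, 60))  # 10.0
```

What counts as too high depends on the deployment, so the rate is best compared against its own baseline and correlated with database CPU usage.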

Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.

Metric                                            Description
vendor_transactions_prepare_times_seconds_count   The total number of prepare requests.
vendor_transactions_prepare_times_seconds_sum     The total duration of all prepare requests.
vendor_transactions_rollback_times_seconds_count  The total number of rollback requests.
vendor_transactions_rollback_times_seconds_sum    The total duration of all rollback requests.
vendor_transactions_commit_times_seconds_count    The total number of commit requests.
vendor_transactions_commit_times_seconds_sum      The total duration of all commit requests.

When histograms are enabled, the percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
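
As an illustration of how the transaction metrics might be combined, the Python sketch below flags new rollbacks between two scrapes and derives the average prepare duration. The dictionary keys, the function name, and the sample values are hypothetical.

```python
def transaction_health(prev, curr):
    """Compare two scrapes of the transaction metrics.

    prev and curr are dicts with 'prepare_count', 'prepare_sum' and
    'rollback_count' samples.
    """
    rollbacks = curr["rollback_count"] - prev["rollback_count"]
    prepares = curr["prepare_count"] - prev["prepare_count"]
    prepare_seconds = curr["prepare_sum"] - prev["prepare_sum"]
    return {
        "healthy": rollbacks == 0,
        "new_rollbacks": rollbacks,
        "avg_prepare_seconds": prepare_seconds / prepares if prepares > 0 else None,
    }


# Hypothetical samples: 100 new prepare requests and 2 new rollbacks.
before = {"prepare_count": 200, "prepare_sum": 4.0, "rollback_count": 0}
after = {"prepare_count": 300, "prepare_sum": 6.5, "rollback_count": 2}
report = transaction_health(before, after)
print(report)
```

The sketch treats any new rollback as unhealthy; since deadlocks can cause occasional rollbacks, an alert threshold above zero may be more practical.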