Keycloak cluster deployment
This document describes the metrics to use when monitoring your Keycloak deployment's performance.
The deployment described in this document is a single-site deployment. In this architecture, Keycloak nodes leverage Infinispan embedded clustered caches to form a cluster.
If an external Infinispan is used, check the Keycloak with external Infinispan deployment guide instead.
Enable Keycloak metrics
Keycloak exposes metrics on the management interface endpoint `/metrics`.
To enable them, use the build time option `--metrics-enabled=true`.
On a Kubernetes cluster using the Keycloak Operator, metrics can be enabled in the Keycloak CR `additionalOptions` as shown below:
```yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  labels:
    app: keycloak
  name: keycloak
spec:
  additionalOptions:
    - name: metrics-enabled
      value: 'true'
```
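Outside Kubernetes, a minimal sketch of enabling and then scraping the metrics with the CLI; the hostname is illustrative, and the management interface is assumed to listen on its default port 9000:

```shell
bin/kc.sh build --metrics-enabled=true
bin/kc.sh start --optimized

# metrics are served by the management interface, not the main HTTP listener
curl http://localhost:9000/metrics
```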
Keycloak HTTP metrics
This section describes the metrics for monitoring Keycloak's HTTP request processing.
Processing time
The processing time is exposed by these metrics, to monitor Keycloak's performance and how long it takes to process requests.
On a healthy cluster, the average processing time remains stable. Spikes or increases in the processing time may be an early sign that a node is under load.
Tags
- `outcome`: A more general outcome tag.
- `status`: The HTTP status code.
- `uri`: The requested URI.
| Metric | Description |
|---|---|
| `http_server_requests_seconds_count` | The total number of requests processed. |
| `http_server_requests_seconds_sum` | The total duration of all requests processed. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
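As a sketch, the average processing time per URI over the last five minutes can be computed in Prometheus, assuming Keycloak exposes the standard Quarkus metric names `http_server_requests_seconds_count` and `http_server_requests_seconds_sum`:

```
sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
/
sum by (uri) (rate(http_server_requests_seconds_count[5m]))
```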
Active requests
The current number of active requests is also available.
| Metric | Description |
|---|---|
| `http_server_active_requests` | The current number of active requests. |
Bandwidth
The metrics below help to monitor the bandwidth used by Keycloak, that is, the traffic consumed by the requests received and the responses sent.

| Metric | Description |
|---|---|
| `http_server_bytes_written_count` | The total number of responses sent. |
| `http_server_bytes_written_sum` | The total number of bytes sent. |
| `http_server_bytes_read_count` | The total number of requests received. |
| `http_server_bytes_read_sum` | The total number of bytes received. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
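A sketch of per-node traffic rates in Prometheus, again assuming the standard Quarkus metric names `http_server_bytes_written_sum` and `http_server_bytes_read_sum`:

```
# bytes sent per second
rate(http_server_bytes_written_sum[5m])

# bytes received per second
rate(http_server_bytes_read_sum[5m])
```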
Clustering and Network
Deploying multiple Keycloak nodes allows the load to be distributed amongst them, but this requires communication between the nodes. This section describes metrics that are useful for monitoring the communication between Keycloak nodes in order to identify possible faults.
Global tags
- `cluster=<name>`: The cluster name. If metrics from multiple clusters are being collected, this tag identifies which cluster they belong to.
- `node=<node>`: The name of the node reporting the metric.
Response Time
The following metrics expose the response time for remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics.
In a healthy cluster, the response time will remain stable throughout the cluster's lifecycle. An increase in response time may indicate a degraded cluster or a node under heavy load.
Tags
- `node=<node>`: Identifies the sender node.
- `target_node=<node>`: Identifies the receiver node.
| Metric | Description |
|---|---|
| | The number of synchronous requests to a receiver node. |
| | The total duration of synchronous requests to a receiver node. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
Bandwidth
These metrics collect all the bytes received and sent by Keycloak, including internal messages such as heartbeats. They allow computing the bandwidth currently used by each node.
The metric name depends on the JGroups transport protocol in use.
| Metric | Protocol | Description |
|---|---|---|
| | | The total number of bytes received by a node. |
| | | The total number of bytes sent by a node. |
Thread Pool
Monitoring the thread pool size is a good indicator of a node under heavy load. All requests received are added to the thread pool for processing; when the pool is full, requests are discarded, and a retransmission mechanism ensures reliable communication at the cost of increased resource usage.
In a healthy cluster, the thread pool should never come close to its maximum size (by default, 200 threads).
Thread pool metrics are not available with virtual threads.
The metric name depends on the JGroups transport protocol in use.
| Metric | Protocol | Description |
|---|---|---|
| | | The current number of threads in the thread pool. |
| | | The largest number of threads that have ever simultaneously been in the pool. |
Flow Control
Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.
The metrics below show the number of blocked messages and the average blocking time. A non-zero value may signal that a receiver is overloaded, which may degrade the cluster's performance.
Each node has two independent flow control protocols: `UFC` for unicast messages and `MFC` for multicast messages.
A healthy cluster shows a value of zero for all metrics.
| Metric | Description |
|---|---|
| | The number of times flow control blocked the sender for unicast messages. |
| | Average time blocked (in ms) in flow control when trying to send a unicast message. |
| | The number of times flow control blocked the sender for multicast messages. |
| | Average time blocked (in ms) in flow control when trying to send a multicast message. |
Retransmissions
JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle it, a retransmission is required. Retransmissions increase resource usage and are usually a signal of an overloaded system.
Random Early Drop (RED) monitors the sender queues. When a queue is almost full, messages are dropped and must be retransmitted; this prevents threads from being blocked by a full sender queue.
A healthy cluster shows a value of zero for all metrics.
| Metric | Description |
|---|---|
| | The number of retransmitted messages. |
| | The total number of messages dropped by the sender. |
| | The percentage of all messages that were dropped by the sender. |
Network Partitions
Cluster Size
The cluster size metric reports the number of nodes present in the cluster. If nodes report different values, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.
A healthy cluster shows the same value on all nodes.
| Metric | Description |
|---|---|
| | The number of nodes in the cluster. |
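A sketch of a Prometheus consistency check across nodes; the metric name `vendor_cluster_size` is an assumption here and may differ in your Keycloak version:

```
# non-zero when nodes disagree about the cluster size
max(vendor_cluster_size) - min(vendor_cluster_size)
```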
Network Partition Events
Network partitions in a cluster can happen for various reasons. These metrics do not help predict network splits, but they signal that one happened and that the cluster has since merged.
A healthy cluster shows a value of zero for this metric.
| Metric | Description |
|---|---|
| | The number of times a network split was detected and healed. |
Infinispan Caches
Keycloak caches data in embedded Infinispan caches. The metrics in this section help to monitor the health of the caches and of the cluster replication.
Global tags
- `cache=<name>`: The cache name.
Size
Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.
Sum the unique entry size metric across all nodes to get the total number of entries in the cluster.
| Metric | Description |
|---|---|
| `vendor_statistics_approximate_entries` | The approximate number of entries stored by the node, including backup copies. |
| `vendor_statistics_approximate_entries_unique` | The approximate number of entries stored by the node, excluding backup copies. |
Data Access
The following metrics monitor the cache accesses, such as reads, writes, and their duration.
Stores
A store operation is a write operation that adds or updates a value stored in the cache.
| Metric | Description |
|---|---|
| `vendor_statistics_store_times_seconds_count` | The total number of store requests. |
| `vendor_statistics_store_times_seconds_sum` | The total duration of all store requests. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
Reads
A read operation reads a value from the cache. It divides into two groups: a hit, if a value is found, and a miss, if it is not.
| Metric | Description |
|---|---|
| `vendor_statistics_hit_times_seconds_count` | The total number of read hit requests. |
| `vendor_statistics_hit_times_seconds_sum` | The total duration of all read hit requests. |
| `vendor_statistics_miss_times_seconds_count` | The total number of read miss requests. |
| `vendor_statistics_miss_times_seconds_sum` | The total duration of all read miss requests. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
Removes
A remove operation removes a value from the cache. It divides into two groups: a hit, if the value exists, and a miss, if it does not.
| Metric | Description |
|---|---|
| `vendor_statistics_remove_hit_times_seconds_count` | The total number of remove hit requests. |
| `vendor_statistics_remove_hit_times_seconds_sum` | The total duration of all remove hit requests. |
| `vendor_statistics_remove_miss_times_seconds_count` | The total number of remove miss requests. |
| `vendor_statistics_remove_miss_times_seconds_sum` | The total duration of all remove miss requests. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
Hit Ratio for read and remove operations
An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:
```
vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count)
```
Read/Write ratio
An expression can be used to compute the read-write ratio for a cache, using the metrics above:
```
(vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)
```
Eviction
Eviction is the process of limiting the cache size; when the cache is full, an entry is removed to make room for a new one.
As Keycloak caches database entities in the `users`, `realms`, and `authorization` caches, once these caches are full, database access is always accompanied by an eviction event.
| Metric | Description |
|---|---|
| `vendor_statistics_evictions` | The total number of eviction events. |
Eviction rate
A rapid increase of evictions together with very high database CPU usage means the `users` or `realms` cache is too small for smooth Keycloak operation, as data needs to be reloaded from the database very often, which slows down responses.
If enough memory is available, consider increasing the maximum cache size using the CLI options `cache-embedded-users-max-count` or `cache-embedded-realms-max-count`.
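A sketch of setting these options through the Keycloak CR `additionalOptions`, mirroring the earlier Operator example; the values are illustrative only and should be sized for your workload:

```yaml
spec:
  additionalOptions:
    - name: cache-embedded-users-max-count
      value: '100000'
    - name: cache-embedded-realms-max-count
      value: '10000'
```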
Locking
Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.
On a healthy cluster, the number of locks held should remain constant; deadlocks may create temporary spikes.
| Metric | Description |
|---|---|
| `vendor_lock_manager_number_of_locks_held` | The number of locks currently held by this node. |
Transactions
Transactional caches use the One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation durations.
The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
| Metric | Description |
|---|---|
| | The total number of prepare requests. |
| | The total duration of all prepare requests. |
| | The total number of rollback requests. |
| | The total duration of all rollback requests. |
| | The total number of commit requests. |
| | The total duration of all commit requests. |

When histograms are enabled, percentile buckets are also available. They are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.
State Transfer
State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.
This operation increases resource usage and will negatively affect the overall performance.
| Metric | Description |
|---|---|
| | The number of in-flight transactional segments the local node requested from other nodes. |
| | The number of in-flight segments the local node requested from other nodes. |
Cluster Data Replication
Cluster data replication can be the main source of failure. These metrics report not only the response time, i.e. the time it takes to replicate an update, but also the failures.
On a healthy cluster, the average replication time will be stable or show little variance. The number of failures should not increase.
| Metric | Description |
|---|---|
| `vendor_rpc_manager_replication_count` | The total number of successful replications. |
| `vendor_rpc_manager_replication_failures` | The total number of failed replications. |
| `vendor_rpc_manager_average_replication_time` | The average time spent, in milliseconds, replicating data in the cluster. |
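For example, the rate of failed replications can be watched in Prometheus with the expression below, which uses the same metric names as the success-ratio expression in this section; any sustained non-zero value deserves investigation:

```
rate(vendor_rpc_manager_replication_failures[5m])
```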
Success ratio
An expression can be used to compute the replication success ratio:
```
vendor_rpc_manager_replication_count
/
(vendor_rpc_manager_replication_count + vendor_rpc_manager_replication_failures)
```