Clustering metrics

Metrics

Deploying multiple Keycloak nodes allows the load to be distributed amongst them, but this requires communication between the nodes. This section describes metrics that are useful for monitoring the communication between Keycloak nodes in order to identify possible faults.

This is relevant only for single-cluster deployments. When multiple clusters are used, as described in Multi-cluster deployments, Keycloak nodes are not clustered together and therefore do not communicate with each other directly.

Global tags

cluster=<name>: The cluster name. If metrics from multiple clusters are being collected, this tag helps identify which cluster they belong to.
node=<node>: The name of the node reporting the metric.

All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.

Response Time

The following metrics expose the response time for remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics.

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

node=<node>: It identifies the sender node.
target_node=<node>: It identifies the receiver node.

Metric	Description
`vendor_jgroups_stats_sync_requests_seconds_count`	The number of synchronous requests to a receiver node.
`vendor_jgroups_stats_sync_requests_seconds_sum`	The total duration of synchronous requests to a receiver node.

When histograms are enabled, the percentile buckets are also available. They are useful for creating heat maps, but collecting and exposing them may have a negative impact on deployment performance.
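For ad hoc troubleshooting, the count and sum series can be combined into a mean response time by dividing the increase in the sum by the increase in the count between two scrapes. The following Python sketch assumes the values have already been read for a single `node`/`target_node` pair. The function name and the sample numbers are hypothetical and not part of Keycloak.

```python
def average_response_time(prev, curr):
    """Return the mean seconds per synchronous request between two scrapes.

    prev and curr are (count, sum_seconds) pairs taken from
    vendor_jgroups_stats_sync_requests_seconds_count and
    vendor_jgroups_stats_sync_requests_seconds_sum.
    """
    delta_count = curr[0] - prev[0]
    delta_sum = curr[1] - prev[1]
    if delta_count <= 0:
        # No new requests in the window, or the counters were reset.
        return None
    return delta_sum / delta_count


# Hypothetical sample values, for illustration only.
earlier = (1000, 2.5)
later = (1500, 4.0)
print(average_response_time(earlier, later))
```

Comparing this value across scrape windows shows whether the response time stays stable or drifts upward.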

Bandwidth

These metrics count all bytes received and sent by Keycloak nodes, including internal messages such as heartbeats. They allow computing the bandwidth currently used by each node.

The metric name depends on the JGroups transport protocol in use.

Metric	Protocol	Description
`vendor_jgroups_tcp_get_num_bytes_received`	`TCP`	The total number of bytes received by a node.
`vendor_jgroups_udp_get_num_bytes_received`	`UDP`	The total number of bytes received by a node.
`vendor_jgroups_tunnel_get_num_bytes_received`	`TUNNEL`	The total number of bytes received by a node.
`vendor_jgroups_tcp_get_num_bytes_sent`	`TCP`	The total number of bytes sent by a node.
`vendor_jgroups_udp_get_num_bytes_sent`	`UDP`	The total number of bytes sent by a node.
`vendor_jgroups_tunnel_get_num_bytes_sent`	`TUNNEL`	The total number of bytes sent by a node.
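Because these metrics are running totals, the bandwidth in use is the difference between two samples divided by the time between them. The sketch below illustrates that arithmetic in Python. The function name, the 60-second interval, and the byte values are assumptions made for the example.

```python
def bandwidth_bytes_per_second(prev_bytes, curr_bytes, interval_seconds):
    """Estimate bandwidth from two samples of a byte counter.

    prev_bytes and curr_bytes are two readings of the same metric, for
    example vendor_jgroups_tcp_get_num_bytes_sent, taken
    interval_seconds apart on the same node.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The total went backwards, for example after a node restart.
        return None
    return delta / interval_seconds


# Hypothetical readings taken 60 seconds apart.
print(bandwidth_bytes_per_second(1_000_000, 1_600_000, 60))
```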

Thread Pool

The thread pool size is a good indicator of whether a node is under heavy load. Every received request is added to the thread pool for processing and, when the pool is full, the request is discarded. A retransmission mechanism keeps communication reliable at the cost of increased resource usage.

In a healthy cluster, the thread pool size should never come close to its maximum (by default, 200 threads).

Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21 and later.

The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.

Metric	Protocol	Description
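`vendor_jgroups_tcp_get_thread_pool_size`	`TCP`	Current number of threads in the thread pool.
`vendor_jgroups_udp_get_thread_pool_size`	`UDP`	Current number of threads in the thread pool.
`vendor_jgroups_tunnel_get_thread_pool_size`	`TUNNEL`	Current number of threads in the thread pool.
`vendor_jgroups_tcp_get_largest_size`	`TCP`	The largest number of threads that have ever simultaneously been in the pool.
`vendor_jgroups_udp_get_largest_size`	`UDP`	The largest number of threads that have ever simultaneously been in the pool.
`vendor_jgroups_tunnel_get_largest_size`	`TUNNEL`	The largest number of threads that have ever simultaneously been in the pool.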
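One way to judge how close the pool is to its limit is to express the reported size as a fraction of the maximum. The Python sketch below uses the default maximum of 200 threads mentioned above. The 80% threshold and the function names are arbitrary assumptions for illustration, not Keycloak recommendations.

```python
DEFAULT_MAX_THREADS = 200  # default maximum pool size stated in this section


def thread_pool_utilization(pool_size, max_threads=DEFAULT_MAX_THREADS):
    """Return the pool size as a fraction of the maximum size."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return pool_size / max_threads


def is_near_capacity(pool_size, max_threads=DEFAULT_MAX_THREADS, threshold=0.8):
    """Return True when utilization reaches the (assumed) threshold."""
    return thread_pool_utilization(pool_size, max_threads) >= threshold


# Hypothetical reading of vendor_jgroups_tcp_get_thread_pool_size.
print(thread_pool_utilization(50))
print(is_near_capacity(170))
```

The same check can be applied to the largest-size metric to see whether the pool ever approached its maximum in the past.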

Flow Control

Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.
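The credit mechanism described above can be illustrated with a toy model. This Python sketch is a simplification and not the JGroups implementation: it is single-threaded, it returns `False` instead of actually blocking, and the class name, method names, and credit amounts are all hypothetical.

```python
class CreditSender:
    """Toy model of a sender under credit-based flow control."""

    def __init__(self, max_credits):
        self.max_credits = max_credits
        self.credits = max_credits
        self.blockings = 0  # how often a send had to wait for credits

    def try_send(self, size):
        """Spend credits for a message; return False if the sender would block."""
        if self.credits - size < 0:
            self.blockings += 1
            return False
        self.credits -= size
        return True

    def replenish(self, amount):
        """Apply a replenishment message from a receiver."""
        self.credits = min(self.max_credits, self.credits + amount)


sender = CreditSender(max_credits=1000)
print(sender.try_send(400))  # enough credits
print(sender.try_send(400))  # enough credits
print(sender.try_send(400))  # would drop below zero, so the sender blocks
sender.replenish(800)
print(sender.try_send(400))  # credits restored, sending resumes
print(sender.blockings)
```

In this model, the `blockings` counter plays the role that the number-of-blockings metrics play for a real node.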

The metrics below show the number of times a sender was blocked and the average blocking time. A non-zero value may signal that a receiver is overloaded, which may degrade cluster performance.

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

A healthy cluster shows a value of zero for all metrics.

Metric	Description
`vendor_jgroups_ufc_get_number_of_blockings`	The number of times flow control blocks the sender for unicast messages.
`vendor_jgroups_ufc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send a unicast message.
`vendor_jgroups_mfc_get_number_of_blockings`	The number of times flow control blocks the sender for multicast messages.
`vendor_jgroups_mfc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send a multicast message.

Retransmissions

JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle the message, a retransmission is required. JGroups delays the acknowledgment of delivered messages to reduce the number of acknowledgment messages in transit. This choice has a side effect: the sender may eagerly retransmit messages if the acknowledgments are delayed for too long. For this reason, an increase in the retransmission metric value does not necessarily indicate an unhealthy cluster.

Random Early Drop (RED) monitors the sender queues. When a queue is almost full, messages are dropped and must be retransmitted. This prevents threads from being blocked by a full sender queue.

A healthy cluster typically shows a value of zero or very low values for all metrics. Occasional retransmissions may occur due to timing issues between the sender and the receiver.

Metric	Description
`vendor_jgroups_unicast3_get_num_xmits`	The number of retransmitted messages.
`vendor_jgroups_red_get_dropped_messages`	The total number of messages dropped by the sender.
`vendor_jgroups_red_get_drop_rate`	Percentage of all messages that were dropped by the sender.

Network Partitions

Cluster Size

The cluster size metric reports the number of nodes present in the cluster. If the value differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.

A healthy cluster shows the same value in all nodes.

Metric	Description
`vendor_cluster_size`	The number of nodes in the cluster.
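Checking that every node reports the same cluster size can be done by comparing the values collected per node. The Python sketch below assumes the readings are already available as a mapping from node name to reported size. The function names and node names are hypothetical.

```python
from collections import defaultdict


def sizes_agree(size_by_node):
    """Return True when every node reports the same cluster size."""
    return len(set(size_by_node.values())) <= 1


def group_nodes_by_cluster_size(size_by_node):
    """Group node names by the cluster size each node reports."""
    groups = defaultdict(list)
    for node, size in size_by_node.items():
        groups[size].append(node)
    return dict(groups)


# Hypothetical readings of vendor_cluster_size: node-c sees only itself.
readings = {"node-a": 2, "node-b": 2, "node-c": 1}
print(sizes_agree(readings))
print(group_nodes_by_cluster_size(readings))
```

When the sizes disagree, grouping the nodes by reported value gives a first hint about which nodes can still see each other.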

Network Partition Events

Network partitions in a cluster can happen for various reasons. This metric does not help predict network splits, but it signals that a split happened and that the cluster has since been merged.

A healthy cluster shows a value of zero for this metric.

Metric	Description
`vendor_jgroups_merge3_get_num_merge_events`	The number of times a network split was detected and healed.
