Embedded Infinispan metrics for multiple sites deployments

Learn about metrics monitoring caching health

Prerequisites

  • Metrics need to be enabled for Keycloak. Follow the Enabling Keycloak Metrics guide for more details.

  • A monitoring system collecting the metrics.

Metrics

Global tags

cache=<name>

The cache name.

Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Sum the unique entry size metric to get a cluster total number of entries.
Metric Description

vendor_statistics_approximate_entries

The approximate number of entries stored by the node, including backup copies.

vendor_statistics_approximate_entries_unique

The approximate number of entries stored by the node, excluding backup copies.

Data Access

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

Stores

A store operation is a write operation that writes or updates a value stored in the cache.

Metric Description

vendor_statistics_store_times_seconds_count

The total number of store requests.

vendor_statistics_store_times_seconds_sum

The total duration of all store requests.

When histogram is enabled, the percentile buckets are available. Those are useful to create heat maps but, collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Reads

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.

Metric Description

vendor_statistics_hit_times_seconds_count

The total number of read hits requests.

vendor_statistics_hit_times_seconds_sum

The total duration of all read hits requests.

vendor_statistics_miss_times_seconds_count

The total number of read misses requests.

vendor_statistics_miss_times_seconds_sum

The total duration of all read misses requests.

When histogram is enabled, the percentile buckets are available. Those are useful to create heat maps but, collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Removes

A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.

Metric Description

vendor_statistics_remove_hit_times_seconds_count

The total number of remove hits requests.

vendor_statistics_remove_hit_times_seconds_sum

The total duration of all remove hits requests.

vendor_statistics_remove_miss_times_seconds_count

The total number of remove misses requests.

vendor_statistics_remove_miss_times_seconds_sum

The total duration of all remove misses requests.

When histogram is enabled, the percentile buckets are available. Those are useful to create heat maps but, collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

For users and realms cache, the database invalidation translates into a remove operation. These metrics are a good indicator of how frequent the database entities are modified and therefore removed from the cache.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)

Eviction

Eviction is the process to limit the cache size and, when full, an entry is removed to make room for a new entry to be cached. As Keycloak caches the database entities in the users, realms and authorization, database access always proceeds with an eviction event.

Metric Description

vendor_statistics_evictions

The total number of eviction events.

Eviction rate

A rapid increase of eviction and very high database CPU usage means the users or realms cache is too small for smooth Keycloak operation, as data needs to be re-loaded very often from the database which slows down responses. If enough memory is available, consider increasing the max cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count

Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

The PESSMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
Metric Description

vendor_transactions_prepare_times_seconds_count

The total number of prepare requests.

vendor_transactions_prepare_times_seconds_sum

The total duration of all prepare requests.

vendor_transactions_rollback_times_seconds_count

The total number of rollback requests.

vendor_transactions_rollback_times_seconds_sum

The total duration of all rollback requests.

vendor_transactions_commit_times_seconds_count

The total number of commit requests.

vendor_transactions_commit_times_seconds_sum

The total duration of all commit requests.

When histogram is enabled, the percentile buckets are available. Those are useful to create heat maps but, collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
On this page