External Infinispan deployment

This document describes the external Infinispan metrics that you can use to monitor your deployment's performance.

Enabled Infinispan server metrics

Infinispan exposes metrics at the /metrics endpoint. They are enabled by default. We recommend enabling the name-as-tags attribute, as it makes the metric names independent of the cache names.

To configure metrics in the Infinispan server, enable them as shown in the XML below.

infinispan.xml
<infinispan>
    <cache-container statistics="true">
        <metrics gauges="true" histograms="false" name-as-tags="true" />
    </cache-container>
</infinispan>

When using the Infinispan Operator in Kubernetes, metrics can be enabled through a ConfigMap with a custom configuration. An example is shown below.

ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  infinispan-config.yaml: >
    infinispan:
      cacheContainer:
        metrics:
          gauges: true
          namesAsTags: true
          histograms: false
infinispan.yaml CR
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
  annotations:
    infinispan.org/monitoring: 'true' (1)
spec:
  configMapName: "cluster-config" (2)
1 Enables monitoring for the deployment.
2 Sets the ConfigMap name with the custom configuration.

Additional information can be found in the Infinispan documentation and the Infinispan Operator documentation.

Clustering and Network

This section describes metrics that are useful for monitoring the communication between Infinispan nodes to identify possible network issues.

Global tags

  • cluster=<name>: The cluster name. If metrics from multiple clusters are being collected, this tag identifies which cluster they belong to.

  • node=<node>: The name of the node reporting the metric.

Response Time

The following metrics expose the response time for remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable throughout the cluster lifecycle.

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

  • node=<node>: Identifies the sender node.

  • target_node=<node>: Identifies the receiver node.

Metric Description

vendor_jgroups_stats_sync_requests_seconds_count

The number of synchronous requests to a receiver node.

vendor_jgroups_stats_sync_requests_seconds_sum

The total duration of synchronous requests to a receiver node.
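With these two counters scraped into Prometheus, the average response time per node pair can be derived as a ratio of rates. This is a sketch; the 5m window is an arbitrary choice:

```promql
# Average synchronous request latency (seconds) between sender and receiver
sum by (node, target_node) (rate(vendor_jgroups_stats_sync_requests_seconds_sum[5m]))
/
sum by (node, target_node) (rate(vendor_jgroups_stats_sync_requests_seconds_count[5m]))
```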

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.

Bandwidth

These metrics collect all the bytes received and sent by Infinispan, including internal messages such as heartbeats. They allow you to compute the bandwidth currently used by each node.

The metric name depends on the JGroups transport protocol in use.
Metric Protocol Description

vendor_jgroups_tcp_get_num_bytes_received

TCP

The total number of bytes received by a node.

vendor_jgroups_udp_get_num_bytes_received

UDP

vendor_jgroups_tunnel_get_num_bytes_received

TUNNEL

vendor_jgroups_tcp_get_num_bytes_sent

TCP

The total number of bytes sent by a node.

vendor_jgroups_udp_get_num_bytes_sent

UDP

vendor_jgroups_tunnel_get_num_bytes_sent

TUNNEL
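Since these metrics are counters, the current bandwidth can be derived in Prometheus with rate(). A sketch for the TCP transport; substitute udp or tunnel as appropriate, and note that the 1m window is an arbitrary choice:

```promql
# Bytes received per second, per node (TCP transport)
rate(vendor_jgroups_tcp_get_num_bytes_received[1m])

# Bytes sent per second, per node (TCP transport)
rate(vendor_jgroups_tcp_get_num_bytes_sent[1m])
```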

Thread Pool

Monitoring the thread pool size is a good way to detect that a node is under heavy load. All received requests are added to the thread pool for processing and, when the pool is full, requests are discarded. A retransmission mechanism then ensures reliable communication, at the cost of increased resource usage.

In a healthy cluster, the thread pool should never come close to its maximum size (by default, 200 threads).
Thread pool metrics are not available with virtual threads.
The metric name depends on the JGroups transport protocol in use.
Metric Protocol Description

vendor_jgroups_tcp_get_thread_pool_size

TCP

Current number of threads in the thread pool.

vendor_jgroups_udp_get_thread_pool_size

UDP

vendor_jgroups_tunnel_get_thread_pool_size

TUNNEL

vendor_jgroups_tcp_get_largest_size

TCP

The largest number of threads that have ever simultaneously been in the pool.

vendor_jgroups_udp_get_largest_size

UDP

vendor_jgroups_tunnel_get_largest_size

TUNNEL
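A Prometheus alerting expression can flag a pool approaching its maximum. This is a sketch for the TCP transport; the threshold of 150 is an arbitrary assumption, to be adjusted against your configured maximum (200 threads by default):

```promql
# Fires when the thread pool is above ~75% of the default 200-thread maximum
vendor_jgroups_tcp_get_thread_pool_size > 150
```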

Flow Control

Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.

The metrics below show the number of blocked messages and the average blocking time. When a value is different from zero, it may signal that a receiver is overloaded and may degrade the cluster performance.

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

A healthy cluster shows a value of zero for all metrics.
Metric Description

vendor_jgroups_ufc_get_number_of_blockings

The number of times flow control blocks the sender for unicast messages.

vendor_jgroups_ufc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a unicast message.

vendor_jgroups_mfc_get_number_of_blockings

The number of times flow control blocks the sender for multicast messages.

vendor_jgroups_mfc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a multicast message.
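Since blockings should stay at zero on a healthy cluster, any increase is worth surfacing. A sketch in Prometheus, with an arbitrary 5m window:

```promql
# Senders currently being blocked by flow control (unicast or multicast)
rate(vendor_jgroups_ufc_get_number_of_blockings[5m]) > 0
or
rate(vendor_jgroups_mfc_get_number_of_blockings[5m]) > 0
```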

Retransmissions

JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle it, a retransmission is required. Retransmissions increase resource usage and are usually a signal of an overloaded system.

Random Early Drop (RED) monitors the sender queues. When a queue is almost full, messages are dropped and must be retransmitted. This prevents threads from being blocked by a full sender queue.

A healthy cluster shows a value of zero for all metrics.
Metric Description

vendor_jgroups_unicast3_get_num_xmits

The number of retransmitted messages.

vendor_jgroups_red_get_dropped_messages

The total number of dropped messages by the sender.

vendor_jgroups_red_get_drop_rate

Percentage of all messages that were dropped by the sender.
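As with flow control, these counters should not grow on a healthy cluster, so an increase over a recent window is a useful signal. Two sketch queries, with an arbitrary 5m window:

```promql
# Retransmissions over the last 5 minutes
increase(vendor_jgroups_unicast3_get_num_xmits[5m])

# Sender-side drops over the last 5 minutes
increase(vendor_jgroups_red_get_dropped_messages[5m])
```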

Network Partitions

Cluster Size

The cluster size metric reports the number of nodes present in the cluster. If the value differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.

A healthy cluster shows the same value in all nodes.
Metric Description

vendor_cluster_size

The number of nodes in the cluster.
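A disagreement between nodes can be detected by comparing the reported sizes across the cluster. A sketch, assuming all nodes share the same cluster tag:

```promql
# Non-zero when nodes disagree on the cluster size (possible partition)
max by (cluster) (vendor_cluster_size) - min by (cluster) (vendor_cluster_size)
```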

Cross-Site Status

The cross-site status reports the connection status to the other site. It returns a value of 1 if the site is online or 0 if it is offline. A value of 2 is reported by nodes where the status is unknown; not all nodes establish connections to the remote sites, so they do not have this information.

A healthy cluster shows a value greater than zero.
Metric Description

vendor_jgroups_site_view_status

The single site status (1 if online).

Tags

  • site=<name>: The name of the destination site.
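In Prometheus, offline links can be surfaced by filtering on the status value, which also ignores nodes that report 2 (unknown). A sketch:

```promql
# Cross-site links currently reported as offline (status 0)
vendor_jgroups_site_view_status == 0
```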

Network Partition Events

Network partitions in a cluster can happen for various reasons. This metric does not help predict network splits but signals that one happened and that the cluster has since merged.

A healthy cluster shows a value of zero for this metric.
Metric Description

vendor_jgroups_merge3_get_num_merge_events

The number of times a network split was detected and healed.
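Because this counter should stay at zero, any recent increase indicates a split that was detected and healed. A sketch, with an arbitrary 1h window:

```promql
# Merge events (healed network splits) over the last hour
increase(vendor_jgroups_merge3_get_num_merge_events[1h]) > 0
```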

Infinispan Caches

The metrics in this section help you monitor the health of Infinispan caches and cluster replication.

Global tags

  • cache=<name>: The cache name.

Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.

Sum the unique entries metric across all nodes to get the total number of entries in the cluster.
Metric Description

vendor_statistics_approximate_entries

The approximate number of entries stored by the node, including backup copies.

vendor_statistics_approximate_entries_unique

The approximate number of entries stored by the node, excluding backup copies.
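Following the note above, the cluster-wide total for a given cache can be computed by summing the unique-entries gauge across nodes. A sketch; the cache name mycache is a placeholder:

```promql
# Total number of entries in the cluster for one cache
sum(vendor_statistics_approximate_entries_unique{cache="mycache"})
```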

Data Access

The following metrics monitor cache accesses, such as reads and writes, and their duration.

Stores

A store operation is a write operation that inserts or updates a value in the cache.

Metric Description

vendor_statistics_store_times_seconds_count

The total number of store requests.

vendor_statistics_store_times_seconds_sum

The total duration of all store requests.

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.

Reads

A read operation reads a value from the cache. Reads divide into two groups: a hit if a value is found, and a miss if it is not.

Metric Description

vendor_statistics_hit_times_seconds_count

The total number of read hit requests.

vendor_statistics_hit_times_seconds_sum

The total duration of all read hit requests.

vendor_statistics_miss_times_seconds_count

The total number of read miss requests.

vendor_statistics_miss_times_seconds_sum

The total duration of all read miss requests.

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.

Removes

A remove operation removes a value from the cache. Removes divide into two groups: a hit if the value exists, and a miss if it does not.

Metric Description

vendor_statistics_remove_hit_times_seconds_count

The total number of remove hit requests.

vendor_statistics_remove_hit_times_seconds_sum

The total duration of all remove hit requests.

vendor_statistics_remove_miss_times_seconds_count

The total number of remove miss requests.

vendor_statistics_remove_miss_times_seconds_sum

The total duration of all remove miss requests.

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count/(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count)
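The counters above grow for the lifetime of the node, so this ratio reflects the all-time hit ratio. A windowed variant, assuming Prometheus and an arbitrary 5m window, reflects recent behaviour instead:

```promql
# Read hit ratio over the last 5 minutes
rate(vendor_statistics_hit_times_seconds_count[5m])
/
(rate(vendor_statistics_hit_times_seconds_count[5m]) + rate(vendor_statistics_miss_times_seconds_count[5m]))
```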

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count)/(vendor_statistics_hit_times_seconds_count+vendor_statistics_miss_times_seconds_count+vendor_statistics_remove_hit_times_seconds_count+vendor_statistics_remove_miss_times_seconds_count+vendor_statistics_store_times_seconds_count)

Locking

Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.

On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.
Metric Description

vendor_lock_manager_number_of_locks_held

The number of locks currently being held by this node.

Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
Metric Description

vendor_transactions_prepare_times_seconds_count

The total number of prepare requests.

vendor_transactions_prepare_times_seconds_sum

The total duration of all prepare requests.

vendor_transactions_rollback_times_seconds_count

The total number of rollback requests.

vendor_transactions_rollback_times_seconds_sum

The total duration of all rollback requests.

vendor_transactions_commit_times_seconds_count

The total number of commit requests.

vendor_transactions_commit_times_seconds_sum

The total duration of all commit requests.
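Since rollbacks should stay near zero, the rollback share of completed transactions is a useful derived signal. A sketch, with an arbitrary 5m window:

```promql
# Fraction of transactions ending in rollback over the last 5 minutes
rate(vendor_transactions_rollback_times_seconds_count[5m])
/
(rate(vendor_transactions_commit_times_seconds_count[5m]) + rate(vendor_transactions_rollback_times_seconds_count[5m]))
```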

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.

State Transfer

State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.

This operation increases resource usage and negatively affects overall performance.

Metric Description

vendor_state_transfer_manager_inflight_transactional_segment_count

The number of in-flight transactional segments the local node requested from other nodes.

vendor_state_transfer_manager_inflight_segment_transfer_count

The number of in-flight segments the local node requested from other nodes.

Cluster Data Replication

Cluster data replication can be a major source of failures. These metrics report not only the response time, that is, the time it takes to replicate an update, but also the failures.

On a healthy cluster, the average replication time remains stable, with little variance. The number of failures should not increase.
Metric Description

vendor_rpc_manager_replication_count

The total number of successful replications.

vendor_rpc_manager_replication_failures

The total number of failed replications.

vendor_rpc_manager_average_replication_time

The average time spent, in milliseconds, replicating data in the cluster.

Success ratio

An expression can be used to compute the replication success ratio:

(vendor_rpc_manager_replication_count)/(vendor_rpc_manager_replication_count+vendor_rpc_manager_replication_failures)

Cross Site Data Replication

Like cluster data replication, the metrics in this section measure the time it takes to replicate data to the other sites.

On a healthy cluster, the average cross-site replication time remains stable, with little variance.

Tags

  • site=<name>: indicates the receiving site.

Metric Description

vendor_rpc_manager_cross_site_replication_times_seconds_count

The total number of cross-site requests.

vendor_rpc_manager_cross_site_replication_times_seconds_sum

The total duration of all cross-site requests.

vendor_rpc_manager_replication_times_to_site_seconds_count

The total number of cross-site requests. This metric is more detailed with a per-site counter.

vendor_rpc_manager_replication_times_to_site_seconds_sum

The total duration of all cross-site requests. This metric is more detailed with a per-site duration.

vendor_rpc_manager_number_xsite_requests_received_from_site

The total number of cross-site requests handled by this node. This metric is more detailed with a per-site counter.

vendor_x_site_admin_status

The site status. A value of 1 indicates that the site is online. This value reacts to the Infinispan CLI commands bring-online and take-offline.
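The per-site average replication time can be derived from the per-site counters above. A sketch, with an arbitrary 5m window:

```promql
# Average cross-site replication time (seconds), per destination site
sum by (site) (rate(vendor_rpc_manager_replication_times_to_site_seconds_sum[5m]))
/
sum by (site) (rate(vendor_rpc_manager_replication_times_to_site_seconds_count[5m]))
```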

When histograms are enabled, the percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment's performance.