The group communication engine for Group Replication (XCom, a Paxos variant) includes a cache for messages (and their metadata) exchanged between the group members as part of the consensus protocol. Among other functions, the message cache is used for recovery by members that return to the group after a period during which they were unable to communicate with the other group members.
From MySQL 8.0.16, a cache size limit can be set for XCom's
message cache using the
group_replication_message_cache_size
system variable. This system variable has a default and minimum
setting of 1 GB, which is the size of the message cache in MySQL
Server versions prior to MySQL 8.0.16. If the cache size limit is
reached, XCom removes the oldest entries that have been decided
and delivered. Ensure that sufficient memory is available on your
system for your chosen cache size limit, considering the size of
MySQL Server's other caches and object pools.
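As a minimal sketch of changing the limit, the following statements read the current value and then raise it. The value is specified in bytes, and the 2 GB figure shown here is an illustrative assumption, not a recommendation. Depending on the server release, a new value might take effect only after Group Replication is restarted on the member, so check the system variable reference for your version.

```sql
-- Show the current cache size limit, in bytes.
SELECT @@GLOBAL.group_replication_message_cache_size;

-- Raise the limit to 2 GB (2 * 1024 * 1024 * 1024 bytes).
-- The figure is an example only; choose a value that fits the memory
-- available after MySQL Server's other caches and object pools.
-- Use SET PERSIST instead of SET GLOBAL to keep the setting across restarts.
SET GLOBAL group_replication_message_cache_size = 2147483648;
```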
If an unreachable member that is attempting to reconnect requires
a message for recovery, but the message has already been removed
from the message cache, the member cannot reconnect. This
situation is more likely to occur if you have used the
group_replication_member_expel_timeout
system variable (introduced in MySQL 8.0.13) to specify an
additional delay time before suspect members are expelled from a
group. Group Replication's Group Communication System (GCS) logs a
warning message when a message that is likely to be needed for
recovery by a currently unreachable member is removed from the
message cache. This warning message is logged on
all the active group members (only once for each unreachable
member). Although the group members cannot know for certain which
message was the last one seen by the unreachable member, the
warning message indicates that the cache size might not be
sufficient to support your chosen waiting period before a member
is expelled. In this situation, consider increasing the cache size
limit with reference to the expected volume of messages in the
time period specified by the
group_replication_member_expel_timeout
system variable, so that the cache contains all the missed
messages required for members to return successfully. You can also
consider increasing the cache size limit temporarily if you expect
a member to become unreachable for an unusually long period of time.
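The sizing reasoning above can be sketched as a rough calculation: multiply the expected message volume per second by the expel timeout and compare the result with the current limit. The user variable @assumed_bytes_per_second and its value are hypothetical placeholders; replace the value with a rate measured on your own group. The estimate ignores the initial detection period that precedes the additional delay, so treat the result as a lower bound.

```sql
-- Hypothetical figure: average volume of group messages, in bytes per second.
-- This is an assumption for illustration; measure it on your own workload.
SET @assumed_bytes_per_second = 20 * 1024 * 1024;

-- Compare the estimated volume of messages missed during the expel
-- timeout with the configured cache size limit.
SELECT
  @@GLOBAL.group_replication_member_expel_timeout AS expel_timeout_seconds,
  @@GLOBAL.group_replication_message_cache_size AS current_limit_bytes,
  @assumed_bytes_per_second * @@GLOBAL.group_replication_member_expel_timeout
    AS estimated_bytes_needed;
```

If estimated_bytes_needed approaches or exceeds current_limit_bytes, a member that stays unreachable for the whole waiting period might not find the messages it needs in the cache.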
If you are considering reducing the cache size limit, you can
query the Performance Schema table
memory_summary_global_by_event_name
using the following statement:
SELECT * FROM performance_schema.memory_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'memory/group_rpl/GCS_XCom::xcom_cache';
This statement returns memory usage statistics for the message cache, including the current number of cached entries and the current size of the cache. If you reduce the cache size limit, XCom removes the oldest entries that have been decided and delivered until the current size is below the limit. XCom might temporarily exceed the cache size limit while this removal process is ongoing.
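To judge how much headroom the cache has before you reduce the limit, a narrower variant of the query can place the current and peak usage next to the configured limit. This is a sketch only; the column aliases are illustrative names chosen here, not part of the Performance Schema.

```sql
-- Current and peak memory use of the XCom message cache,
-- shown alongside the configured limit (all sizes in bytes).
SELECT
  CURRENT_COUNT_USED AS cached_entries,
  CURRENT_NUMBER_OF_BYTES_USED AS cache_bytes,
  HIGH_NUMBER_OF_BYTES_USED AS peak_cache_bytes,
  @@GLOBAL.group_replication_message_cache_size AS limit_bytes
FROM performance_schema.memory_summary_global_by_event_name
WHERE EVENT_NAME = 'memory/group_rpl/GCS_XCom::xcom_cache';
```

If peak_cache_bytes stays well below limit_bytes over a representative period, a lower limit is less likely to cause messages needed for recovery to be removed.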