Linux-HA Architecture (Release 2)
Release 2 has a number of components which implement its high-availability cluster management capabilities. This page provides an overview of the architecuture into which these components fit.
The components are as follows:
heartbeat - strongly authenticated communications module
CRM - Cluster Resource Manager
CCM - Strongly Connected Consensus Cluster Membership
LRM - Local Resource Manager
Stonith Daemon - provides node reset services
logd - non-blocking logging daemon
apphbd - application level watchdog timer service
Recovery Manager - application recovery service
- Infrastructure
CTS - Cluster Testing System - cluster stress tests
Special Glib notes
Architecture Diagram
heartbeat - strongly authenticated communication module
Function: startup, shutdown, strongly authenticated communication.
Provides locally-ordered multicast messaging over basically any media - IP-based or not.
Currently talks over the following media plugin types:
unicast UDP ipv4
broadcast UDP ipv4
multicast UDP ipv4
serial links (non-IP) - great for use with firewalls, etc. that play with iptables.
openais - uses OpenAIS' evs communication layer as a communication medium.
- Psuedo-membership communication services:
ping - pings an individual router - allows us to treat it as a pseudo-member.
ping_group - similar to ping except if any in group is up, then the group is up
hbaping - "pings" a fiber channel disk for connectivity
Heartbeat can detect node failure reliably in less than a half-second. With the low-latency patches, (and maybe a bug fix or so), that time could be lowered significantly.
It will register with the system watchdog timer if configured to do so.
The heartbeat layer has an API which provides the following classes of services:
- intra-cluster communication - sending and receiving packets to cluster nodes
- configuration queries
- connectivity information (who can the current node hear packets from) - both for queries and state change notifications
- basic group membership services
Cluster Resource Manager
The Cluster Resource Manager (CRM) is the brains of Linux-HA. It maintains the resource configuration, decides what resources should run where, how to move from the current state into the state where they are running where they ought to be. In addition, it supervises the LRM in accomplishing all these things. The CRM interacts with every component in the system:
It uses heartbeat for communication
It receives membership updates from the CCM.
It directs the work of and receives notifications from the LRM.
It tells the Stonith Daemon when and what to reset
- It logs using the logging daemon
In effect, the PE, TE, and CIB can be viewed as components of the CRM.
CRM Policy Engine (PE)
The primary goal of the policy engine is to compute a transition graph from the current state of the world. This transition graph takes into account the current location and state of resources, availability of nodes and current static configuration (aka the CurrentClusterState).
CRM Transition Engine (TE)
The Transition Engine effectively executes the transition graph produced by the PE - to make it's dreams a reality. That is, it takes a computed next state for the cluster and the list of actions and attempts to reach it by instructing the LRM on remote nodes to start and stop resources.
Cluster Information Base (CIB)
The CIB is an automatically replicated repository for cluster resource and node information as seen by the CRM. This information includes static information such as dependencies, and more dynamic information such as what resources are running where, and what their states are.
All the information in the CIB is represented in XML for maximum usability. We have an annotated DTD which defines all the information we currently manage in the CIB.
Because of this, many of the features of the CRM can be understood by reading the annotated DTD.
Consensus Cluster Membership
Provides strongly connected consensus cluster membership services. Ensures that every node in a computed membership can talk to every other node in this same membership. Implements both an OCF draft membership API, and the SAF AIS membership API. Typically it computes membership in sub-second time.
Local Resource Manager
The Local Resource Manager is basically is a resource agent abstraction. It starts, stops and monitors resources as directed by the CRM.
It has a plugin architecture, and can support many classes of resource agents. The types currently supported include:
- OCF - Open Cluster Framework
- heartbeat (release 1) style resource agents
- LSB - Linux Standards Base (normal init script) resource agents
- stonith - resource agent interface for instantiating STONITH objects to
be used by the StonithDaemon.
Other types are readily added for compatibility with other systems.
Stonith Daemon
The STONITH daemon provides cluster-wide node reset facilities using the improved release 2 stonith API.
The Stonith library includes support for around a dozen types of 'C' STONITH plugins and native support for script-based plugins - allowing scripts in any scripting language to be treated as full-fledged STONITH plugins.
The Stonith Daemon locks itself into memory, and 'C' based plugins are pre-loaded by the stonith daemon so that no disk I/O is required for a STONITH operation using 'C' methods.
The Stonith Daemon provides full support for asymmetric STONITH configurations - allowing for the possibility that certain STONITH devices may be accessible from only a subset of the nodes in the cluster.
Note that it is not currently intended that the Stonith Daemon be used for providing resource-granular fencing. Current thinking regarding resource-granular fencing calls for such fencing to be done by clone resource agents. The resources which need fencing will be dependent on these other resources. Clone resource agents are notified when their various peer resources start and stop.
logd - non-blocking logging daemon
Can log to syslog or files or both. logd never blocks.
Messages are lost if they get too far behind in preference to blocking. Count of messages lost is printed next time we can output messages. Queue sizes are controllable per-application - and overall.
Application Heartbeat Daemon (apphbd)
The application heartbeat daemon is a general service which provides watchdog timer facilities for individual HA-aware applications. When applications fail to check in with it in their prescribed time, interested parties are notified and (presumably) recovery actions taken. This daemon is a simple as we can make it, so it can be the most reliable component in the system. Many Linux-HA system components tie into it, but it is not commonly enabled in the field at this writing. It will register with the system watchdog timer if requested to do so.
Recovery Manager Daemon
The recovery manager daemon is notified by apphbd when a process fails to heartbeat or exits unexpectedly. It then takes actions to (kill and) restart the application.
Infrastructure
A key part of our success in implementation comes from having a consistent, flexible, reliable and general infrastructure underneath all the major components.
This infrastructure has several key elements:
- A flexible general plugin mechanism
- Non-blocking IPC layer
- Complete avoidance of threads (so far)
Use of Glib mainloop as our uniform dispatching (scheduling) and event processing method
- Significant consistency and integration with each other
PILS - Plugin and Interface Loading System
PILS provides a very general plugin loading system which is used extensively in Linux-HA.
This allows for great flexibility and power in the system, while minimizing the size of the running system. This tends to improve the architecture of those subsystems which use plugins. In addition, their power is increased, while minimizing the resource usage of the Linux-HA system on the host servers.
An unexpected benefit of plugins is that almost all the contributions from non-core members come in the area of plugins.
Plugins are currently used in the following areas: communication, authentication, stonith, resource agents, compression, apphbd notification methods.
IPC Library
All interprocess communication is performed using a very general IPC library which provides non-blocking access to IPC using a flexible queueing strategy, and includes integrated flow control. This IPC API does not require sockets, but the currently available implementations use UNIX (Local) Domain sockets.
This API also includes built in authentication and authorization of peer processes, and is portable to most POSIX-like OSes.
Although use of mainloop with these APIs is not required, we provide simple and convenient integration with mainloop.
Cluster Plumbing Library
The Cluster plumbing library is a collection of very useful functions which provide a variety of services used by many of our main components. A few of the major objects provided by this library include:
- compression API (with underlying compression plugins)
- Non-blocking logging API
- memory management oriented to continuously running services
- Hierarchical name-value pair messaging facility - promoting portability and version upgrade compatibility. Also provides optional message compression facilities.
- Signal unification - allowing signals to appear as mainloop events
- Core dump management utilities - promoting capture of core dumps in a uniform way, and under all circumstances
- timers (like glib mainloop timers - but they work even when the time of day clock jumps)
- child process management - death of children causes invocation of process object, with configurable death-of-child messages.
- Triggers - arbitrary events triggered by software
- Realtime management - setting and unsetting high priorities, and locked into memory attributes of processes.
- 64-bit HZ-granularity time manipulation (longclock_t)
- User id management for security purposes - for processes which need some root privileges.
Mainloop integration for IPC, plain file descriptors, signals, etc. This means that all these different event sources are managed and dispatched consistently.
Cluster Testing System
There are two kinds of bugs one finds in reports from users - those that one would not expect to find during testing and those which really should have gotten caught during testing. Linux-HA has a very low bug rate overall, and an extremely low number of bugs of the "should have gotten caught" category.
The Cluster testing system (CTS) is the primary cause for these low bug rates.
CTS is an automated cluster testing system which runs random stress tests on the cluster. Although it is in most ways a modest system with what seem like largely straightforward tests, it has proven extremely effective in practice.
It's basic strategy is: beat the software to death. Such testing has sometimes been called Bamm-Bamm testing.
CTS is an example of a system where the whole is more than simply the sum of its parts.
Glib Usage
The Linux-HA project makes extensive use version 2 of the Gnome Glib library and special use is made of the mainloop event processing structure.
The use of mainloop has made many things much easier and more uniform, and has allowed us (so far) to completely avoid threads and their attendant portability and debugging difficulties.
Data Flow through R2
All communication starts with the heartbeat layer, and every component which communicates with other cluster members does it through the heartbeat layer. In addition, the heartbeat layer provides connectivity information indicating when communication with another node is lost, and when it's restored.
- These connectivity change events are then given to the membership layer (CCM), which then sends packets to its peers in the cluster and figures out exactly which nodes are in the current membership, and which ones aren't.
- Once it has computed a new membership,the CCM notifies its clients of the change in membership. Two of the CCMs most important clients are the CRM and the CIB.
- When it receives a new membership from the CCM, the CIB updates the CIB with the information from the latest membership update, and informs its clients that the CIB has been updated.
- When the CRM notices that the CIB has changed, it then notifies the policy engine (PE)
- The PE then looks at the CIB (including the status section) and sees what things need to be done to bring the state of the cluster (as shown by the status section of the CIB) in line with the defined policies (found in the configuration section of the CIB).
- The PE then creates a directed graph of actions that need to be taken (if any) to bring the cluster into line with the policy, and gives them to the CRM.
- The CRM then gives these actions to the transition engine (TE) which then proceeds to carry them out.
- The TE then (through the CRM) directs the various LRMs across the cluster to take the specified actions.
- Each time an action completes or times out, the TE gets notified of the status.
- The TE then continues to give more actions to the LRMs according to the graph it was given.
- When all actions have been completed, the TE reports success back to the CRM.
