SFEX (Shared Disk File EXclusiveness Control Program)
Download
sfex-1.3.tar.gz (79.1 KB) [md5sum: 6bc563ad4a22d39b5f0fcf8726138ae8]
Basic Concept
- SFEX is a resource that controls ownership of a shared disk.
- SFEX uses a special partition on the shared disk and maintains the following data there:
  - "status" shows whether the disk is owned by a node.
  - "node" shows the name of the node that owns the disk.
  - "count" is used to judge whether the owner node is up or down.
- Typically, resources that use a data partition on the shared disk (such as PostgreSQL) form a resource group with SFEX.
- Only the resources on the node that holds ownership can access the data partition.
- When can a node get ownership?
  - case1: Nobody has ownership.
  - case2: The node can judge that the current owner node is down.
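The three fields and the two ownership cases can be pictured with a small Python sketch. This is only an illustration: the class `LockRecord`, the function `may_take_ownership`, and the in-memory layout are assumptions made for this example and do not describe the actual on-disk format used by SFEX.

```python
from dataclasses import dataclass

# Status values; "NO_OWNED" and "OWNED" follow the names used in this document.
NO_OWNED = "NO_OWNED"
OWNED = "OWNED"


@dataclass
class LockRecord:
    """Hypothetical in-memory view of the data SFEX keeps on its partition."""
    status: str = NO_OWNED  # whether the disk is owned by somebody
    node: str = ""          # name of the node that owns the disk
    count: int = 0          # incremented periodically by the owner


def may_take_ownership(record: LockRecord, owner_is_down: bool) -> bool:
    """Return True in case1 (nobody owns the disk) or case2 (owner judged down)."""
    if record.status == NO_OWNED:
        return True
    return owner_is_down
```

For example, `may_take_ownership(LockRecord(), owner_is_down=False)` is true because a fresh record has no owner.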
sequence diagram
start up process
SFEX starts on the node that has the highest score in cib.xml, so that more than one node does not access the shared disk at the same time.
Node A
- SFEX reads data from the shared disk and gets "status". Usually "status" is "NO_OWNED" because nobody owns the shared disk yet.
- Writes data that includes node=Node A and status=OWNED.
- Reads the data again and gets "node=Node A".
- Compares it with its own node name. If the node name has not been changed, Node A gets ownership.
- SFEX increments "count" on the shared disk during Heartbeat's monitor processing. This increment renews the ownership.
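The start-up steps above (read, write, read back, compare, then renew through "count") can be sketched as follows. `FakeDisk`, `try_acquire`, and `monitor_update` are hypothetical names for this illustration, and a dictionary stands in for the SFEX partition; the real resource agent works on a raw device.

```python
class FakeDisk:
    """Hypothetical stand-in for the SFEX partition; holds one record."""

    def __init__(self):
        self.record = {"status": "NO_OWNED", "node": "", "count": 0}

    def read(self):
        return dict(self.record)

    def write(self, record):
        self.record = dict(record)


def try_acquire(disk, my_name):
    """Write our node name, read it back, and check that it is unchanged."""
    first = disk.read()
    if first["status"] != "NO_OWNED":
        return False  # somebody owns the disk; the liveness check is not covered here
    disk.write({**first, "status": "OWNED", "node": my_name})
    second = disk.read()
    return second["node"] == my_name


def monitor_update(disk, my_name):
    """Increment "count" to renew ownership; fail if we are not the owner."""
    record = disk.read()
    if record["node"] != my_name:
        return False
    record["count"] += 1
    disk.write(record)
    return True
```

Calling `try_acquire(FakeDisk(), "nodeA")` succeeds on an unowned disk, and each later `monitor_update` call raises "count" by one.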
Heartbeat communication failure
Node A
- SFEX keeps updating ownership through Heartbeat (HB) monitor processing.
Node B
When heartbeat communication fails, the standby node (Node B) starts its resources.
- SFEX reads the data on the shared disk.
- Waits a while. The wait time should be longer than the SFEX monitor interval, so that a periodic update from Node A can be observed during the wait and confirm that Node A still maintains ownership.
- Reads the data again.
- Checks the new value of "count". If the two "count" values differ, Node A is judged to be up.
- The SFEX start-up process on Node B is stopped.
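The check that Node B performs comes down to comparing two reads of "count" taken before and after the wait. A minimal sketch, assuming a caller-supplied `read_count` function (a hypothetical name) that returns the current "count" from the shared disk:

```python
import time


def owner_is_alive(read_count, wait_seconds, sleep=time.sleep):
    """Return True if "count" changes across the wait, i.e. the owner still updates it."""
    before = read_count()
    sleep(wait_seconds)  # should be longer than the owner's monitor interval
    after = read_count()
    return before != after
```

The `sleep` parameter is injectable only so the function can be exercised without real delays.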
Active Node failure
Node A
- Node A goes down because of a failure.
Node B
Node B starts up in the same way as in the heartbeat communication failure case.
- Waits a while. It waits for a periodic update from Node A and confirms that Node A does not perform one.
- SFEX reads the data again.
- Checks the new value of "count". If the two "count" values are the same, Node A is judged to be down.
- Writes data that includes node=Node B and status=OWNED.
- Reads the data again.
- Compares it with its own node name. If the node name has not been changed, Node B gets ownership.
- Afterwards, the other resources start.
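The takeover sequence can be sketched end to end. Here a plain dictionary stands in for the record on the shared disk, and `wait` is a callable so the example can simulate what the owner does during the wait; both are assumptions for illustration, not the real SFEX interface.

```python
def take_over_if_owner_down(disk, my_name, wait):
    """Take ownership only if "count" did not change during the wait."""
    count_before = disk["count"]
    wait()  # in practice, longer than the owner's monitor interval
    count_after = disk["count"]
    if count_after != count_before:
        return False  # the owner is still updating; give up
    disk["status"] = "OWNED"
    disk["node"] = my_name
    # Read back and confirm our name is still there.
    return disk["node"] == my_name
```

If the simulated owner increments "count" inside `wait`, the function returns false and leaves the record untouched, which corresponds to the heartbeat communication failure case.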
Disk access at the same time
This rarely happens. It can occur, for example, when multiple nodes start up at the same time without heartbeat communication.
Node A / Node B
Writes to the shared disk are ultimately serialized because there is only one writable area. As a result, the node name written last remains. In this example, Node B's name remains.
- Reads the data again.
- Node A: the value of "owner" has changed, so this node does not get ownership.
- Node B: the value of "owner" is the name of Node B, so Node B gets ownership.
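The collision case can be simulated in a few lines: every node writes its name into the single writable area, and only the last writer finds its own name on read-back. The function name `resolve_collision` and the node names are hypothetical.

```python
def resolve_collision(write_order):
    """Simulate serialized writes to one area, then each node's read-back check."""
    area = None
    for name in write_order:
        area = name  # each write overwrites the previous one
    # A node gets ownership only if the area still holds its own name.
    return {name: area == name for name in write_order}
```

With `resolve_collision(["nodeA", "nodeB"])`, nodeB wrote last, so only nodeB passes the check.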
sample cib.xml
<cib admin_epoch="0" epoch="1" have_quorum="false" cib_feature_revision="1.3">
<configuration>
<crm_config>
<cluster_property_set id="set01">
<attributes>
<nvpair id="symmetric-cluster"
name="symmetric-cluster" value="true"/>
<nvpair id="no-quorum-policy"
name="no-quorum-policy" value="ignore"/>
<nvpair id="stonith-enabled"
name="stonith-enabled" value="false"/>
<nvpair id="short-resource-names"
name="short-resource-names" value="true"/>
<nvpair id="is-managed-default"
name="is-managed-default" value="true"/>
<nvpair id="default-resource-stickiness"
name="default-resource-stickiness" value="INFINITY"/>
<nvpair id="stop-orphan-resources"
name="stop-orphan-resources" value="true"/>
<nvpair id="stop-orphan-actions"
name="stop-orphan-actions" value="true"/>
<nvpair id="remove-after-stop"
name="remove-after-stop" value="false"/>
<nvpair id="default-resource-failure-stickiness"
name="default-resource-failure-stickiness" value="-INFINITY"/>
<nvpair id="stonith-action"
name="stonith-action" value="reboot"/>
<nvpair id="default-action-timeout"
name="default-action-timeout" value="120s"/>
<nvpair id="dc-deadtime"
name="dc-deadtime" value="10s"/>
<nvpair id="cluster-recheck-interval"
name="cluster-recheck-interval" value="0"/>
<nvpair id="election-timeout"
name="election-timeout" value="2min"/>
<nvpair id="shutdown-escalation"
name="shutdown-escalation" value="20min"/>
<nvpair id="crmd-integration-timeout"
name="crmd-integration-timeout" value="3min"/>
<nvpair id="crmd-finalization-timeout"
name="crmd-finalization-timeout" value="10min"/>
<nvpair id="cluster-delay"
name="cluster-delay" value="180s"/>
<nvpair id="pe-error-series-max"
name="pe-error-series-max" value="-1"/>
<nvpair id="pe-warn-series-max"
name="pe-warn-series-max" value="-1"/>
<nvpair id="pe-input-series-max"
name="pe-input-series-max" value="-1"/>
<nvpair id="startup-fencing"
name="startup-fencing" value="true"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes/>
<resources>
<group id="grpPostgreSQLDB">
<primitive id="prmExPostgreSQLDB" class="ocf" type="sfex" provider="heartbeat">
<operations>
<op id="exPostgreSQLDB_start"
name="start" timeout="180s" on_fail="fence"/>
<op id="exPostgreSQLDB_monitor"
name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
<op id="exPostgreSQLDB_stop"
name="stop" timeout="60s" on_fail="fence"/>
</operations>
<instance_attributes id="atrExPostgreSQLDB">
<attributes>
<nvpair id="dskPostgreSQLDB"
name="device" value="/dev/cciss/c1d0p1"/>
<nvpair id="idxPostgreSQLDB"
name="index" value="1"/>
<nvpair id="cltPostgreSQLDB"
name="collision_timeout" value="1"/>
<nvpair id="lctPostgreSQLDB"
name="lock_timeout" value="70"/>
<nvpair id="mntPostgreSQLDB"
name="monitor_interval" value="10"/>
<nvpair id="fckPostgreSQLDB"
name="fsck" value="/sbin/fsck -p /dev/cciss/c1d0p2"/>
<nvpair id="fcmPostgreSQLDB"
name="fsck_mode" value="check"/>
<nvpair id="hltPostgreSQLDB"
name="halt" value="/sbin/halt -f -n -p"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="prmFsPostgreSQLDB" class="ocf" type="Filesystem" provider="heartbeat">
<operations>
<op id="fsPostgreSQLDB_start"
name="start" timeout="60s" on_fail="fence"/>
<op id="fsPostgreSQLDB_monitor"
name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
<op id="fsPostgreSQLDB_stop"
name="stop" timeout="60s" on_fail="fence"/>
</operations>
<instance_attributes id="atrFsPostgreSQLDB">
<attributes>
<nvpair id="devPostgreSQLDB"
name="device" value="/dev/cciss/c1d0p2"/>
<nvpair id="dirPostgreSQLDB"
name="directory" value="/mnt/shared-disk"/>
<nvpair id="fstPostgreSQLDB"
name="fstype" value="ext3"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="prmIpPostgreSQLDB" class="ocf" type="IPaddr" provider="heartbeat">
<operations>
<op id="ipPostgreSQLDB_start"
name="start" timeout="60s" on_fail="fence"/>
<op id="ipPostgreSQLDB_monitor"
name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
<op id="ipPostgreSQLDB_stop"
name="stop" timeout="60s" on_fail="fence"/>
</operations>
<instance_attributes id="atrIpPostgreSQLDB">
<attributes>
<!-- change ip address attribute -->
<nvpair id="ipPostgreSQLDB" name="ip" value="aaa.bbb.ccc.ddd"/>
<nvpair id="maskPostgreSQLDB" name="netmask" value="nn"/>
<nvpair id="nicPostgreSQLDB" name="nic" value="bond0"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql" provider="heartbeat">
<operations>
<op id="apPostgreSQLDB_start"
name="start" timeout="60s" on_fail="fence"/>
<op id="apPostgreSQLDB_monitor"
name="monitor" interval="30s" timeout="60s" on_fail="fence"/>
<op id="apPostgreSQLDB_stop"
name="stop" timeout="60s" on_fail="fence"/>
</operations>
<instance_attributes id="atrApPostgreSQLDB">
<attributes>
<nvpair id="pgctl01"
name="pgctl" value="/usr/local/pgsql/bin/pg_ctl"/>
<nvpair id="psql01"
name="psql" value="/usr/local/pgsql/bin/psql"/>
<nvpair id="pgdata01"
name="pgdata" value="/mnt/shared-disk/pgsql/data"/>
<nvpair id="pgdba01"
name="pgdba" value="postgres"/>
<nvpair id="pgdb01"
name="pgdb" value="template1"/>
<nvpair id="logfile01"
name="logfile" value="/var/log/pgsql.log"/>
</attributes>
</instance_attributes>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="rlcPostgreSQLDB" rsc="grpPostgreSQLDB">
<rule id="rulPostgreSQLDB_node01" score="200">
<expression id="expPostgreSQLDB_node01"
attribute="#uname" operation="eq" value="sfex01" />
</rule>
<rule id="rulPostgreSQLDB_node02" score="100">
<expression id="expPostgreSQLDB_node02"
attribute="#uname" operation="eq" value="sfex02"/>
</rule>
</rsc_location>
<rsc_location id="ping1:disconn" rsc="grpPostgreSQLDB">
<rule id="ping1:disconn:rule" score="-INFINITY" boolean_op="and">
<expression id="ping1:disconn:expr:defined"
attribute="default_ping_set" operation="defined"/>
<expression id="ping1:disconn:expr:positive"
attribute="default_ping_set" operation="lt" value="100"/>
</rule>
</rsc_location>
</constraints>
</configuration>
<status/>
</cib>
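As a sanity check on the timing values in the sample, the following sketch extracts the SFEX parameters from a fragment of the configuration and compares them. It assumes that "lock_timeout" plays the role of the wait time described earlier, which should be longer than the monitor interval; that mapping is an assumption of this example, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the sample cib.xml above (only two parameters kept).
FRAGMENT = """
<instance_attributes id="atrExPostgreSQLDB">
  <attributes>
    <nvpair id="lctPostgreSQLDB" name="lock_timeout" value="70"/>
    <nvpair id="mntPostgreSQLDB" name="monitor_interval" value="10"/>
  </attributes>
</instance_attributes>
"""


def sfex_params(xml_text):
    """Collect name/value pairs from all nvpair elements."""
    root = ET.fromstring(xml_text)
    return {nv.get("name"): nv.get("value") for nv in root.iter("nvpair")}


def wait_exceeds_monitor_interval(params):
    """Assumption: lock_timeout is the wait time and must exceed monitor_interval."""
    return int(params["lock_timeout"]) > int(params["monitor_interval"])
```

For the sample values (70 and 10), the check passes.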
Release Notes
sfex -- Ver1.3
- 2008/03/07
