SFEX (Shared Disk File EXclusiveness Control Program)

Download

Basic Concept

sequence diagram

start up process

SFEX starts only on the node that has the highest score in cib.xml, so that more than one node does not access the shared disk at the same time.

Node A

  1. SFEX reads data from the shared disk and gets "status". Usually "status" is "NO_OWNED" because nobody owns the shared disk yet.
  2. Writes data that includes node=Node A and status=OWNED.
  3. Reads the data again and gets "node=Node A".
  4. Compares it with its own node name. If the node name has not been changed, Node A gets ownership.
  5. SFEX increments "count" on the shared disk during the monitor processing of heartbeat. This increment is how ownership is renewed.
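
The start-up steps above can be sketched as follows. This is only an illustration: the `LockRecord` class is a hypothetical in-memory stand-in for the control data on the shared disk, and the names `try_acquire` and `renew` are invented for this example, not taken from SFEX itself.

```python
from dataclasses import dataclass


@dataclass
class LockRecord:
    """Hypothetical stand-in for the SFEX control data on the shared disk."""
    status: str = "NO_OWNED"
    node: str = ""
    count: int = 0


def try_acquire(disk: LockRecord, my_name: str) -> bool:
    # Step 1: read the current status.
    if disk.status != "NO_OWNED":
        return False  # somebody may own the disk; see the failure cases
    # Step 2: write our node name and the OWNED status.
    disk.node = my_name
    disk.status = "OWNED"
    # Steps 3-4: read the data back and compare it with our own name.
    return disk.node == my_name


def renew(disk: LockRecord, my_name: str) -> None:
    # Step 5: the monitor operation increments "count" to renew ownership.
    if disk.status == "OWNED" and disk.node == my_name:
        disk.count += 1


disk = LockRecord()
acquired = try_acquire(disk, "Node A")
renew(disk, "Node A")
```

A real implementation reads and writes a disk block instead of a Python object, so the read-back in steps 3-4 can actually observe a value written by another node.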

Heartbeat communication failure

Node A

  1. SFEX keeps updating its ownership through HB monitor processing.

Node B

  1. When heartbeat communication fails, the standby node (Node B) starts resources.

  2. SFEX reads data on the shared disk.
  3. Waits a while. The wait time should be longer than the sfex monitor interval. During this wait, Node B waits for the periodic update from Node A and confirms that Node A maintains ownership.
  4. Reads the data again.
  5. Checks the new value of "count". When the two values of "count" are different, Node B can conclude that Node A is up.
  6. The SFEX start-up process is stopped.
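
The check that Node B performs can be sketched like this. It is a minimal illustration, not SFEX code: `owner_is_alive` and its parameters are hypothetical names, and the waiting is simulated by a function that pretends Node A's monitor runs once every 10 seconds.

```python
import time
from typing import Callable


def owner_is_alive(read_count: Callable[[], int],
                   wait_seconds: float,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Read "count", wait longer than the monitor interval, read again."""
    first = read_count()
    sleep(wait_seconds)
    second = read_count()
    # A changed count means the owner renewed its ownership meanwhile.
    return first != second


# Simulated shared disk; Node A's monitor increments "count" every 10 s.
state = {"count": 5}


def node_a_keeps_updating(seconds: float) -> None:
    # Stand-in for real waiting: apply the updates Node A would have made.
    state["count"] += int(seconds // 10)


alive = owner_is_alive(lambda: state["count"], 15, sleep=node_a_keeps_updating)
```

If the wait were shorter than the monitor interval, Node A might not have updated "count" yet and would wrongly look dead, which is why the wait time has to be the longer of the two.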

Active Node failure

Node A

  1. Node A goes down because of a failure.

Node B

Node B starts up in the same way as in the HB communication failure case.

  1. Waits for a while. It waits for the periodic update from Node A, but Node A does not perform it.
  2. SFEX reads the data again.
  3. Checks the new value of "count". Because the two values of "count" are the same, Node B can conclude that Node A is down.
  4. Writes data that includes node=Node B and status=OWNED.
  5. Reads the data again.
  6. Compares it with its own node name. If the node name has not been changed, Node B gets ownership.
  7. Afterwards, the other resources start.
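
A rough sketch of the takeover path follows. The shared disk is modelled as a plain dictionary, and `take_over_if_stale` is a hypothetical name used only for this illustration. Whether "count" is reset on takeover is not stated here, so the sketch leaves it untouched.

```python
from typing import Callable


def take_over_if_stale(disk: dict, my_name: str,
                       wait: Callable[[], None]) -> bool:
    """Take ownership only if "count" did not change during the wait."""
    before = disk["count"]
    wait()  # should last longer than the owner's monitor interval
    if disk["count"] != before:
        return False  # the owner is still renewing; do not start
    # The owner looks down: write our name and the OWNED status.
    disk["node"] = my_name
    disk["status"] = "OWNED"
    # Read back and compare with our own name before claiming ownership.
    return disk["node"] == my_name


# Node A has failed, so nothing changes "count" while Node B waits.
disk = {"status": "OWNED", "node": "Node A", "count": 42}
took_over = take_over_if_stale(disk, "Node B", wait=lambda: None)
```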

Disk access at the same time

This rarely happens. It can occur, for example, when multiple nodes start up at the same time without heartbeat communication.

Node A / Node B

Writes to the shared disk are ultimately serialized because there is only one writable area. As a result, the node name that was written last remains. In this example, Node B remains.

  1. Each node reads the data again.
  2. Node A: the value of "owner" has changed, so this node does not get ownership. Node B: the value of "owner" is the name of Node B, so Node B gets ownership.
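
The "last writer wins" outcome can be illustrated with a toy simulation. This is an assumption-laden sketch: the single writable area is modelled as one dictionary entry, and the function name `race` is invented for this example.

```python
def race(write_order: list) -> dict:
    """Simulate nodes writing to the single writable area in some order."""
    disk = {"owner": None}
    for name in write_order:
        # Each write replaces the previous one; only the last one survives.
        disk["owner"] = name
    # Every node then reads the data again and compares it with its name.
    return {name: disk["owner"] == name for name in write_order}


# Node A writes first and Node B writes last, as in the example above.
result = race(["Node A", "Node B"])
```

Here `result` maps each node name to whether it obtained ownership, so at most one entry is true regardless of the order of the writes.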

sample cib.xml

 <cib admin_epoch="0" epoch="1" have_quorum="false" cib_feature_revision="1.3">
  <configuration>
    <crm_config>
      <cluster_property_set id="set01">
        <attributes>
          <nvpair id="symmetric-cluster"
            name="symmetric-cluster" value="true"/>
          <nvpair id="no-quorum-policy"
            name="no-quorum-policy" value="ignore"/>
          <nvpair id="stonith-enabled"
            name="stonith-enabled" value="false"/>
          <nvpair id="short-resource-names"
            name="short-resource-names" value="true"/>
          <nvpair id="is-managed-default"
            name="is-managed-default" value="true"/>
          <nvpair id="default-resource-stickiness"
            name="default-resource-stickiness" value="INFINITY"/>
          <nvpair id="stop-orphan-resources"
            name="stop-orphan-resources" value="true"/>
          <nvpair id="stop-orphan-actions"
            name="stop-orphan-actions" value="true"/>
          <nvpair id="remove-after-stop"
            name="remove-after-stop" value="false"/>
          <nvpair id="default-resource-failure-stickiness"
            name="default-resource-failure-stickiness" value="-INFINITY"/>
          <nvpair id="stonith-action"
            name="stonith-action" value="reboot"/>
          <nvpair id="default-action-timeout"
            name="default-action-timeout" value="120s"/>
          <nvpair id="dc-deadtime"
            name="dc-deadtime" value="10s"/>
          <nvpair id="cluster-recheck-interval"
            name="cluster-recheck-interval" value="0"/>
          <nvpair id="election-timeout"
            name="election-timeout" value="2min"/>
          <nvpair id="shutdown-escalation"
            name="shutdown-escalation" value="20min"/>
          <nvpair id="crmd-integration-timeout"
            name="crmd-integration-timeout" value="3min"/>
          <nvpair id="crmd-finalization-timeout"
            name="crmd-finalization-timeout" value="10min"/>
          <nvpair id="cluster-delay"
            name="cluster-delay" value="180s"/>
          <nvpair id="pe-error-series-max"
            name="pe-error-series-max" value="-1"/>
          <nvpair id="pe-warn-series-max"
            name="pe-warn-series-max" value="-1"/>
          <nvpair id="pe-input-series-max"
            name="pe-input-series-max" value="-1"/>
          <nvpair id="startup-fencing"
            name="startup-fencing" value="true"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes/>
    <resources>
      <group id="grpPostgreSQLDB">
        <primitive id="prmExPostgreSQLDB" class="ocf" type="sfex" provider="heartbeat">
          <operations>
            <op id="exPostgreSQLDB_start"
              name="start" timeout="180s" on_fail="fence"/>
            <op id="exPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="exPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          </operations>
          <instance_attributes id="atrExPostgreSQLDB">
            <attributes>
              <nvpair id="dskPostgreSQLDB"
                name="device" value="/dev/cciss/c1d0p1"/>
              <nvpair id="idxPostgreSQLDB"
                name="index" value="1"/>
              <nvpair id="cltPostgreSQLDB"
                name="collision_timeout" value="1"/>
              <nvpair id="lctPostgreSQLDB"
                name="lock_timeout" value="70"/>
              <nvpair id="mntPostgreSQLDB"
                name="monitor_interval" value="10"/>
              <nvpair id="fckPostgreSQLDB"
                name="fsck" value="/sbin/fsck -p /dev/cciss/c1d0p2"/>
              <nvpair id="fcmPostgreSQLDB"
                name="fsck_mode" value="check"/>
              <nvpair id="hltPostgreSQLDB"
                name="halt" value="/sbin/halt -f -n -p"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="prmFsPostgreSQLDB" class="ocf" type="Filesystem" provider="heartbeat">
          <operations>
            <op id="fsPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="fsPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="fsPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          </operations>
          <instance_attributes id="atrFsPostgreSQLDB">
            <attributes>
              <nvpair id="devPostgreSQLDB"
                name="device" value="/dev/cciss/c1d0p2"/>
              <nvpair id="dirPostgreSQLDB"
                name="directory" value="/mnt/shared-disk"/>
              <nvpair id="fstPostgreSQLDB"
                name="fstype" value="ext3"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="prmIpPostgreSQLDB" class="ocf" type="IPaddr" provider="heartbeat">
          <operations>
            <op id="ipPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="ipPostgreSQLDB_monitor"
              name="monitor" interval="10s" timeout="60s" on_fail="fence"/>
            <op id="ipPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          </operations>
          <instance_attributes id="atrIpPostgreSQLDB">
            <attributes>
              <!-- change ip address attribute -->
              <nvpair id="ipPostgreSQLDB" name="ip" value="aaa.bbb.ccc.ddd"/>
              <nvpair id="maskPostgreSQLDB" name="netmask" value="nn"/>
              <nvpair id="nicPostgreSQLDB" name="nic" value="bond0"/>
            </attributes>
          </instance_attributes>
        </primitive>
        <primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql" provider="heartbeat">
          <operations>
            <op id="apPostgreSQLDB_start"
              name="start" timeout="60s" on_fail="fence"/>
            <op id="apPostgreSQLDB_monitor"
              name="monitor" interval="30s" timeout="60s" on_fail="fence"/>
            <op id="apPostgreSQLDB_stop"
              name="stop" timeout="60s" on_fail="fence"/>
          </operations>
          <instance_attributes id="atrApPostgreSQLDB">
            <attributes>
              <nvpair id="pgctl01"
                name="pgctl" value="/usr/local/pgsql/bin/pg_ctl"/>
              <nvpair id="psql01"
                name="psql" value="/usr/local/pgsql/bin/psql"/>
              <nvpair id="pgdata01"
                name="pgdata" value="/mnt/shared-disk/pgsql/data"/>
              <nvpair id="pgdba01"
                name="pgdba" value="postgres"/>
              <nvpair id="pgdb01"
                name="pgdb" value="template1"/>
              <nvpair id="logfile01"
                name="logfile" value="/var/log/pgsql.log"/>
            </attributes>
          </instance_attributes>
        </primitive>
      </group>
    </resources>
    <constraints>
      <rsc_location id="rlcPostgreSQLDB" rsc="grpPostgreSQLDB">
        <rule id="rulPostgreSQLDB_node01" score="200">
          <expression id="expPostgreSQLDB_node01"
            attribute="#uname" operation="eq" value="sfex01" />
        </rule>
        <rule id="rulPostgreSQLDB_node02" score="100">
          <expression id="expPostgreSQLDB_node02"
            attribute="#uname" operation="eq" value="sfex02"/>
        </rule>
      </rsc_location>
      <rsc_location id="ping1:disconn" rsc="grpPostgreSQLDB">
        <rule id="ping1:disconn:rule" score="-INFINITY" boolean_op="and">
          <expression id="ping1:disconn:expr:defined"
            attribute="default_ping_set" operation="defined"/>
          <expression id="ping1:disconn:expr:positive"
            attribute="default_ping_set" operation="lt" value="100"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
  <status/>
 </cib>
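
As a small sanity check, the sfex parameters can be read from such a file and compared. The sketch below parses a fragment of the sample with Python's standard `xml.etree.ElementTree`. It assumes that `lock_timeout` plays the role of the wait time described in the failure scenarios, which this page does not state explicitly; the helper name `sfex_params` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the sample cib.xml (sfex primitive only).
CIB_FRAGMENT = """
<primitive id="prmExPostgreSQLDB" class="ocf" type="sfex" provider="heartbeat">
  <instance_attributes id="atrExPostgreSQLDB">
    <attributes>
      <nvpair id="lctPostgreSQLDB" name="lock_timeout" value="70"/>
      <nvpair id="mntPostgreSQLDB" name="monitor_interval" value="10"/>
    </attributes>
  </instance_attributes>
</primitive>
"""


def sfex_params(xml_text: str) -> dict:
    """Collect name/value pairs from every nvpair element."""
    root = ET.fromstring(xml_text)
    return {nv.get("name"): nv.get("value") for nv in root.iter("nvpair")}


params = sfex_params(CIB_FRAGMENT)
# Assumption: lock_timeout is the wait time, so it must exceed the interval.
wait_is_long_enough = int(params["lock_timeout"]) > int(params["monitor_interval"])
```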

Release Notes

sfex (last edited 2008-06-30 10:01:51 by TakayukiTanaka)