Frequently Asked Questions about DRBD
Contents
- Frequently Asked Questions about DRBD
- General Issues
- Setup and Installation
- Can I encrypt/compress the exchanged data?
- Can I mount the secondary at least readonly?
- Why does DRBD not allow concurrent access from all nodes? I'd like to use it with GFS/OCFS2...
- Can DRBD use two devices of different size?
- Can XFS be used with DRBD?
- When I try to load the drbd module, I am getting the following error: compiled for kernel version ''some version'' while this kernel is ''some other version''
- Can I use DRBD with LVM?
- I use DRBD and Linux-vServer, and I cannot umount anymore: "file system in use"
- What about Xen, DRBD and iSCSI?
- Can I use DRBD with OpenVZ?
- Operation Issues
- drbdadm create-md fails with "Operation refused." - what can I do?
- I get "access beyond end of device" errors, drbd is broken!?
- Why is Synchronization (SyncingAll) so slow?
- How can I speed up the Synchronization performance?
- How can I speed up write throughput?
- Why is my "load average" that high?
- What is warning: ''Return code 255 from /etc/ha.d/resource.d/datadisk'' telling me when using the datadisk script with heartbeat?
- When the node goes from secondary to primary the drbd device will not be mounted on the primary. Manually mounting works.
- What is warning: ''out of vmalloc space''
- What do the fields like st, ns, nr, dw, dr etc. in /proc/drbd mean?
General Issues
Please have a look at some of the publications and documentation.
What is DRBD, to begin with?
DRBD, developed by Philipp Reisner and Lars Ellenberg, is a
Distributed
Replicated
Block
Device
for the Linux operating system. It lets you keep a real-time mirror of your local block devices on a remote machine. In conjunction with heartbeat, it lets you build HA (high availability) Linux clusters.
Which license conditions apply to DRBD?
- DRBD is released under the GNU GENERAL PUBLIC LICENSE Version 2, June 1991 (GPL). Thus, within the conditions of this license, it can be freely distributed and modified.
Where do I get support for DRBD?
At LinBit.
Where can I download DRBD, or get more information?
- The latest version of DRBD is available for download from LinBit or from drbd.org. DRBD is also included in many Linux distributions, such as Debian, SuSE, RedHat and others.
There is also a git repository and a mailing list.
Setup and Installation
Can I encrypt/compress the exchanged data?
- Yes, but not within DRBD itself. You 'just' need to set up a VPN; the network stack will then take care of it. For a lightweight solution, have a look at the CIPE project. Of course, IPsec or OpenVPN will do, too.
Can I mount the secondary at least readonly?
Short answer: No! But see also the next question and answer.
- DRBD would not care, but most likely your filesystem will be confused, because it is not aware of changes in the underlying device. In general this means it cannot work, whether with ext2, ext3, reiserFS, JFS or XFS.
Thus, if you want to mount the secondary, make the secondary the primary first. Mounting both devices at the same time does not work. Actually, DRBD v8 does support two Primaries; see the next answer. If you need access to the data from both nodes, and from an arbitrary number of other clients, consider using HaNFS.
Why does DRBD not allow concurrent access from all nodes? I'd like to use it with GFS/OCFS2...
- Actually, DRBD version 8.0.x and later support this.
If you need not just a mirrored but a shared filesystem, use OCFS2 or GFS2, for example. But these are much slower, and typically expect write access on all nodes in question. If more than one node concurrently modifies distributed devices, there are some "interesting" problems in deciding which part of the device is up-to-date on which node, and which blocks need to be resynchronized in which direction. These problems have been solved. You need to set net { allow-two-primaries; } to activate this mode. But handling DRBD in "cluster fs mode" is still more complex and cumbersome than "classical" one-node-at-a-time access.
- Another option would be to have only one node active, export that device via iSCSI, then run OCFS2 on iSCSI.
Also have a look at the DRBD Changelog.
Can DRBD use two devices of different size?
- Generally yes, but there are some issues to consider:
Locally, DRBD uses the configured disk-size, which has to be <= the physical size; if it is not given, it is set to the physical size. On connect, the device size is set to the minimum of both nodes. Here you can run into problems if you proceed carelessly: if you first use drbd on one node only, without disk-size configured properly, and later connect a node with a smaller device size, then the drbd device size shrinks at runtime. In that case you should find a message like "Your size hint is bogus, please change to <some value>" in the syslog. This will confuse the file system on top of your device. Thus, if your device sizes differ, set the size to be used for DRBD explicitly.
DRBD-0.7 stores information about the peer's device size in its local meta data; therefore use of disk-size is deprecated (and is disallowed in the configuration file).
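As a rough illustration of the "minimum of both nodes" rule described above, the following sketch computes the size a connected pair would end up with. The function name and the sample sizes are made up for illustration; this is not something DRBD ships.

```shell
# Hypothetical helper: the device size DRBD would settle on after connect,
# i.e. the smaller of the two backing device sizes.
# Both arguments must use the same unit (kB in the example below).
drbd_effective_size() {
    local_size=$1
    peer_size=$2
    if [ "$local_size" -le "$peer_size" ]; then
        echo "$local_size"
    else
        echo "$peer_size"
    fi
}

# Made-up sizes in kB: the pair ends up with the smaller value.
drbd_effective_size 105996744 104857600
```

If the result is smaller than what the file system on the first node already uses, that is exactly the shrink-at-runtime situation to avoid.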
Can XFS be used with DRBD?
- XFS uses dynamic block size, thus DRBD 0.7 or later is needed.
When I try to load the drbd module, I am getting the following error: compiled for kernel version ''some version'' while this kernel is ''some other version''
The settings for your running kernel and the .config of the kernel source against which drbd was built do not match. On SuSE Linux you can get the right config with the following commands:
cd /usr/src/linux/ && make cloneconfig && make dep
Usually, you do not have to recompile your kernel, just drbd. But read INSTALL in the drbd tgz to learn how to do it the proper way.
Can I use DRBD with LVM?
- Yes. With LVM2, snapshots are writable, so you can replay the journal on the snapshot. But see also "A Summary of LVM snapshots with DRBD", posted on 2004-04-08 in drbd-user.
I use DRBD and Linux-vServer, and I cannot umount anymore: "file system in use"
Maybe http://linux-vserver.org/advanced+DRBD+mount+issues helps.
What about Xen, DRBD and iSCSI?
There are always interesting discussions on http://lists.xensource.com/archives/html/xen-users/
Can I use DRBD with OpenVZ?
Operation Issues
drbdadm create-md fails with "Operation refused." - what can I do?
- the actual error message looks like
Found $some filesystem which uses $somuch kB
current configuration leaves usable $less kB
Device size would be truncated, which
would corrupt data and result in
'access beyond end of device' errors.
You need to either
* use external meta data (recommended)
* shrink that filesystem first
* zero out the device (destroy the filesystem)
Operation refused.
- which means
- you created your filesystem before you created your DRBD resource, or
- you created your filesystem on your backing device, rather than your DRBD,
- neither of which is a problem by itself, except - as the error message tries to hint - you need to enlarge the device (e.g. lvextend), shrink the filesystem (e.g. resize2fs), or place the DRBD metadata somewhere else (external meta data).
- DRBD tries to detect an existing use of the block device in question. E.g. if it detects an existing file system that uses all the available space (as is the default for most filesystems), and you try to use DRBD with internal meta data, there is no room for the internal meta data; creating it would corrupt the last few MiB of the existing file system.
- If re-creating the filesystem on the DRBD is an option, one way to "zero out the device (destroy the filesystem)", and then recreate it on the DRBD is
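# wipe the start of the backing device, destroying the old filesystem signature
dd if=/dev/zero bs=1M count=1 of=/dev/sdXYZ; sync
# create the DRBD meta data for resource $r
drbdadm create-md $r
# make this node primary, overwriting the peer's data
drbdadm -- -o primary $r
# create the new filesystem on the DRBD device, not on the backing device
mkfs /dev/drbdY
- If drbdadm did not refuse, you would soon be back reading the next answer.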
I get "access beyond end of device" errors, drbd is broken!?
- Your kernel log fills with
attempt to access beyond end of device
drbd0: rw=1, want=211992584, limit=211986944
Buffer I/O error on device drbd0, logical block 26499072
Your file system then remounts read-only, panics or similar. When you try to fsck, you get something like
The filesystem size (according to the superblock) is ... blocks.
The physical size of the device is ...+x blocks.
Envision this:
|-- usable area with drbd and internal meta data --|-IMD-|
|-- real device -----------------------------------------|
IMD is "internal meta data". Once created, it has a fixed size. With drbd 0.7 it was a fixed 128MB. With drbd 8.0 it is approximately [total storage of real device]/4/8/512/2 rounded up, +36k, rounded up to the next 4k.
Example:
grep -e hda4 -e drbd0 /proc/partitions
   3     4  105996744 hda4
 147     0  105993472 drbd0
ceil(105996744 kB / 32768) == 3235 kB
+ 36 kB == 3271 kB
4k aligned == 3272 kB
105996744 kB - 3272 kB == 105993472 kB
- If you did mkfs /real/device and later mount through DRBD, the file system either recognizes the size mismatch between its superblock and the actual block device size on the spot and refuses to mount (xfs does this, iirc).
- Or the file system mounts all right, because it skips the check for block device size (ext3, or at least certain versions of it, apparently does this; it is ok for a file system to assume that its superblock contains valid data), and then thinks it can use the space that is no longer available because it is occupied by IMD.
- There are various ways to find out what your file system thinks about the usable space it occupies. For ext3, you can find out with e.g.
tune2fs -l /dev/whatever | awk '/^Block.size:/ { bs=$NF } /^Block.count:/ { bc=$NF } END { print bc * bs / 1024, "kB" }'
- As long as the file system does not want to use that area, it won't notice. If the file system eventually decides to use that area, whoops, surprise, it gets an access beyond end of device error. When the file system will start using that area is nearly impossible to predict. So it may appear to work fine for months, and then suddenly break again and again.
This is not a problem with drbd. It is a problem with using drbd incorrectly. Also see http://thread.gmane.org/gmane.linux.network.drbd/12690/focus=12692 or search the list archives for more ascii art and explanations.
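The approximate drbd 8.0 internal meta data size rule quoted above ([size]/4/8/512/2 rounded up, +36k, rounded up to the next 4k) can be sketched in shell arithmetic as follows. The function names are made up for illustration, the input is assumed to be the backing device size in kB as shown in /proc/partitions, and the result is only as exact as the rule of thumb itself.

```shell
# Approximate internal meta data (IMD) size in kB for drbd 8.0,
# following the rule of thumb from the text:
#   ceil(size_kB / 32768) + 36 kB, rounded up to the next multiple of 4 kB.
imd_kb() {
    size_kb=$1
    bitmap_kb=$(( (size_kb + 32767) / 32768 ))   # ceil(size_kB / 32768)
    total_kb=$(( bitmap_kb + 36 ))               # plus 36 kB
    echo $(( (total_kb + 3) / 4 * 4 ))           # align up to 4 kB
}

# Space left for the file system once the IMD is subtracted.
usable_kb() {
    echo $(( $1 - $(imd_kb "$1") ))
}

# Values from the hda4 example above.
imd_kb 105996744      # 3272
usable_kb 105996744   # 105993472
```

Comparing the usable_kb result with the file system size reported by the tune2fs one-liner shows whether an existing file system would reach into the meta data area.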
Why is Synchronization (SyncingAll) so slow?
Outdated; applies to drbd versions prior to drbd-0.6.4 only.
For historical reasons, replication used to work backwards. Most physical devices have pretty slow throughput when writing data backwards.
How can I speed up the Synchronization performance?
- Double-check the value of sync-max in the net {} section (drbd-0.6) or of rate in the syncer {} section (drbd-0.7). Keep in mind that the default value is very low, and the default unit is kByte/sec!
- If you run on top of some local RAID, make sure it is not reconstructing at the same time.
- Check whether DMA is enabled.
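To get a feeling for what a given rate means in practice, the sketch below estimates how long a full synchronization would take. It assumes the configured rate is actually sustained for the whole run, which real disks and networks rarely manage; the function name and the sample rates are made up for illustration and are not DRBD defaults.

```shell
# Hypothetical helper: seconds needed to copy size_kb kByte at a
# sustained rate of rate_kbps kByte/sec, rounded up.
sync_seconds() {
    size_kb=$1
    rate_kbps=$2
    echo $(( (size_kb + rate_kbps - 1) / rate_kbps ))
}

# Made-up example: a 105993472 kB device at two different rates.
sync_seconds 105993472 10240   # 10351  (just under 3 hours)
sync_seconds 105993472 1024    # 103510 (more than a day)
```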
How can I speed up write throughput?
- First you need to find the bottleneck. This can be your local disk, the network, the remote disk, latency caused by excessive seeks, or the summed up latency of those components.
You may want to play with the values of protocol and sndbuf-size. If your NIC supports it, you may want to enable "jumbo frames" (increase the MTU). If nothing helps, ask the list for known good and performant setups.
Why is my "load average" that high?
- Load average is defined as the average number of processes in the run queue during a given interval. A process is in the run queue if it is
- not waiting for external events (e.g. select on some fd)
- not waiting on its own (not called "wait" explicitly)
- not stopped
Note that all processes waiting for disk io are counted as runnable! Therefore, if a lot of processes wait for disk io, the "load average" goes straight up, though the system may actually be almost idle cpu-wise. E.g. crash your nfs server, and start 100 ls /path/to/non-cached/dir/on/nfs/mount-point on a client: you get a "load average" of 100+ for as long as the nfs timeout, which might be weeks, though the cpu does nothing.
Verify your system load by other means, e.g. vmstat or sysstat/sar. This will give you an idea of the bottleneck of your system. Some ideas are using multiple disks (not just partitions!) or even a RAID with 10,000 rpm SCSI disks, and probably even Gigabit Ethernet. Even on a Fast Ethernet device you will rarely see more than 6 MByte per second (100 MBit/s is at most 12.5 MByte/s, minus protocol overhead, latency etc.).
What is warning: ''Return code 255 from /etc/ha.d/resource.d/datadisk'' telling me when using the datadisk script with heartbeat?
DRBD-0.6 only
Exit code 255 is most likely from a script-generated die, which includes a verbose error message. Capture the output of that script; this is the debugfile directive in your ha.cf, iirc. If that does not help, run it by hand and see what error message it gives. datadisk says something like "cannot promote to primary, synchronization running" or "fsck failed" or ...
When the node goes from secondary to primary the drbd device will not be mounted on the primary. Manually mounting works.
Feature ...
DRBD does not automatically mount the partition. The script datadisk (or drbddisk since 0.7) is made for that purpose. It is intended to be called by heartbeat.
What is warning: ''out of vmalloc space''
For each device, drbd will (try to) allocate X MB of bitmap, plus some constant amount (<1MB). X = storage_size_in_GB/32, so 1 TB storage -> 32 MB bitmap.
By default Linux allocates 128MB to Vmalloc. For systems using more than 4TB, this may cause an issue.
If you find the following error message in /var/log/messages, try a Linux 2.6 hugemem kernel:
kernel: allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.
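The bitmap rule of thumb above (X = storage_size_in_GB/32) is easy to check against the 128MB vmalloc default. A minimal sketch, assuming sizes are given in whole GB with 1 TB taken as 1024 GB, and ignoring the small constant overhead per device; the function name is made up for illustration:

```shell
# Approximate bitmap size in MB that drbd tries to allocate for one device,
# per the rule of thumb in the text: storage size in GB divided by 32
# (rounded up here; the constant <1MB overhead is ignored).
bitmap_mb() {
    storage_gb=$1
    echo $(( (storage_gb + 31) / 32 ))
}

bitmap_mb 1024   # 32   (1 TB)
bitmap_mb 4096   # 128  (4 TB, already the whole 128MB vmalloc default)
```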
What do the fields like st, ns, nr, dw, dr etc. in /proc/drbd mean?
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
^  ^            ^                    ^                    ^ ^-[*]
|  |            |                    |                    |
|  |            |                    |      wire protocol -'
|  |            |                    `- disk state
|  |            `- state (should be named role, but historically)
|  `- connection state
`- minor number

   ns:67582830 nr:1293290 dw:68880243 dr:124296536
   net sent    net read   disk write  disk read

   al:13693                            bm:101
   meta data updates for activity log  bitmap

 ,--------- lo:0 pe:0 ua:0 ap:0
 gauges of currently pending requests, see below.

   ,- resync: used:0/31 hits:335 misses:109 starving:0 dirty:0 changed:109
   |- act_log: used:0/1801 hits:6527480 misses:13790 starving:0 dirty:97 changed:13693
 cache statistics for the resync and activity log in memory cache.
 you can safely ignore these.
[*]: four characters showing certain flag bits
- The first character is
- [rs]: io running(resumed)/suspended.
- See drbdadm suspend-io/resume-io. Also temporarily set implicitly by fencing resource-and-stonith.
- a: implicitly paused because of sync-after dependency on this node
- p: implicitly paused because of sync-after dependency on the peer node
- u: explicitly suspended by the user, see drbdadm pause-sync/resume-sync
- cs
- connection state
Unconfigured
Device waits for configuration.
StandAlone
Not trying to connect to peer, IO requests are only passed on locally.
Unconnected
Transitory state, while bind() blocks.
WFConnection
Device waits for configuration of other side.
WFReportParams
Transitory state, while waiting for first packet on a new TCP connection.
Connected
Everything is fine.
Timeout, BrokenPipe, NetworkFailure
Transitory states when connection was lost.
DRBD-0.6 specific
SyncingAll
All blocks of the primary node are being copied to the secondary node.
SyncingQuick
The secondary is updated by copying only the blocks that were modified since the node that is now secondary left the cluster.
SyncPaused
Sync of this device has paused while higher priority (lower sync-group value) device is resyncing.
DRBD-0.7 / DRBD-8; trailing S or T indicates this node is SyncSource or SyncTarget, respectively.
WFBitMap{S,T}
Transitory state when synchronization starts; "dirty"-bits are exchanged.
SyncSource
Synchronization in progress, this node has the good data.
SyncTarget
Synchronization in progress, this node has inconsistent data.
PausedSync{S,T}
see SyncPaused.
SkippedSync{S,T}
you should never see this. "Developers only"
- st:Local/Remote
- state, the respective node's role for this device.
Primary
the active node; may access the device.
Secondary
the passive node; must not access the device; expects mirrored writes from the other node.
Unconfigured
this is not a role, obviously.
- ld
- local data consistency (DRBD-0.7)
- ds
- disk state (DRBD 8)
Diskless
No storage attached, or storage had IO errors previously and got detached.
Attaching
in the process of attaching the local storage
Failed
storage had io errors
Negotiating
storage attached, but it is not yet decided whether it is UpToDate
Inconsistent
storage is Inconsistent (e.g. half way during bitmap based resync)
Outdated
storage is consistent, but not UpToDate
DUnknown
(peer's) storage state is not known
Consistent
storage is consistent, not yet decided whether it is UpToDate or Outdated
UpToDate
storage is good
- ns,nr,dw,dr,...
- statistics counters, in number of blocks (1KB) or number of requests, respectively
ns
network send
nr
network receive
dw
disk write
dr
disk read
al
activity log updates (0.7 and later)
bm
bitmap updates (0.7 and later)
lo
reference count on local device
pe
pending (waiting for ack)
ua
unack'd (still need to send ack)
ap
application requests expecting io-completion
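For scripting, the key:value fields of the status line shown earlier can be picked apart with standard tools. This is a minimal sketch, not a tool that ships with DRBD: the function name is made up, and the sample line is copied from the example above rather than read from a live /proc/drbd.

```shell
# Hypothetical helper: print the value of one "key:value" field
# (e.g. cs, st, ds) from a /proc/drbd status line.
drbd_field() {
    key=$1
    line=$2
    # one token per line, then keep only the token that starts with "key:"
    printf '%s\n' "$line" | tr ' ' '\n' | sed -n "s/^${key}://p" | head -n 1
}

# Sample line from the example above; on a live system it would be
# taken from /proc/drbd instead.
sample=' 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---'

drbd_field cs "$sample"   # Connected
drbd_field st "$sample"   # Primary/Secondary
drbd_field ds "$sample"   # UpToDate/UpToDate
```

The same function works for the counters (ns, nr, dw, dr and so on) when given the corresponding line. On DRBD-0.7 the field to query would be ld rather than ds.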
