DRBD: ensure peers are UpToDate for dual-primary
Author:    Apollon Oikonomopoulos <apoikos@gmail.com>
           Tue, 5 Nov 2013 14:30:45 +0000 (16:30 +0200)
Committer: Michele Tartara <mtartara@google.com>
           Wed, 6 Nov 2013 10:25:21 +0000 (10:25 +0000)
DrbdAttachNet supports both normal primary/secondary node operation and
(during live migration) dual-primary operation. When resources are newly
attached, we poll until we find all of them in the Connected or syncing state.

Although aggressive, this check is sufficient for primary/secondary operation,
because the primary/secondary roles are not changed from within DrbdAttachNet.
However, in the dual-primary ("multimaster") case, both peers are subsequently
promoted to the primary role. If - for unspecified reasons - both disks are not
UpToDate, a resync may be triggered after both peers have switched to primary,
causing the resource to disconnect:

  kernel: [1465514.164009] block drbd2: I shall become SyncTarget, but I am
  kernel: [1465514.171562] block drbd2: ASSERT( os.conn == C_WF_REPORT_PARAMS )
    in /build/linux-rrsxby/linux-3.2.51/drivers/block/drbd/drbd_receiver.c:3245

This appears to be extremely racy and is possibly triggered by underlying
network issues (e.g. high latency), but it has been observed in the wild. By
logging the DRBD resource state on the old secondary, we managed to see a
resource getting promoted to primary while it was:

  WFSyncUUID Secondary/Primary Outdated/UpToDate

We fix this by explicitly waiting for the "Connected" cstate and
"UpToDate/UpToDate" disks, as advised in [1]:

  "For this purpose and scenario,
   you only want to promote once you are Connected UpToDate/UpToDate."

[1] http://lists.linbit.com/pipermail/drbd-user/2013-July/020173.html
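The promotion-safety condition described above can be sketched as a check on a
single /proc/drbd status line. This is an illustrative assumption, not Ganeti
code: `safe_to_promote` and the regular expression are hypothetical, but the
cs:/ds: field layout matches what /proc/drbd reports:

```python
import re

# Hypothetical minimal parser: extract the cs: (connection state) and
# ds: (local/peer disk state) fields from a /proc/drbd status line and
# apply the check advised in [1]: only promote once the resource is
# Connected and UpToDate/UpToDate on both sides.
_STATUS_RE = re.compile(r"cs:(\S+) .*ds:(\S+)/(\S+)")

def safe_to_promote(proc_line):
    """Return True only if the resource may be promoted on both peers."""
    match = _STATUS_RE.search(proc_line)
    if not match:
        return False
    cstate, ldisk, rdisk = match.groups()
    return (cstate == "Connected" and
            ldisk == "UpToDate" and
            rdisk == "UpToDate")
```

Under this check, the WFSyncUUID Outdated/UpToDate state quoted above would
correctly block promotion, whereas a plain is_connected-or-syncing test lets
it through.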

Signed-off-by: Apollon Oikonomopoulos <apoikos@gmail.com>
Signed-off-by: Michele Tartara <mtartara@google.com>
Reviewed-by: Michele Tartara <mtartara@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>


index a75432b..9e12639 100644
@@ -3622,8 +3622,20 @@ def DrbdAttachNet(nodes_ip, disks, instance_name, multimaster):
     for rd in bdevs:
       stats = rd.GetProcStatus()
-      all_connected = (all_connected and
-                       (stats.is_connected or stats.is_in_resync))
+      if multimaster:
+        # In the multimaster case we have to wait explicitly until
+        # the resource is Connected and UpToDate/UpToDate, because
+        # we promote *both nodes* to primary directly afterwards.
+        # Being in resync is not enough, since there is a race during which we
+        # may promote a node with an Outdated disk to primary, effectively
+        # tearing down the connection.
+        all_connected = (all_connected and
+                         stats.is_connected and
+                         stats.is_disk_uptodate and
+                         stats.peer_disk_uptodate)
+      else:
+        all_connected = (all_connected and
+                         (stats.is_connected or stats.is_in_resync))
       if stats.is_standalone:
         # peer had different config info and this node became
index 7623869..7226f1f 100644
@@ -1050,7 +1050,7 @@ class LogicalVolume(BlockDev):
     _ThrowError("Can't grow LV %s: %s", self.dev_path, result.output)
-class DRBD8Status(object):
+class DRBD8Status(object): # pylint: disable=R0902
   """A DRBD status representation class.
   Note that this doesn't support unconfigured devices (cs:Unconfigured).
@@ -1135,6 +1135,7 @@ class DRBD8Status(object):
     self.is_diskless = self.ldisk == self.DS_DISKLESS
     self.is_disk_uptodate = self.ldisk == self.DS_UPTODATE
+    self.peer_disk_uptodate = self.rdisk == self.DS_UPTODATE
     self.is_in_resync = self.cstatus in self.CSET_SYNC
     self.is_in_use = self.cstatus != self.CS_UNCONFIGURED
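The resulting per-device wait condition can be sketched in isolation as
follows; `DevStatus` and `all_devices_ready` are hypothetical stand-ins for
DRBD8Status and the polling loop in DrbdAttachNet, using the boolean
attributes computed above:

```python
from collections import namedtuple

# Stand-in for the status flags that DRBD8Status derives from /proc/drbd.
DevStatus = namedtuple(
    "DevStatus",
    ["is_connected", "is_in_resync", "is_disk_uptodate", "peer_disk_uptodate"])

def all_devices_ready(statuses, multimaster):
    """Mirror the revised check: strict Connected+UpToDate for dual-primary."""
    ok = True
    for stats in statuses:
        if multimaster:
            # Both nodes become primary directly afterwards, so being in
            # resync is not enough: an Outdated disk must not be promoted.
            ok = ok and (stats.is_connected and
                         stats.is_disk_uptodate and
                         stats.peer_disk_uptodate)
        else:
            # Primary/secondary roles are not changed here, so a resync
            # in progress is acceptable.
            ok = ok and (stats.is_connected or stats.is_in_resync)
    return ok
```

A device that is still syncing therefore satisfies the check in
primary/secondary mode but blocks the multimaster path until the resync
completes.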