On 12/10/14 02:30 AM, John R Pierce wrote:
> so I've had a drbd replica running for a while of a 16TB raid thats used
> as a backuppc repository.
>
> when I have rebooted the backuppc server, the replica doesn't seem to
> auto-restart til I do it manually, and the backupc /data file system on
> this 16TB LUN doesn't seem to automount, either.
>
> I've rebooted this thing a few times in the 18 months or so its been
> running... not always cleanly...
>
> anyways, I'm started a drbd verify (from the slave) about 10 hours ago,
> it has 15 hours more to run, and so far it's logged...
>
> Oct 11 13:58:26 sg2 kernel: block drbd0: Starting Online Verify from
> sector 3534084704
> Oct 11 14:00:23 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:00:29 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 14:00:35 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967293
> Oct 11 14:00:41 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967292
> Oct 11 14:01:16 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:02:05 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 14:02:11 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 14:02:17 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967293
> Oct 11 14:33:41 sg2 kernel: block drbd0: Out of sync: start=3932979480,
> size=8 (sectors)
> Oct 11 14:34:46 sg2 kernel: block drbd0: Out of sync: start=3946056120,
> size=8 (sectors)
> Oct 11 15:37:07 sg2 kernel: block drbd0: Out of sync: start=4696809024,
> size=8 (sectors)
> Oct 11 17:08:15 sg2 kernel: block drbd0: Out of sync: start=6084949528,
> size=8 (sectors)
> Oct 11 17:30:53 sg2 kernel: block drbd0: Out of sync: start=6567543472,
> size=8 (sectors)
> Oct 11 17:59:04 sg2 kernel: block drbd0: Out of sync: start=7169767896,
> size=8 (sectors)
> Oct 11 20:00:50 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:01:09 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:01:15 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967294
> Oct 11 20:01:29 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 20:29:18 sg2 kernel: block drbd0: Out of sync: start=10362907296,
> size=8 (sectors)
> Oct 11 20:29:54 sg2 kernel: block drbd0: Out of sync: start=10375790488,
> size=8 (sectors)
> Oct 11 21:01:51 sg2 kernel: block drbd0: [drbd0_worker/2197]
> sock_sendmsg time expired, ko = 4294967295
> Oct 11 21:42:15 sg2 kernel: block drbd0: Out of sync: start=11907921096,
> size=8 (sectors)
> Oct 11 21:43:38 sg2 kernel: block drbd0: Out of sync: start=11937086248,
> size=8 (sectors)
> Oct 11 21:44:00 sg2 kernel: block drbd0: Out of sync: start=11944705032,
> size=8 (sectors)
> Oct 11 21:49:26 sg2 kernel: block drbd0: Out of sync: start=12062270432,
> size=8 (sectors)
> Oct 11 22:07:10 sg2 kernel: block drbd0: Out of sync: start=12440235128,
> size=8 (sectors)
> Oct 11 22:58:54 sg2 kernel: block drbd0: Out of sync: start=13548501984,
> size=8 (sectors)
> Oct 11 23:23:17 sg2 kernel: block drbd0: Out of sync: start=14072873320,
> size=8 (sectors)
> $ date
> Sat Oct 11 23:28:11 PDT 2014
>
> its 35% done at this point... 15 4K blocks out wrong of 1/3rd of 16TB
> isn't a lot, but its still more than I like to see.
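
As a rough cross-check of the figures above: the log shows 15 "Out of sync" ranges of 8 sectors each, and at 512 bytes per sector that comes to 60 KiB, which lines up with the oos:60 field in the /proc/drbd output (oos is reported in KiB). A minimal Python sketch of that tally follows. The function name tally_out_of_sync and the sample lines are only illustrative, and it assumes the log lines are read unwrapped, as they would appear in the syslog file rather than in this mail.

```python
import re

# Matches the kernel message DRBD logs when online verify finds a mismatch.
# \s* after the comma tolerates variable spacing between the two fields.
OOS_RE = re.compile(r"Out of sync: start=(\d+),\s*size=(\d+) \(sectors\)")

SECTOR_BYTES = 512  # start/size are given in 512-byte sectors


def tally_out_of_sync(lines):
    """Return (number of out-of-sync ranges, total size in KiB)."""
    sizes = []
    for line in lines:
        match = OOS_RE.search(line)
        if match:
            sizes.append(int(match.group(2)))
    total_kib = sum(sizes) * SECTOR_BYTES // 1024
    return len(sizes), total_kib


# Illustrative input: two mismatch lines and one unrelated timeout line.
sample = [
    "Oct 11 14:33:41 sg2 kernel: block drbd0: Out of sync: start=3932979480, size=8 (sectors)",
    "Oct 11 20:00:50 sg2 kernel: block drbd0: [drbd0_worker/2197] sock_sendmsg time expired, ko = 4294967295",
    "Oct 11 23:23:17 sg2 kernel: block drbd0: Out of sync: start=14072873320, size=8 (sectors)",
]

if __name__ == "__main__":
    count, kib = tally_out_of_sync(sample)
    print(f"{count} ranges, {kib} KiB out of sync")
```

Fed the real log instead of the sample, the KiB total can be compared against oos in /proc/drbd once the verify has finished.
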
>
> $ cat /proc/drbd
> version: 8.3.15 (api:88/proto:86-97)
> GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by
> phil at Build64R6, 2012-12-20 20:09:51
> 0: cs:VerifyS ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
> ns:0 nr:105707 dw:187685496 dr:654444832 al:0 bm:1 lo:107 pe:2104
> ua:435 ap:0 ep:1 wo:f oos:60
> [=====>..............] verified: 34.6% (9846140/15051076)M
> finish: 14:55:27 speed: 187,648 (155,708) want: 204,800 K/sec
>
> really, if I let this complete, then disconnect/reconnect the replica,
> it will repair these glitches ? I'm gathering I shoudl schedule these
> verifies weekly or something.

That the backing device of one node fell out of sync is a cause for concern. A weekly scan might be a bit much, but monthly or so isn't unreasonable. Of course, as you're seeing here, it's a lengthy process: it consumes a non-trivial amount of bandwidth and adds a fair load to the disks.

How long was it in production before this verify?

I can't speak to backuppc, but I am curious how you're managing the resources. Are you using cman + rgmanager or pacemaker?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?