We used a development server called physics0, a Dell M610 blade with a QLogic HBA running RHEL 5.5. It was connected to a six-node XIV 2814 with 1TB disks.
We started running four dd's:
$ ps axu | grep dd
root 12994 23.9 0.0 63156 580 pts/0 R 11:13 4:22 dd if=/dev/mapper/mpath1p1 of=/dev/null
root 12995 23.8 0.0 63156 572 pts/0 D 11:13 4:20 dd if=/dev/mapper/mpath1p1 of=/dev/null
root 12996 24.0 0.0 63156 572 pts/0 R 11:13 4:23 dd if=/dev/mapper/mpath1p1 of=/dev/null
root 12997 23.7 0.0 63156 576 pts/0 R 11:13 4:19 dd if=/dev/mapper/mpath1p1 of=/dev/null

top looked like this:
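For anyone who wants to reproduce the load, a loop like the following is a minimal sketch of how the four readers could be started. It substitutes /dev/zero for /dev/mapper/mpath1p1 and adds a count so it is safe to run anywhere and terminates on its own; drop the count and point SRC at the multipath device to recreate the sustained test.

```shell
# Launch four parallel sequential readers and wait for them all.
# SRC is /dev/zero here for safety; the test above used
# /dev/mapper/mpath1p1 with no count (run until interrupted).
SRC=/dev/zero
for i in 1 2 3 4; do
    dd if=$SRC of=/dev/null bs=1M count=100 2>/dev/null &
done
wait
echo "all readers done"
```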
top - 11:43:35 up 4 days, 18:45, 3 users, load average: 4.00, 4.00, 3.40
Tasks: 214 total, 3 running, 211 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.4%us, 4.0%sy, 0.0%ni, 80.8%id, 14.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 10.6%sy, 0.0%ni, 53.1%id, 36.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 12.0%sy, 0.0%ni, 49.1%id, 38.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 95.6%id, 3.2%wa, 0.2%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 1.0%sy, 0.0%ni, 95.6%id, 3.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.2%us, 26.4%sy, 0.0%ni, 7.8%id, 49.6%wa, 1.4%hi, 14.6%si, 0.0%st
Cpu6 : 0.0%us, 26.9%sy, 0.0%ni, 7.6%id, 50.3%wa, 1.0%hi, 14.2%si, 0.0%st
Cpu7 : 0.0%us, 3.2%sy, 0.0%ni, 95.6%id, 1.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8177436k total, 8130020k used, 47416k free, 7084504k buffers
Swap: 2096472k total, 220k used, 2096252k free, 36764k cached

PID   USER PR NI  VIRT  RES SHR S %CPU %MEM TIME+   COMMAND
12996 root 18 0   63156 572 476 D 26.0 0.0  7:00.78 dd
12995 root 18 0   63156 572 476 D 22.8 0.0  6:57.09 dd
12997 root 18 0   63156 576 476 D 22.6 0.0  6:55.44 dd
12994 root 18 0   63156 580 476 R 22.2 0.0  6:58.79 dd
363   root 10 -5  0     0   0   R 2.8  0.0  1:17.65 kswapd0
4134  root 18 0   10228 740 628 S 0.2  0.0  0:18.81 hald-addon-stor

First we removed a disk. You can see the dip in IO and the quick recovery. The graph above was made with the xivtop program (control-click IOPS, Latency, and BW to see all three at the same time).
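To put a single number on how IO-bound the box was, the per-CPU %wa fields can be averaged with a little awk. This is a sketch against the capture above (field positions assume the Cpu line format shown; on a live box you would pipe in `top -b -n 1` instead of the heredoc):

```shell
# Average the iowait (%wa) column across the CpuN lines of a top
# snapshot. The heredoc replays the capture from above.
awk -F'[, ]+' '/^Cpu/ { sub(/%wa/, "", $7); wa += $7; n++ }
               END { printf "avg iowait: %.1f%%\n", wa / n }' <<'EOF'
Cpu0 : 0.4%us, 4.0%sy, 0.0%ni, 80.8%id, 14.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 10.6%sy, 0.0%ni, 53.1%id, 36.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 12.0%sy, 0.0%ni, 49.1%id, 38.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 95.6%id, 3.2%wa, 0.2%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 1.0%sy, 0.0%ni, 95.6%id, 3.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.2%us, 26.4%sy, 0.0%ni, 7.8%id, 49.6%wa, 1.4%hi, 14.6%si, 0.0%st
Cpu6 : 0.0%us, 26.9%sy, 0.0%ni, 7.6%id, 50.3%wa, 1.0%hi, 14.2%si, 0.0%st
Cpu7 : 0.0%us, 3.2%sy, 0.0%ni, 95.6%id, 1.2%wa, 0.0%hi, 0.0%si, 0.0%st
EOF
```

For this snapshot it reports an average of about 25% iowait, with the interrupt-handling CPUs (5 and 6) spending roughly half their time waiting.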
We saw a graphical representation of the disk turning red on the XIV GUI and were then able to test it, add it back, and watch it rebuild. As it rebuilt we could get a status update with the xcli:
> > monitor_redist
Type            Initial Capacity to Copy (GB)  Capacity Remaining to Copy (GB)  % Done  Time Started         Estimated Time to Finish  Time Elapsed
Redistribution  26                             17                               33      2010-05-05 11:25:31  0:04:08                   0:02:02

We then had the Cisco MDS9124 disable access to Module 4 port 2 on the XIV. IO stopped completely until the retry interval ran out and the host switched to an alternate path: IOs were queued until they could be passed down a different path. The queue held them, as expected, but we want to lower the retry interval so that IO isn't queued as long. You can see the dip in IO on the graph; it was queued for about 45 seconds. We also observed the event log recording the drop in available paths:
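On the Linux side, how long dm-multipath keeps queuing when every path is down is controlled by no_path_retry in /etc/multipath.conf. As a rough sketch (the values here are illustrative, not the settings we ran), queueing lasts about polling_interval × no_path_retry seconds before IO is failed upward:

```
# /etc/multipath.conf fragment -- illustrative values only
defaults {
    polling_interval 5    # seconds between path health checks
    no_path_retry    6    # keep queueing for ~6 checks (~30s), then fail IO
}
```

Setting no_path_retry to queue_if_no_path instead makes the queueing indefinite, which is what produces the long stalls when no alternate path comes back quickly.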
> > event_list
...
2010-05-05 12:09:27 Informational HOST_MULTIPATH_OK Host 'nfile0' has redundant connections to the system. #paths=3
2010-05-05 12:12:15 Informational HOST_NO_MULTIPATH_ONLY_ONE_PORT Host 'physics0' is connected to the system through only one of its ports. #paths=2
2010-05-05 12:13:13 Informational HOST_MULTIPATH_OK Host 'physics0' has redundant connections to the system. #paths=4

Update: see Qlogic retry interval for how to implement instantaneous failover as opposed to queuing for 45 seconds.
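For reference, the QLogic knob in question is the driver's port-down retry timer, qlport_down_retry on the qla2xxx driver. On RHEL 5 a setting along these lines (the value is illustrative) makes the driver fail a downed port after about a second instead of waiting out the much longer default:

```
# /etc/modprobe.conf -- illustrative; reload qla2xxx (and rebuild
# the initrd so it survives a reboot) for this to take effect.
options qla2xxx qlport_down_retry=1
```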
Unplugging the module was the most satisfying test. I walked right up behind the unit and pulled both power cords out. The dip in IO was longer as a result of losing the module, and IOPS averaged out at 10,000 rather than the original 14,500. The event log showed the unplugged module:
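The lower plateau is in the neighborhood of what simple proportionality would predict for losing one of the six modules (shell arithmetic, numbers from the graphs above):

```shell
# With six modules sharing the load evenly, losing one should leave
# roughly 5/6 of the original 14,500 IOPS.
echo $((14500 * 5 / 6))
```

That estimate comes out a little over 12,000 IOPS; the observed 10,000 is lower still, plausibly because the rebuild was competing with host IO for the surviving disks.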
> > event_list
...
2010-05-05 12:46:14 Critical MODULE_FAILED 1:Module:4 failed.
2010-05-05 12:46:17 Informational DATA_REBUILD_STARTED Rebuild process started because system data is not protected. 1726767077764f the data must be rebuilt.
2010-05-05 12:46:20 Informational HOST_NO_MULTIPATH_ONLY_ONE_MODULE Host 'physics0' is connected to the system through only one interface module. #paths=2
2010-05-05 12:49:49 Informational DATA_REBUILD_COMPLETED Rebuild process completed. System data is now protected.
2010-05-05 13:16:37 Informational DISK_FINISHED_PHASEIN System finished phasing in 1:Disk:4:1.
2010-05-05 13:16:37 Informational DISK_FINISHED_PHASEIN System finished phasing in 1:Disk:4:2.
...
2010-05-05 13:16:38 Informational DISK_FINISHED_PHASEIN System finished phasing in 1:Disk:4:12.

We can also observe what's "not ok" while the module is unplugged:
> > component_list filter=notok
Component ID           Status        Currently Functioning
1:Disk:4:1             Failed        no
1:Disk:4:2             Failed        no
1:Disk:4:3             Failed        no
1:Disk:4:4             Failed        no
1:Disk:4:5             Failed        no
1:Disk:4:6             Failed        no
1:Disk:4:8             Failed        no
1:Disk:4:9             Failed        no
1:Disk:4:10            Failed        no
1:Disk:4:11            Failed        no
1:Disk:4:12            Failed        no
1:Module:4             Initializing  yes
1:Data:4               Failed        no
1:Interface:4          Failed        no
1:Remote:4             Failed        no
1:Interface:6          Ready         yes
1:Remote:6             Ready         yes
1:Ethernet_Cable:4:1   Failed        yes
1:Ethernet_Cable:4:2   Failed        yes
1:Ethernet_Cable:4:7   Failed        yes
1:Ethernet_Cable:4:8   Failed        yes
1:Ethernet_Cable:4:9   Failed        yes
1:Ethernet_Cable:4:10  Failed        yes
1:Disk:4:7             Failed        no

and being added back in:
> > component_list filter=notok
Component ID    Status      Currently Functioning
1:Interface:6   Ready       yes
1:Remote:6      Ready       yes
1:Disk:4:1      Phasing In  yes
1:Disk:4:2      Phasing In  yes
1:Disk:4:3      Phasing In  yes
1:Disk:4:4      Phasing In  yes
1:Disk:4:5      Phasing In  yes
1:Disk:4:6      Phasing In  yes
1:Disk:4:7      Phasing In  yes
1:Disk:4:8      Phasing In  yes
1:Disk:4:9      Phasing In  yes
1:Disk:4:10     Phasing In  yes
1:Disk:4:11     Phasing In  yes
1:Disk:4:12     Phasing In  yes

It's good to know it can take a beating before being sent into production.
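If you want a script to wait for that phase-in to finish rather than eyeballing it, one option is to count how many components component_list filter=notok still reports; when the count hits zero the system is clean. A sketch against a saved capture (in a live loop the heredoc would be replaced by the xcli call itself):

```shell
# Count components still phasing in from a component_list capture.
# Note grep -c exits nonzero when the count is 0, which is handy
# as a loop condition: keep sleeping while it exits 0.
grep -c 'Phasing In' <<'EOF'
1:Disk:4:1      Phasing In  yes
1:Disk:4:2      Phasing In  yes
1:Disk:4:3      Phasing In  yes
EOF
```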