I have two Dell R710s and two PS4100 arrays in a group connected via two Dell 7048 switches. Each of the R710s connects to the iSCSI switches via two Broadcom NICs with an MTU of 9000. Both servers are running the EqualLogic Host Integration Tools version 3.5.1 and Server Core 2008 R2 Enterprise.
The PS4100s consist of a 12-drive unit and a 24-drive unit. Load balancing is enabled, and the SANs are both configured for RAID 50.
On the Dell 7048 switches I also have four VLANs: one for iSCSI traffic, one for Live Migration traffic, one for CSV traffic, and another for Heartbeat traffic. Management/Client traffic is not exposed to the SAN switches in any way. I realize the CSV and Heartbeat links are redundant, but I had the extra NICs, so I figured I'd use them. The CSV links in the servers are set with a metric of 900, and the remaining cluster networks are all set with AutoMetrics. I've also enabled the iSCSI feature on the 7048s and turned off STP.
I ran the Cluster Validation wizard and everything passed without issue after creating a 3.0TB LUN as a CSV. I created a test VM, ran some performance benchmarks, and simulated some hardware failures to ensure that live migration was successful. As a result of the tests I decided to migrate a production virtual machine into the cluster to ensure performance was good. The VM in question runs our network file storage, including our Distributed File System. Previously it resided on an R905 with five 7200 RPM drives in a RAID 5. Performance on the R905 was adequate, but we were hoping for an improvement by moving to more modern hardware. The VM has 2.0TB of storage space assigned to it, and is currently using 1.5TB of that.
After migration into the cluster, CPU and memory performance are excellent, but disk performance is downright horrible. A standard disk-to-disk copy of a large file within the virtual machine slows to a crawl. The Windows dialog box indicates that the file(s) are copying at 8-15MB/s, and a 3GB file can take over an hour to copy disk-to-disk. Compared with the original RAID 5, where we achieved 150MB/s on a disk-to-disk copy, we've experienced a significant performance drop.
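As a sanity check on those figures, here is a rough calculation. It is only a sketch and assumes 1 GB = 1024 MB and a constant transfer rate; the function names are made up for illustration. At a steady 8-15MB/s a 3GB copy would be expected to finish in a few minutes, so a copy that takes over an hour suggests the effective rate falls well below what the dialog box shows for at least part of the transfer.

```python
# Rough copy-time arithmetic for the figures quoted above.
# Assumptions: 1 GB = 1024 MB and a constant transfer rate.

def copy_seconds(size_gb: float, rate_mb_s: float) -> float:
    """Seconds needed to copy size_gb at a constant rate_mb_s."""
    return size_gb * 1024 / rate_mb_s


def implied_rate_mb_s(size_gb: float, seconds: float) -> float:
    """Average MB/s implied by copying size_gb in the given time."""
    return size_gb * 1024 / seconds


if __name__ == "__main__":
    for rate in (8, 15, 150):
        minutes = copy_seconds(3, rate) / 60
        print(f"{rate:>3} MB/s: 3 GB in about {minutes:.1f} min")
    # Average rate implied by a 3 GB copy that takes a full hour.
    print(f"3 GB in 60 min: about {implied_rate_mb_s(3, 3600):.2f} MB/s")
```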
As a test I also switched from the Host Integration Tools to the Broadcom iSCSI offload connectors which did not improve performance in a measurable way.
To eliminate the possibility that the production VM has a problem, I created a second Server 2008 R2 virtual machine from scratch to run synthetic benchmarks on using HDTunePro. The synthetic benchmark doesn't seem to be any better. The minimums and maximums are all over the map.
Disk-to-disk file copies suffer the same performance hit on the testing VM.
There must be a configuration problem somewhere, but I have been working on this for two days and I can't see any issues. I'm following the recommended configuration from both Dell, and Microsoft. Unless I've missed something glaringly obvious, I am completely lost!
I forgot to include the Benchmark Images from the Test VM.
Here is the standard HDTunePro benchmark:
Here is the file writing benchmark:
Here is the Random Access Benchmark:
As you can see there is a high amount of variation in the benchmarks. The speeds are far from consistent and the access time fluctuates a great deal.
A few items to check first:
On the switches, ensure you have the latest version of the firmware installed; this should be v220.127.116.11 or later. Try various combinations of the following: disable/enable the iSCSI optimization setting, enable/disable jumbo frames, enable/disable flow control. If I were to choose a starting point, it would be to disable the iSCSI optimization and jumbo frames first, then test.
Broadcom NICs: you didn't specify what model, so please list your model. We may have a setting to tweak (if not done already).
If we don't get anywhere here, and you still have bad performance, I would suggest you open a support case with EqualLogic so one of the performance engineers can help you isolate the cause.
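To keep track of the switch setting combinations suggested above, a small test matrix can help. This is only a sketch: the setting names are labels invented for this example, not switch CLI keywords, and the ordering simply puts the runs with iSCSI optimization and jumbo frames disabled first.

```python
from itertools import product

# Labels for this sketch only; they are not PowerConnect CLI keywords.
SETTINGS = ("iscsi_optimization", "jumbo_frames", "flow_control")


def test_matrix():
    """Return every on/off combination of the three switch settings.

    product() varies the last setting fastest, so the first two entries
    are the ones with iSCSI optimization and jumbo frames both disabled.
    """
    return [
        dict(zip(SETTINGS, values))
        for values in product((False, True), repeat=len(SETTINGS))
    ]


if __name__ == "__main__":
    for run, combo in enumerate(test_matrix(), start=1):
        states = ", ".join(
            f"{name}={'on' if enabled else 'off'}"
            for name, enabled in combo.items()
        )
        print(f"run {run}: {states}")
```

Recording the benchmark result next to each run makes it easier to see which single setting, if any, changes the outcome.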
I also suggest testing outside of Hyper-V. Ideally, use a physical server and run IOmeter or SQLIO to verify that the switches and arrays are working correctly. HDTune isn't optimized for SAN testing; like a drag-and-drop copy, it's single-threaded. That will help determine whether you have a hardware or a Hyper-V issue.
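To illustrate the single-threaded versus multi-threaded point, here is a minimal sketch that reads one file with a configurable number of worker threads and reports throughput. It is not a substitute for IOmeter or SQLIO: it goes through the OS file cache, the function names are invented for this example, and the temporary file in the demo lives on local disk rather than on the SAN.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def _read_range(path, start, length, block_size):
    """Read `length` bytes from `start` in block_size chunks; return bytes read."""
    done = 0
    with open(path, "rb", buffering=0) as f:
        f.seek(start)
        while done < length:
            chunk = f.read(min(block_size, length - done))
            if not chunk:
                break
            done += len(chunk)
    return done


def read_throughput(path, workers, block_size=64 * 1024):
    """Split the file into contiguous ranges, read them in parallel threads,
    and return (total_bytes_read, MB_per_second)."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division, never zero
    ranges = [(s, min(step, size - s)) for s in range(0, size, step)]
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(
            pool.map(lambda r: _read_range(path, r[0], r[1], block_size), ranges)
        )
    elapsed = max(time.perf_counter() - t0, 1e-9)
    return total, total / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Demo on a throwaway 8 MB file; point `path` at a file on the LUN instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for n in (1, 4):
            total, rate = read_throughput(tmp.name, n)
            print(f"{n} worker(s): {total} bytes at {rate:.1f} MB/s")
    finally:
        os.remove(tmp.name)
```

Because of caching, the numbers are only meaningful for files much larger than host memory; the point is the shape of the test, with several outstanding requests instead of one.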
The Broadcom NICs are 5709C NetXtreme IIs. I have disabled RSS, and TCP Chimney as recommended in another thread, but if those features should be re-enabled please let me know.
I've now done the following:
1. Turned off flow control - Same results
2. Turned off iSCSI optimizations - Same results
3. Turned off flow control and iSCSI optimizations - Same results
The switch firmware is at 18.104.22.168, and I will be upgrading it to 22.214.171.124 tonight.
I'm in the midst of trying to find a physical server to connect to the SAN, but this may take a few days as I don't have anything set up at the moment.
I am also in the midst of benchmarking a directly attached LUN with IOMeter and SQLIO. This disk is being passed through to a Hyper-V VM, bypassing the host operating system's storage stack.
I will report back when I have more testing done.
After some more testing with directly attached LUNs, I consulted this document again http://www.equallogic.com/WorkArea/DownloadAsset.aspx?id=10771 and I think I may have resolved my problems.
netsh int tcp set global autotuninglevel=disabled
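For repeating this on the second host, the netsh call can be wrapped in a small script. This is a sketch, not a tested deployment tool: the helper names are invented for this example, it defaults to a dry run that only prints the commands, and actually executing them requires Windows and an elevated prompt. `netsh int tcp show global` is included so the current values can be checked before and after.

```python
import subprocess

# Displays the current global TCP parameters.
SHOW_GLOBAL = ["netsh", "int", "tcp", "show", "global"]


def netsh_tcp_global(**settings):
    """Build a `netsh int tcp set global` command from keyword settings."""
    if not settings:
        raise ValueError("at least one setting is required")
    args = [f"{name}={value}" for name, value in settings.items()]
    return ["netsh", "int", "tcp", "set", "global", *args]


def apply(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    apply(SHOW_GLOBAL)
    apply(netsh_tcp_global(autotuninglevel="disabled"))
    apply(SHOW_GLOBAL)
```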
Before making the switch changes all three LUN types were performing horribly.
After making these changes, disk-to-disk file copies, HDTunePro, and IOMeter synthetic benchmarks seem to be much faster on a Hyper-V VHD in the CSV, a Hyper-V VM with a pass-through LUN, and a server with a directly attached LUN. The directly attached LUNs are benchmarking faster than the Hyper-V VHDs, but I believe this difference is Hyper-V overhead.
Based on these results I am inclined to believe that the switches were misconfigured. I'm assuming I missed a setting mentioned in the document the first time I configured everything. I'm still not certain if all of my settings are 100% correct, though. I want to ensure that this cluster has optimal performance so I can duplicate the settings on the next cluster I build. Should I be disabling anything in the Broadcom Control Suite? Are there any other global interface options that should be disabled via netsh?
Lastly, I also have to do the firmware upgrade on the switches, so that may increase performance as well.
Thanks again for your help.
With current PowerConnect firmware, "iscsi on" is designed to preset the correct settings such as flow control, portfast, etc. It does not set up a VLAN, so for jumbo frame use that needs to be done manually.
Thanks for the update.
I've manually set Jumbo Frames on the interfaces, and manually assigned a group of 24 ports to a VLAN dedicated to iSCSI traffic. I've also forced the subnet in use for iSCSI to use only the VLAN assigned.
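One way to confirm that jumbo frames really pass end to end is a ping with the Don't Fragment bit set and the largest payload that fits in the MTU. The sketch below just does the arithmetic and builds the Windows `ping -f -l` command; the target address is a placeholder from the documentation range, not a real array IP, and the helper names are invented for this example.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def windows_ping_command(target: str, mtu: int) -> list:
    """Windows ping: -f sets Don't Fragment, -l sets the payload size."""
    return ["ping", "-f", "-l", str(max_ping_payload(mtu)), target]


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder; substitute an iSCSI interface on the array.
    for mtu in (1500, 9000):
        print(" ".join(windows_ping_command("192.0.2.10", mtu)))
```

If the 9000 MTU variant fails with a fragmentation message while the 1500 variant succeeds, some hop on the iSCSI path is not passing jumbo frames.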
We also have a document on some recommended settings for the Broadcom NICs that you may want to review. I'll send it to you via the forum send mail feature.
Just some quick thoughts...
Does performing the same test from the Hyper-V host against the same CSV produce the same result?
Running the test on non-CSV - same result?
Are the VM's disks VHDs? Or, are they direct-connect (using VM's iSCSI initiator to connect to Storage)?
If VHDs, have you tried the same 'test' with 'Fixed'-type VHDs?
I'd also be interested in the Broadcom NIC article. We've got NetXtreme II BCM5709s connected to a PS6100XS.
I have sent you the document via the forum "send mail". I'll see if I can get this posted to a public location so that in the future I can just add the link to this forum thread.
Is there any chance I could get that document as well? I have been having some issues with our iSCSI.
All set, I sent you the information via the forum email.
I would be interested if you are willing to perform some tests beginning at zero, which means a lot of work:
- put it ALL in the default VLAN, NO extra VLANs
- disable jumbo frames at initiator AND at switch level, and reset the switch MTU to standard (very important when operating in the default VLAN)
- disable MPIO at initiator level; in fact, don't use HIT or OS MPIO at all (just for this test)
- disable iSCSI optimization on the switches
- enable flow control on the switches
- disable ANY broadcast storm control or ARP spoofing protection on the switches
- do NOT use Broadcom hardware acceleration or a dedicated hardware iSCSI initiator
- just use the software iSCSI initiator
- check your EQL advanced settings to see if somehow DCB was turned on; if yes, turn it off!
- give it LOAD. And as you do, check SANHQ for disk latency and also, at disk level, the queue depth
- also, in SANHQ check for NIC errors during load.
- you don't have by any chance some Intel NICs at hand which you could swap for the Broadcoms (just for this test)?