You know iSCSI and VMware… I bet not… Part 1

One thing I have noticed is that in every single iSCSI installation I have seen in VMware, crucial items were missed in the setup, and they always rear their ugly heads at the most inopportune time. I have pored over numerous blogs and best practices, and it all becomes very dizzying. Heck, throw in the fact that EVERY single manufacturer has its own best practices and it is no wonder people can’t get it right. I will attempt to document it in a step-by-step process and also highlight key things to keep in mind when building out iSCSI networks.

First and foremost, before you even start setting up iSCSI you really need to work out what kind of networking you will need. Selecting the right network switches is CRUCIAL to the overall success of an iSCSI implementation. Want to go with SMB switches? That is fine, but beware: if you are pushing any kind of serious I/O through them, they are probably going to drop packets left and right. So let’s take a closer look at Cisco 2960 switches. For argument’s sake we will use a 48-port Cisco 2960.

Each highlighted portion of the switch above maps to a specific port ASIC.

PuTTY into your switch and go into enable mode. Once there, type in: sh platform port-asic version (the output shown is from a 3750, just to give you a visual).

This will show you how many port ASICs you have in your switch. Typically on a 48-port Cisco switch the first 24 ports go to one port ASIC (red box above), the second set of 24 ports goes to another port ASIC (green box above), and the uplink ports get their own dedicated port ASIC (yellow box above). Another way to tell is to run this command from your PuTTY session: show platform pm if-numbers. Pay special attention to the column labeled Port, as it correlates to the ASIC the corresponding interface belongs to.

So you’re probably wondering, OK, I thought this was an iSCSI post? Trust me, it is, but we really need to make sure our foundation is solid before we build things out. Two final things when selecting your switch: make sure it can handle bursty traffic, which partly comes down to having ample port buffers, and find out whether those port buffers are shared or dedicated, because that makes a huge difference. For instance, the 2960s have a 384K port buffer that is shared between 12 ports. So if you are going to be pushing any sort of high traffic and virtualization through a 2960, just be cognizant that you are walking a fine line depending on how many hosts you have and what the VMs are doing. One way to combat this is not to load up one port ASIC with all the iSCSI traffic. Ideally you would also have at least 2 dedicated switches for iSCSI traffic only.
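To put that shared buffer in perspective, here is a rough back-of-the-napkin sketch in Python. It takes the 384K and 12-port figures from above and assumes an even split among busy ports, which is a simplification, since a real switch allocates its buffers with its own logic. The function names and the 1500- and 9000-byte frame sizes are illustrative assumptions, not anything from Cisco.

```python
# Figures from the text: a 384 KB buffer pool shared by 12 ports.
SHARED_BUFFER_KB = 384
PORTS_PER_POOL = 12


def fair_share_kb(active_ports: int) -> float:
    """Buffer per port if the pool were split evenly among the busy ports.

    Assumption: an even split. Real hardware does not divide buffers this neatly.
    """
    if not 1 <= active_ports <= PORTS_PER_POOL:
        raise ValueError("active_ports must be between 1 and 12")
    return SHARED_BUFFER_KB / active_ports


def frames_that_fit(buffer_kb: float, frame_bytes: int) -> int:
    """Whole frames of a given size that fit in a buffer (1 KB = 1024 bytes)."""
    return int(buffer_kb * 1024) // frame_bytes


if __name__ == "__main__":
    share = fair_share_kb(PORTS_PER_POOL)
    print(f"Even share with all 12 ports busy: {share:.0f} KB per port")
    print(f"Standard 1500-byte frames per share: {frames_that_fit(share, 1500)}")
    print(f"Jumbo 9000-byte frames per share: {frames_that_fit(share, 9000)}")
```

With all 12 ports busy that works out to 32 KB each, or roughly 21 standard-size frames, which is not much headroom when a burst of storage traffic shows up.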

So let’s build a realistic scenario: 3 ESXi hosts, a VNXe, a pair of dedicated 2960s for iSCSI and another pair of 2960s for production traffic.

Hosts: Host1, Host2, Host3

iSCSI switches: Infra-ISCSI1, Infra-ISCSI2

Prod switches: Infra-Prod1, Infra-Prod2

Here is what the cabling would look like for the three hosts:

Host1: 1 Ethernet port into Infra-ISCSI1 Gi1/1

Host1: 1 Ethernet port into Infra-ISCSI2 Gi1/1

Host1: 2 ports into Infra-Prod1 Gi1/1 and Gi1/2

Host1: 2 ports into Infra-Prod2 Gi1/1 and Gi1/2

Host2: 1 Ethernet port into Infra-ISCSI1 Gi1/25

Host2: 1 Ethernet port into Infra-ISCSI2 Gi1/25

Host2: 2 ports into Infra-Prod1 Gi1/25 and Gi1/26

Host2: 2 ports into Infra-Prod2 Gi1/25 and Gi1/26

Host3: 1 Ethernet port into Infra-ISCSI1 Gi1/2

Host3: 1 Ethernet port into Infra-ISCSI2 Gi1/2

Host3: 2 ports into Infra-Prod1 Gi1/3 and Gi1/4

Host3: 2 ports into Infra-Prod2 Gi1/3 and Gi1/4

VNXe

SPA Mgmt: 1 port to Infra-Prod1 Gi1/48

SPA iSCSI: 1 port to Infra-ISCSI1 Gi1/48 and 1 port to Infra-ISCSI2 Gi1/24

SPB Mgmt: 1 port to Infra-Prod2 Gi1/48

SPB iSCSI: 1 port to Infra-ISCSI1 Gi1/24 and 1 port to Infra-ISCSI2 Gi1/48

The above setup ensures that the ports are spread out evenly across the port ASICs and sets you up for expansion with a blueprint of sorts already laid out. You might be looking at the hosts and thinking, that is only 6 ports. For ESXi that would be 2 for Mgmt/vMotion, 2 for production traffic and 2 for dedicated iSCSI traffic. If you happen to have 8 NICs per server, I would move vMotion to its own NICs. That is a lot of copper though: with 4 ESXi hosts that is 32 ports, so you can eat up a ton of ports pretty quickly.
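If you want to sanity-check a layout like this before you start plugging in cables, a small script can do the counting. The sketch below is only an illustration: it encodes the iSCSI cabling from the plan above and assumes, as described earlier, that ports 1-24 sit on one port ASIC and ports 25-48 on another. The function and variable names are made up for this example, and you should confirm the real port-to-ASIC mapping on your own switch.

```python
from collections import Counter


def port_asic(port_number: int) -> int:
    """Map an access port to a port ASIC index.

    Assumption from the text: ports 1-24 share one ASIC, ports 25-48 another.
    Uplink ports are not modeled here.
    """
    if not 1 <= port_number <= 48:
        raise ValueError(f"not an access port: {port_number}")
    return 0 if port_number <= 24 else 1


# iSCSI cabling from the plan above: (device, switch, access port number).
ISCSI_CABLING = [
    ("Host1", "Infra-ISCSI1", 1),
    ("Host1", "Infra-ISCSI2", 1),
    ("Host2", "Infra-ISCSI1", 25),
    ("Host2", "Infra-ISCSI2", 25),
    ("Host3", "Infra-ISCSI1", 2),
    ("Host3", "Infra-ISCSI2", 2),
    ("SPA", "Infra-ISCSI1", 48),
    ("SPA", "Infra-ISCSI2", 24),
    ("SPB", "Infra-ISCSI1", 24),
    ("SPB", "Infra-ISCSI2", 48),
]


def asic_load(cabling):
    """Count cabled ports per (switch, ASIC index)."""
    return Counter((switch, port_asic(port)) for _, switch, port in cabling)


def find_port_conflicts(cabling):
    """Return (switch, port) pairs that more than one cable claims."""
    used = Counter((switch, port) for _, switch, port in cabling)
    return sorted(key for key, count in used.items() if count > 1)


if __name__ == "__main__":
    for (switch, asic), count in sorted(asic_load(ISCSI_CABLING).items()):
        print(f"{switch} ASIC {asic}: {count} port(s)")
    print("conflicts:", find_port_conflicts(ISCSI_CABLING))
```

The same two checks extend naturally to the production switches: add their rows to the list and look for any switch where one ASIC carries most of the load or a port is claimed twice.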

I used 2960s for this demonstration, but you may really want to look at switches with much larger port buffers, like the 3750 or the 4850. If you want to get into the Nexus line from Cisco, the 3Ks are where you would enter. So now that we have our networking set up, we can move on to Part 2… I mean, we have to get this right so we can use our new virtual VPLEX, right?

Diagram of Above Port layout

Updating Your vSphere Host is Simple, What About The Underlying Foundation First?

So you see the post from VMware (or insert any other OS vendor you want here) and you say, man, I have to have that and I need it now! So you go out, download it and upgrade your hosts to the latest and greatest. You are the cool kid on the block, right? Wrong. What you have just done is what every other Joe Blow on the planet would do: haphazardly go into something without regard to whether the foundation is ready for the new OS. Have you considered the implications for all the associated applications you are running? SRM, RecoverPoint, Avamar, NetWorker, Backup Exec, Trend Micro Deep Security, Symantec, heck, even Microsoft: you just do not know what the upgrade is going to break. That said, this post is not going to be about all the associated applications you could break by doing a code upgrade haphazardly. No, it starts back at the most basic thing: firmware.

Firmware, you say… Jason, you have lost your ever-loving mind, haven’t you? Everyone knows you do not mess with firmware once it’s running. Don’t go poking the bear, are you crazy? Well, I will admit that I am slightly off-kilter. I mean, heck, I work in the technology field; you have to be a little off-kilter to push the envelope and try new and different ways of doing things. Over the past few weeks I have seen this scenario play out at a number of customer sites we deal with, and also with those out-of-the-blue customers who are like, oh snap, this is messed up, who do I call? I am going to keep this post as vendor-neutral as possible, but I will disclose that I work with UCS a good amount and I have worked with Dell, HP and IBM chassis systems too.

So let’s say you want to go to ESXi 5.5. What in the world do you need to consider before you even get to installing or upgrading the first host?

1.) Does your SAN need to be upgraded to support the operating system? (I put this first because it is probably going to take the most coordination.)

2.) Are there any caveats for the routing and switching involved between your hosts and/or SAN fabric?

· For instance, are you using the Cisco Nexus 1KV? Chances are you’re going to need to upgrade it.

· How about those SAN Fibre Channel switches? I bet it has been forever since you upgraded them.

· The core switching involved: I am sure it’s on the latest stable code release that the manufacturer recommends, right?

3.) Moving up the stack into the chassis, does the chassis firmware need to be upgraded?

· If you have UCS, what version of UCSM are you running?

· Dell: what version of CMC are you running?

· HP: Onboard Administrator

· IBM: Chassis Management Module

4.) Are there any updates to the inner components of the chassis, like the firmware between the chassis and the slots in a Dell chassis, for instance? How about upgrading the IOMs after you upgrade UCSM? You get the idea…

5.) Now here is where it really starts getting squirrelly. Let’s say you are not using something like UCS, where IOMs connect the chassis to the Fabric Interconnects. Let’s say you are using a full-fledged switch in your chassis. Then you are really going to want to look at upgrading the code running on that switch, because you are upgrading all the components around it and lord only knows what effect that can have. This is where redundancy comes into play big time! Hopefully you did not skimp on switches and NICs in your servers and you have true redundancy; otherwise you’re going to be scheduling some downtime for this upgrade.

6.) Just to clarify, step 5 addresses route/switch inside the chassis. This step covers Fibre Channel switching built into your chassis: the same goes for the code running on your FC switches. Hopefully you are not ISL’d between fabrics, because then you have to take the code of the switches you are ISL’d with into account, as they may need to be upgraded as well.

ALL THIS JUST TO UPGRADE YOUR HOST TO A DIFFERENT OS!! Yep, and wait, there is more…

7.) Depending on the manufacturer, you might be able to schedule the firmware updates for each host inside the chassis. I can remember back in the early M1000E days I had a bunch of M600s, and when the M620s came out it was like, ohh yay, new stuff. Then I plugged one into the chassis and found out, oh, I need to do all the above steps before I can even get this to work. Mannnnn… The firmware updates for the host might include the Remote Access Controller, BIOS, NIC firmware, controller firmware, and HD firmware if you have HDs in your server. Check out your manufacturer’s site; they are all really good about showing you exactly what needs to be upgraded to what level to support a specific OS.

8.) Does your multipathing software work with the newer version of the OS? E.g. PowerPath (EMC), Dynamic Multipathing (Symantec)… fill in your favorite here.

9.) Do you use a hypervisor antivirus product, like Trend Micro for instance? Does it work with the newer version of the OS?

10.) How about your backup software? If you are doing guest-level backups, does the agent work with the OS version? If you are doing image-level backups, does your backup software support the new version?
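One way to keep yourself honest with a list like the one above is to treat it as a gate: nothing gets upgraded until every layer has been checked. Here is a minimal Python sketch of that idea. The class, the function names and the shortened wording of each item are my own illustration of the ten points, not a tool from any vendor.

```python
from dataclasses import dataclass


@dataclass
class Check:
    layer: str              # which part of the stack the question is about
    question: str           # what has to be confirmed before upgrading
    verified: bool = False  # flip to True once you have confirmed it


def build_checklist() -> list:
    """Items in the same bottom-up order as the numbered list above."""
    return [
        Check("SAN", "Array code supports the target OS"),
        Check("Routing/switching", "Nexus 1KV, FC switch and core switch code checked"),
        Check("Chassis", "Chassis management firmware (UCSM, CMC, OA, CMM) checked"),
        Check("Chassis internals", "Inner component firmware such as IOMs checked"),
        Check("Chassis Ethernet switch", "In-chassis switch code checked"),
        Check("Chassis FC switch", "In-chassis FC switch code and ISL partners checked"),
        Check("Host firmware", "Remote access controller, BIOS, NIC, controller, HD firmware checked"),
        Check("Multipathing", "Multipathing software supports the target OS"),
        Check("Antivirus", "Hypervisor antivirus supports the target OS"),
        Check("Backup", "Backup agent or image-level backup supports the target OS"),
    ]


def blockers(checklist):
    """Everything that has not been verified yet, in stack order."""
    return [check for check in checklist if not check.verified]


def ready_to_upgrade(checklist) -> bool:
    """True only when no unverified items remain."""
    return not blockers(checklist)


if __name__ == "__main__":
    checks = build_checklist()
    checks[0].verified = True  # e.g. the SAN has been confirmed
    for check in blockers(checks):
        print(f"BLOCKER [{check.layer}]: {check.question}")
    print("Ready to upgrade:", ready_to_upgrade(checks))
```

Keeping the items in stack order means the first blocker printed is also the lowest layer you still have to deal with.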

This list is by no means all-encompassing; it is merely a shell to get you thinking about some things that really need to be considered and planned out before just loading up something new and throwing it into production. I have always been cool with testing stuff in a lab if one is available and I have the time, but I am a realist: some people do not have the time or the equipment to lab things out. In that case I always go with: if I set this up it could be production tomorrow, so I might as well do it right the first time. So really what we are talking about here is the foundation that runs your environment, and making sure you lay a solid foundation so everything else can excel!

One of my co-workers (Alex Medina) put this eloquently in a workshop we were doing one day. He asked everyone to close their eyes and picture their dream house. He then told them to open their eyes, and asked who had imagined their foundation being built. “Most of you envisioned the details and the wish lists, but I am sure no one pictured the foundation being built the right way, and without a good foundation, nothing you envisioned would be possible.” This is so true: with a poor foundation, no matter what you do afterward, everything else is going to suffer. Take into account that you need to check your foundation (infrastructure) to make sure it supports the rest of the house (OS, applications).

Native Multi-Pathing Settings for ESXi 5.0 and EMC SANs

The other week I was doing a root-cause analysis for a client that had a Data Unavailability outage during a SAN upgrade. They had an EMC SAN, ESXi hosts running 5.0 and also PernixData in the environment. As I put the puzzle pieces together, so did the manufacturer of each product, and no surprise, they all came back and said everything looks good on our side, we see where X happened, so it has to be Y that did not handle things correctly. Not slamming anyone here, it is just the nature of the beast: each component ties into the others, and each one has its own variables and best practices.

I literally started by digging through the VNX SP logs to see the sequence of events on both Storage Processors, and I saw where the disconnects happened on A and where they happened on B. The timing was really close, but there was still ample time between the SP failovers. Next I looked at PernixData. Because I had not worked with Pernix in detail before, I had to do a little research to figure out the inner workings; it helped that I had their RCA in hand in pretty short order, spelling out what they saw from their side. Short story here: while Pernix essentially inserts its own pathing into VMware, it literally takes whatever was set on the host and mirrors that setting in Pernix. For instance, if you have NMP set to RR, then Pernix is going to show PRNX_PSP_RR. My co-worker has an in-depth blog on Pernix here. Anyhow, I was able to quickly eliminate Pernix as the cause of the issue. That left me with VMware and EMC to figure out what happened. Before I even went to VMware to see what they saw, I noticed that Pernix’s RCA showed there was an APD (All Paths Down). How could that be? I had just looked through the EMC logs and I could see where EMC failed over from SPA to SPB and I could see the paths come up. I was now really puzzled and even more intrigued. I could hear this little voice going, hey, remember there was that deal with 5.0 and EMC SANs a few years back, something about pathing, what the heck was it?

So I started looking for EMC’s best practices for VMware, and I also consulted VMware’s best practices to see what they recommended for ESXi 5.0. Hmm, now here is the funny thing: both sets of best practices mentioned ESXi 4.1, 5.1 and 5.5 with NMP set to Round Robin (RR). Uhh, OK, where is 5.0? I did a lot more digging and reading, then started scouring the internet for things I may have missed, and then I remembered Chad Sakac made a blog post about it, so off to Chad’s blog I went here. That confirmed what I had swirling around in my head: back on the CX4 and VNX lines, if you used ESXi 5.0 the recommended setting was NMP set to Fixed. So I went back to EMC and asked their support to confirm whether this was still true on the VNX2 line, and they confirmed that yes, Fixed is still recommended for ESXi 5.0. I saw a chart once that made it idiot-proof, so I tried to recreate it here.

ESXi Version          | VNX Software Revision | Recommended NMP PSP Selection
ESXi 4.1, 5.1 and 5.5 | OE 31 or above        | Round Robin
ESXi 4.0 and 5.0      | OE 31 or above        | Fixed

I will say that this chart only shows VNX, but the same holds true for CX4 and VNX2 as well. One other thing you must ensure is that ALUA is enabled on the array. This is done by making sure that the initiator records for the particular host are set to Clariion and Failover Mode 4 (FOM4) (ALUA). There is a great deal more information on setting your array connections to ALUA here.
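If you would rather not rely on memory for that chart, it fits in a small lookup. The Python sketch below encodes the recommendations exactly as listed above and builds the matching command line for a device. Treat it as an illustration under a few assumptions: the function names and the device ID are placeholders, the VMW_PSP_FIXED and VMW_PSP_RR names are VMware's identifiers for Fixed and Round Robin, and the `esxcli storage nmp device set` form shown is the ESXi 5.x syntax (4.x hosts use a different esxcli syntax, so check your version before running anything).

```python
# Recommendation chart from the post (VNX OE 31 or above).
RECOMMENDED_PSP = {
    "4.0": "VMW_PSP_FIXED",
    "4.1": "VMW_PSP_RR",
    "5.0": "VMW_PSP_FIXED",
    "5.1": "VMW_PSP_RR",
    "5.5": "VMW_PSP_RR",
}


def recommended_psp(esxi_version: str) -> str:
    """Look up the PSP for a version string such as '5.0' or '5.5.0'."""
    major_minor = ".".join(esxi_version.split(".")[:2])
    if major_minor not in RECOMMENDED_PSP:
        raise ValueError(f"no recommendation recorded for ESXi {esxi_version}")
    return RECOMMENDED_PSP[major_minor]


def set_psp_command(device_id: str, esxi_version: str) -> str:
    """Build the per-device esxcli command (ESXi 5.x syntax only).

    device_id is a placeholder for the device identifier on your host.
    """
    if not esxi_version.startswith("5."):
        raise ValueError("the command form shown here is for ESXi 5.x only")
    psp = recommended_psp(esxi_version)
    return f"esxcli storage nmp device set --device {device_id} --psp {psp}"


if __name__ == "__main__":
    # "naa.EXAMPLE" is a made-up device ID for illustration.
    print(set_psp_command("naa.EXAMPLE", "5.0"))
```

The script only prints the command; running it against a host is still a change you should review and schedule like any other.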

To ESRS or Not to ESRS, That Is the Question

So a few weeks ago some co-workers and I were discussing the benefits of the VNXe now being able to connect through EMC Secure Remote Support (ESRS). There was the normal talk about whether it is really secure or they just say that. Well, sit back and let me delve into why I think all customers with an EMC maintenance contract should take advantage of this service. Heck, you have already paid for it!

First, let me explain what ESRS is for those of you who do not know. ESRS runs on a Windows server (physical or virtual) that is set up in your environment and then has the ESRS software loaded on it, typically by an EMC CE. The reason it takes someone from EMC to set this application up on the dedicated Windows server is that it requires an RSA token to be able to talk to EMC’s network. So, moreover, why ESRS? Well, it is designed to identify and resolve potential issues before they impact your operations, and it does this around the clock. ESRS ensures the fastest response to potential issues should they arise; it also helps with the escalation process and ultimately lowers the resolution time of any issue you might have with your systems.

Next you might say, well, I do not want just anyone coming into my system without my knowledge; we have a lot of private information. EMC has improved drastically over version 1 of ESRS while keeping some of the same policies and rules in place. EMC provides multiple layers of security with ESRS. First, it uses FIPS 140-2 validated cryptography. It also uses 256-bit AES encryption for its notifications to EMC. Communication between your site and EMC is bilaterally authenticated using RSA®. Only authorized support personnel verified with two-factor authentication can download the digital certificates necessary to view notifications from your site. Next, you have certain policies you can use to control whether EMC remote personnel are allowed into your ESRS system. Those are Always Allow, Never Allow and Ask for Approval.

  • Always Allow - Allows authorized EMC personnel to establish remote connections without having to wait for authorization
  • Never Allow - Lets you deny EMC personnel access to specific systems
  • Ask for Approval - Use this when you want to be asked to grant permission for remote access
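To make the three policies concrete, here is a tiny Python model of the decision they describe. This is purely an illustration of the logic in the list above; the names are made up, and it is not ESRS code or an ESRS API.

```python
from enum import Enum


class AccessPolicy(Enum):
    ALWAYS_ALLOW = "Always Allow"
    NEVER_ALLOW = "Never Allow"
    ASK_FOR_APPROVAL = "Ask for Approval"


def may_connect(policy: AccessPolicy, customer_approved: bool = False) -> bool:
    """Decide whether a remote session may be opened under a given policy.

    customer_approved only matters for ASK_FOR_APPROVAL; it stands in for
    the customer answering the approval request.
    """
    if policy is AccessPolicy.ALWAYS_ALLOW:
        return True
    if policy is AccessPolicy.NEVER_ALLOW:
        return False
    return customer_approved


if __name__ == "__main__":
    for policy in AccessPolicy:
        print(policy.value, "->", may_connect(policy, customer_approved=False))
```

The point of the model is that Never Allow wins regardless of any approval, and Ask for Approval stays closed until someone on your side says yes.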

So you still wonder, is that enough? Nope, they also have auditing, so you know who connected to your system, what they did and when. I dare say that all of the above is a lot more in the way of security policies and practices than most businesses have in place these days.

So how many ESRS systems do you need for your company? Typically one will suffice. Remember, you are security conscious, so why have multiple ESRS systems? OK, OK… you say, well, what the heck can I monitor with this ESRS?

  • Atmos
  • Avamar
  • Brocade B-Series Switches
  • Celerra
  • Centera
  • Cisco Switches
  • Clariion  CX, CX3, CX4, and AX4-5
  • Connectrix
  • Disk Library DLM and EDL
  • Data Domain
  • Greenplum
  • Invista
  • RecoverPoint
  • Symmetrix 8000, DMX, DMX-3, DMX-4 and VMAX
  • VNX
  • VNXe
  • VPLEX

This is pretty much the list I have found from EMC, but if you do not see a system on here I am pretty sure they will be working on getting it added in upcoming code releases. So why do I really think this should be a no-brainer? Simply put, most people get tied up doing their daily jobs. They get an alert for a failed drive or replication breaking and they are like, I will do that right after this… Then you get caught up doing this, that and the other, and all of a sudden you have another disk failure, this time on the same RAID 5 LUN. Well, that is no good. Had you had ESRS, EMC would already have been able to diagnose the first failure and deliver a new drive, so you would never have experienced a hiccup like a double disk failure on a RAID 5 LUN. Oh, and how about code upgrades? Don’t you love sitting and watching a WebEx while a technician does his thing, when you have had more fun watching paint dry? With ESRS, EMC has the ability to remote in and complete code upgrades upon request.

Last thing to note… How much does ESRS cost? It is zero dollars when you purchase a piece of equipment that supports ESRS and you also purchase maintenance with the new system. So what happens if you purchased a system in the last year and you did not get your ESRS serial number? Make sure you get your serial number within the first year of purchase and that you get it installed before the end of that first year; they have got to have some cutoff, ya know. Anyhow, if you did not get it with your original purchase, just contact your reseller, or whichever avenue you purchased your EMC product through, and ask them for Zero Dollar ESRS. You should only need to provide them the order number, and they should be able to get it authorized and sent to you promptly. Once you have your serial number and your server is ready for installation, give EMC a call if they have not already contacted you about getting your ESRS set up. The CE will typically come on-site to set up the ESRS; sometimes they do it through WebEx, it just depends. Well, there you go, all you wanted to know about ESRS!