Monday, November 23, 2009

Another Milestone Implementing SRM for Disaster Recovery

There are many FAQs available about SRM installation and configuration. One that I like is:

http://www.yellow-bricks.com/srm-faq/

I would like to share my own personal experience. On Saturday, 21st November 2009, I tested a live DR scenario to validate our customer's readiness.

This involved 26 key customer applications running across 17 physical database servers and 45 virtual machines. We excluded AD/Exchange; for Exchange we used the EMS system from Dell.

For the SRM setup I used:

1. 4 ESX hosts running on DL380 G5 servers; a few of them had QLogic HBAs installed and the others used software iSCSI.

2. All 4 ESX hosts were installed with ESX 3.5 U4, and on both sides I was running VirtualCenter 2.5 U5.

3. All 4 ESX hosts were configured for HA/DRS.

4. Unfortunately all 45 VMs were spread across 33 LUNs. I could have consolidated them from the SRM perspective.

5. For this purpose I used SRM 1.0.1 with Patch 1. We are using NetApp as our storage solution.

6. To save bandwidth we kept the protected and recovery filers at the same location, performed the initial replication, and then shipped the recovery filer to the DR location. This saves bandwidth during the initial replication, since afterwards we depend only on incremental replication.

7. We are using a "stretched VLAN" so that we don't have to re-IP the machines when they are recovered at the recovery site.

SRM setup and configuration:

1. Create a service account for SRM, which will be used to set up the SQL database for SRM and again during the SRM installation.

2. Install the SRM database on SQL Server at each Virtual Center. I configured the DSN beforehand, although the SRM setup will also prompt for it.

3. Install all the SRM patches for 1.0 and then install the SRA (Storage Replication Adapter). This adapter comes from the vendor whose storage solution you are using. You can download it from the VMware site only if you log in to your account; they don't allow open downloads. In my case I used the NetApp SRA.

4. Once both the Protected Site and the Recovery Site have all the components installed, we can start pairing the sites. At the Recovery Site we need to provide the Protected Site IP, and at the Protected Site we need to provide the Recovery Site IP, so that they can be paired. Use the service account to pair the sites.

5. Once the sites are paired, we need to configure the SRA for the Protected Site and the Recovery Site. Before configuring this we should ensure that replication is complete. Check with the storage admin, or else we cannot proceed further with the configuration (a quick check from the filer is also sketched below). This replication also lets us create the bubble VMs at the recovery site.
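
As a quick sanity check (a minimal sketch, assuming NetApp 7-mode SnapMirror; the volume name is just an example reused from later in this post), you can verify the mirror state yourself on the recovery-site filer:

recovery_filer> snapmirror status
recovery_filer> snapmirror status S_xxxx_011PP_vol1

Each relationship should show a state of "Snapmirrored" and a status of "Idle" before you proceed to the array manager wizard.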

Once it is confirmed that replication is complete, configure the arrays by running the "Configure Array Managers" wizard.

a. For the Protected Site, supply the IP address of the NetApp filer. Ensure that you use root or root-like credentials to authenticate. You should then see the replicated Array Pairs. If you have several filers with LUNs/volumes spread across them, use the Add button to add all of those filers.

clip_image002

b. Then we configure the same for the Recovery Site. Whichever filers are paired, ensure that you add all of the paired filers at the recovery site as well, again using root/root-like credentials. You can keep adding all the recovery-site filer IPs; once done, you will see a green icon indicating that the array has been paired properly.

clip_image004

c. Once the Protected Site and Recovery Site have been paired using the replicated arrays, we can see all the replicated datastores. If they are not visible, rescan the arrays.

clip_image006

After the setup at the Protected Site, we need to do the same at the Recovery Site. The Protected Site and Recovery Site IPs remain as we configured them at the Protected Site. You won't be able to see the Replicated Datastores at the Recovery Site, as we are not configuring two-way (bidirectional) recovery.

6. Now we need to configure the Inventory Mappings at the Protected Site.

clip_image008

This is a very crucial and critical setup at the Protected Site. Before we do the inventory mapping, we should consolidate all the VMs into a single folder, named something like SRM, at the Protected Site. Create a similar folder at the Recovery Site so that we have a one-to-one mapping. We also need to make sure that if we are mapping the "Compute Resource" to a cluster, that cluster has HA/DRS enabled, or it will NOT allow the bubble VMs to be created.

We are assuming that we are using stretched VLANs and that those VLANs have already been created at the Recovery Site.

This step is not required at the Recovery Site.

7. We now need to create Protection Groups at the Protected Site; this is not required at the Recovery Site.

clip_image010

A protection group is based on individual LUNs and defines which VMs on those LUNs need to be protected. One LUN can be part of only one protection group and cannot be mapped to any other protection group. Hence it is very important to classify the LUNs into HIGH/NORMAL/LOW categories and plan the mapping accordingly. If we have done the inventory mapping correctly, the status will be shown as "configured for protection".

clip_image012

We can also run through the wizard to configure each VM individually. Here we can set the startup priority as low/normal/high, based on which VMs should be started first.

clip_image014

If the replication is complete, you can also see the status on the storage devices.

clip_image016

Also note that at the Protected Site the VMs should not have any CD-ROM or floppy devices connected, or else the automatic configuration will fail (a quick way to check for this is sketched below).
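
A rough way to spot offenders up front (a sketch only, run from an ESX service console; it assumes the CD-ROM is the usual ide1:0 device and that the VM folders sit one level under the datastores):

# grep -il 'ide1:0.startConnected = "true"' /vmfs/volumes/*/*/*.vmx
# grep -il 'floppy0.startConnected = "true"' /vmfs/volumes/*/*/*.vmx

Any .vmx listed still has a CD-ROM or floppy set to connect at power on; also check the running VMs for devices that are currently connected.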

8. Once we have all the above steps configured, we can start with the setup of the Recovery Plan at the Recovery Site. This recovery plan can be run in two modes: test and recovery. Test mode is to ensure that your actual recovery runs as per requirements. It runs using FlexClone LUNs (in the case of NetApp storage) and a recovery "bubble" network, so we need to ensure the FlexClone license is installed on the recovery-site filer (a quick check is shown below the screenshot). These tests do not impact the production systems. After the test, all the VMs are powered off and the LUNs are resynced. Here we can also move the VMs in the priority list to make sure the right VM starts first, then the next VM, and so on.

clip_image018
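
Regarding the FlexClone license mentioned above, you can confirm it from the recovery-site filer console (7-mode syntax; in my experience the entry is named flex_clone, and the key below is just a placeholder):

recovery_filer> license
recovery_filer> license add XXXXXXX

The first command lists the installed licenses; run the second only if the flex_clone entry is missing.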

9. In the actual recovery mode, SRM attaches the LUNs and then starts powering on the VMs from high to low priority. Once this is over, you can export the report by clicking History.

clip_image020

PANIC: SWARM REPLAY: log update failed

We have been trying to implement SRM, and I set up the NetApp 7.3 simulator on Ubuntu. During the setup the simulator assigns only 3 disks:

clip_image002

This aggregate is created by default when the simulator is installed, with 3 x 120 MB disks. Make sure that when the installation is completed you add extra disks to the aggregate, or else you will land in the situation I have described above.
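
Assuming you have already made the extra simulated disks available to the simulator (that part is done through the simulator's own setup, not through Data ONTAP), growing the root aggregate is standard 7-mode syntax; the disk count below is just an example:

filer> aggr status -s
filer> aggr add aggr0 3
filer> df -A aggr0

The first command lists the available spares, the second adds three of them to aggr0, and the last confirms the new aggregate size.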

When you get the above error, boot the system into maintenance mode and then create a new root volume. To get into maintenance mode, re-run the setup and set "floppy boot" to Yes. Also note that option 4a will initialize all the disks and you will lose data.

clip_image004

I booted the system into maintenance mode and set aggr0 (the root volume) offline.

clip_image006

Now create the new root volume with the following command:

clip_image008

Reboot the system and run ./runsim.sh. If you have followed all the steps, you should be able to get the sim online. I got the following warnings, as I had not gone through the 4a step.

root@netappsim1:/sim# ./runsim.sh

runsim.sh script version Script version 22 (18/Sep/2007)

This session is logged in /sim/sessionlogs/log

NetApp Release 7.3: Thu Jul 24 12:55:28 PDT 2008

Copyright (c) 1992-2008 Network Appliance, Inc.

Starting boot on Thu Nov 12 14:30:07 GMT 2009

Thu Nov 12 14:30:15 GMT [fmmb.current.lock.disk:info]: Disk v4.18 is a local HA mailbox disk.

Thu Nov 12 14:30:15 GMT [fmmb.current.lock.disk:info]: Disk v4.17 is a local HA mailbox disk.

Thu Nov 12 14:30:15 GMT [fmmb.instStat.change:info]: normal mailbox instance on local side.

Thu Nov 12 14:30:16 GMT [raid.vol.replay.nvram:info]: Performing raid replay on volume(s)

Restoring parity from NVRAM

Thu Nov 12 14:30:16 GMT [raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.

Thu Nov 12 14:30:16 GMT [raid.stripe.replay.summary:info]: Replayed 0 stripes.

Replaying WAFL log

.........

Thu Nov 12 14:30:20 GMT [rc:notice]: The system was down for 542 seconds

Thu Nov 12 14:30:20 GMT [javavm.javaDisabled:warning]: Java disabled: Missing /etc/java/rt131.jar.

Thu Nov 12 14:30:20 GMT [dfu.firmwareUpToDate:info]: Firmware is up-to-date on all disk drives

Thu Nov 12 14:30:20 GMT [sfu.firmwareUpToDate:info]: Firmware is up-to-date on all disk shelves.

Thu Nov 12 14:30:21 GMT [netif.linkUp:info]: Ethernet ns0: Link up.

Thu Nov 12 14:30:21 GMT [rc:info]: relog syslog Thu Nov 12 13:50:59 GMT [sysconfig.sysconfigtab.openFailed:notice]: sysconfig: table of valid configurations (/etc/sys

Thu Nov 12 14:30:21 GMT [rc:info]: relog syslog Thu Nov 12 14:00:00 GMT [kern.uptime.filer:info]:   2:00pm up  2:09 0 NFS ops, 0 CIFS ops, 0 HTTP ops, 0 FCP ops, 0 iS

Thu Nov 12 14:30:21 GMT [httpd.servlet.jvm.down:warning]: Java Virtual Machine is inaccessible. FilerView cannot start until you resolve this problem.

Thu Nov 12 14:30:21 GMT [sysconfig.sysconfigtab.openFailed:notice]: sysconfig: table of valid configurations (/etc/sysconfigtab) is missing.

Thu Nov 12 14:30:21 GMT [snmp.agent.msg.access.denied:warning]: Permission denied for SNMPv3 requests from root. Reason: Password is too short (SNMPv3 requires at least 8 characters).

Thu Nov 12 14:30:22 GMT [mgr.boot.disk_done:info]: NetApp Release 7.3 boot complete. Last disk update written at Thu Nov 12 14:21:08 GMT 2009

Thu Nov 12 14:30:22 GMT [mgr.boot.reason_ok:notice]: System rebooted after power-on.

Thu Nov 12 14:30:22 GMT [perf.archive.start:info]: Performance archiver started. Sampling 20 objects and 187 counters.

Check if Java is disabled:

filer> java

Java is not enabled.

If Java is not enabled then FilerView won't work; you need to re-install the simulator image.

I will explain the simulator reinstall using NFS/CIFS once I have tested it. Keep following my blog.

Thursday, November 12, 2009

SnapShot with RDM

One of my co-workers asked: "I have an RDM disk; can I take a snapshot of it?"
I said yes you can, but it depends on the mode in which you added the RDM. It is not possible to take a snapshot of an RDM in physical compatibility mode; it is only possible in virtual compatibility mode.

Find out more here
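
The mode is fixed when the mapping file is created. As a rough illustration (the device and file paths here are made up), a virtual compatibility mode RDM is created with the -r flag of vmkfstools, while physical/pass-through mode uses -z:

# vmkfstools -r /vmfs/devices/disks/<device> /vmfs/volumes/ds1/vm1/vm1_rdm.vmdk
# vmkfstools -z /vmfs/devices/disks/<device> /vmfs/volumes/ds1/vm1/vm1_rdmp.vmdk

The first (virtual mode) RDM can be snapshotted along with the VM; the second (physical mode) cannot.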

Set up a quick web server

1. Download the mongoose installer from http://code.google.com/p/mongoose/
2. Install it in C:\mongoose.
3. Go to C:\mongoose and double-click mongoose.exe to start the web server.

That's it...

Now put the folder you want to access over HTTP into C:\. If you browse to http://localhost:8080 you should be able to see the contents of C:\ in the browser.

Note: By default the mongoose web root folder is set to C:\; you can change it to a folder of your own.

Monday, November 9, 2009

I am now a VMware Certified Professional on vSphere 4

Yesterday I completed my VCP 410 with a whopping 494 marks. Yes, I missed 500/500 by just 6 marks. In my last exam, VCP 310, I finished with the same score. Compared to VCP 310, this one was a bit tougher.
I referred to the VCP 410 "What's New" student manual and the Configuration Maximums document for vSphere 4. Sometimes I failed to understand the testing pattern. For example, out of 4 answers only 2 will be correct as per VMware, even when we are sure that 3 of them are correct; VMware will only accept the two which they think are best. I guess they should add some intelligence to their question database and also allow accepting the 3rd answer.

Wednesday, November 4, 2009

How to perform manual DR using VMware and NetApp?

Well, I have been working on SRM, as you can find from a few references to it on my blog. But I was asked to work out a "Plan B" in case SRM fails.
We used SnapMirror technology from NetApp to accomplish "Plan B".
Here is what I suggested:
1. Initially set up SnapMirror over the LAN, and once the baseline transfer is completed, ship the filer to the DR location. This way we save bandwidth for replication, as only the changes will be replicated afterwards.
2. With a manual process we need to maintain some documentation, especially a list of all the LUNs that get replicated across locations along with their serial numbers. The reason is explained below.
3. Once the replication is over, we start preparing for DR testing. For testing purposes I selected one ESX host with dummy VLANs.
4. I broke the SnapMirror relationship between the filers for the volume I was interested in testing. Once the mirror is broken, the volume becomes active and all its LUNs are visible on the filer, with the same LUN names and LUN numbers, but the serial numbers will have changed.
5. This is the scariest part of the entire exercise. If a LUN's serial number does not match that of the primary site, the LUN will appear as a blank LUN. We need a one-to-one mapping of LUN serial numbers, each matching its counterpart at the protected site.

Before changing the LUN serial number at the recovery site, we need to take the LUN offline and then run the following commands to change the serial number:

lun offline <lun_path>
lun serial <lun_path> <serial_number>

E.g.: lun serial /vol/S_xxxx_011PP_vol1/lun1 12345

6. Once it has the same serial number, map the LUN to the correct igroup and rescan the HBAs on the ESX host. Once the rescan is completed, all the LUNs and datastores will appear "AS IS" at the recovery site.

7. We have to register all the VMs in order to power them on. This can be accomplished using a script; a consolidated sketch of steps 4-7 follows below.
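
Putting steps 4 to 7 together, the recovery-side sequence looks roughly like this. It is only a sketch: the volume, LUN, igroup, HBA and datastore paths are examples, and the serial number must be the one you recorded from the protected site.

On the recovery filer:

recovery_filer> snapmirror break S_xxxx_011PP_vol1
recovery_filer> lun offline /vol/S_xxxx_011PP_vol1/lun1
recovery_filer> lun serial /vol/S_xxxx_011PP_vol1/lun1 <serial from protected site>
recovery_filer> lun online /vol/S_xxxx_011PP_vol1/lun1
recovery_filer> lun map /vol/S_xxxx_011PP_vol1/lun1 esx_igroup

On the ESX host (service console):

# esxcfg-rescan vmhba1
# for vmx in /vmfs/volumes/*/*/*.vmx; do vmware-cmd -s register "$vmx"; done

After the rescan the datastores should be visible again, and the loop registers every .vmx it finds so the VMs can be powered on.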

Happy DR.