Friday, April 30, 2010

SRM Error: Failed to recover datastore

We had set up SRM with 4.0 U1, and the VMs were on ESX 3.5 U4. We set up replication between the two locations and then decided to simulate DR using the “Test RUN” option. It mounted the LUN on the ESX host fine, but when it tried to recover the VMs it failed with the error “Error: Failed to recover datastore: ”.

We then ran the test again with a console session open to the NetApp filer, and this is what we found:

Filer 01> Fri Apr 23 05:04:36 EST [XYZNAP005: wafl.volume.clone.fractional_rsrv.changed:info]: Fractional reservation for    clone 'testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP' was changed to 100 percent because guarantee is set to'file' or 'none'.

Fri Apr 23 05:04:37 EST [XYZNAP005: wafl.volume.clone.created:info]: Volume clone  testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP of volume S_XYZESX013_14_15_16PP was created successfully.Creation of clone volume 'testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP' has completed.

Fri Apr 23 05:04:37 EST [XYZNAP005: lun.newLocation.offline:warning]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun12 has been taken offline to prevent map conflicts after a copy or move operation.

Fri Apr 23 05:04:37 EST [XYZNAP005: lun.newLocation.offline:warning]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun11 has been taken offline to prevent map conflicts after a copy or move operation.

Fri Apr 23 05:04:37 EST [XYZNAP005: lun.newLocation.offline:warning]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun9 has been taken offline to prevent map conflicts after a copy or move operation.

Fri Apr 23 05:04:37 EST [XYZNAP005: lun.newLocation.offline:warning]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun10 has been taken offline to prevent map conflicts after a copy or move operation.

Fri Apr 23 05:04:37 EST [XYZNAP005: wafl.inode.fill.disable:info]: fill reservation disabled for inode 33411686 (vol testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP).

Fri Apr 23 05:04:37 EST [XYZNAP005: wafl.inode.overwrite.disable:info]: overwrite reservation disabled for inode 33411686 (vol testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP).

Fri Apr 23 05:04:38 EST [XYZNAP005: lun.map:info]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun12 was mapped to initiator group srm_esx_host=0

Fri Apr 23 05:04:38 EST [XYZNAP005: app.log.info:info]: AMSVCS001PP: Disaster Recovery SAN Adapter Storage Replication Adapter 1.4: (2) Test-Failover-start Event: Disaster Recovery SAN Adapter executed Test-Failover-start operation with errors from OS major version = 5 ,minor version = 2 ,package = Service Pack 2 and build = 3790

Fri Apr 23 05:04:42 EST [XYZNAP005: iscsi.notice:notice]: ISCSI: New session from initiator iqn.2000-04.com.qlogic:qle4062c.lfc0852h55321.2 at IP addr 10.X.X.X

Fri Apr 23 05:04:48 EST [XYZNAP005: wafl.vol.full:notice]: file system on volume testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP is full

Fri Apr 23 05:04:48 EST [XYZNAP005: scsitarget.write.failureNoSpace:error]: Write to LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun12 failed due to lack of space.
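
In a long console capture like this, the two lines that matter (wafl.vol.full and scsitarget.write.failureNoSpace) are easy to miss. Here is a minimal Python sketch that pulls them out. It is not a NetApp or SRM tool; the regular expression is only inferred from the message layout shown above, and the names EVENT_RE, SPACE_EVENTS, parse_line and space_events are made up for illustration.

```python
import re

# Layout inferred from the console output above:
#   <timestamp> [<filer>: <event.name>:<severity>]: <message>
EVENT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\s*(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)

# Event names copied from the log above; extend as needed.
SPACE_EVENTS = {"wafl.vol.full", "scsitarget.write.failureNoSpace"}


def parse_line(line):
    """Return host/event/severity/message for one console line, or None."""
    match = EVENT_RE.search(line)
    return match.groupdict() if match else None


def space_events(lines):
    """Keep only the records whose event name signals an out-of-space condition."""
    hits = []
    for line in lines:
        record = parse_line(line)
        if record and record["event"] in SPACE_EVENTS:
            hits.append(record)
    return hits


# Three lines taken from the capture above.
sample = [
    "Fri Apr 23 05:04:38 EST [XYZNAP005: lun.map:info]: LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun12 was mapped to initiator group srm_esx_host=0",
    "Fri Apr 23 05:04:48 EST [XYZNAP005: wafl.vol.full:notice]: file system on volume testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP is full",
    "Fri Apr 23 05:04:48 EST [XYZNAP005: scsitarget.write.failureNoSpace:error]: Write to LUN /vol/testfailoverClone_nss_v10745371_S_XYZESX013_14_15_16PP/lun12 failed due to lack of space.",
]

for hit in space_events(sample):
    print(hit["severity"], hit["event"], "-", hit["message"])
```

Run against the full capture, this would leave just the "volume is full" notice and the failed write, which is what pointed to space in the first place.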

NetApp looked at the error and told me that the operation was timing out during the retry process, and that it did not really look like a space issue because the “aggr” holding this LUN had enough space.

I decided to test it myself and created two LUNs of 100GB and 90GB. These LUNs held a few VMs and had around 75% free space. I ran SRM in both test and DR mode, and both worked great. That gave me enough reason to believe the failure was caused by space and not by some bug.

I called NetApp and showed them exactly what I was doing. At this point they ran the following command:

Filer > df -r testfailoverClone_nss_v10745371_S_XYZESX001PP  (This is the cloned volume which SRM was trying to mount.) It showed that the fractional reserve space was full, which is why the cloned LUN could not be mounted.


From these tests I understood that if the protected LUN is completely full and you run an SRM test against it (which uses the FlexClone mechanism), you have to make sure the volume at the recovery site is double the size, because the test tries to mount the cloned LUN on the same volume.
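
The "double the size" rule can be put into rough numbers. The Python sketch below uses a deliberately simplified model that I am assuming for illustration: the volume has to hold the LUN itself plus an overwrite reserve equal to the used data multiplied by the fractional reserve percentage (100 percent in the log above). It ignores snapshot reserve and metadata, and the function name required_volume_gb is hypothetical, so treat the result as an estimate and not as NetApp's own sizing formula.

```python
def required_volume_gb(lun_size_gb, lun_used_gb, fractional_reserve_pct=100):
    """Rough volume size needed for a test failover, under the simplified model above.

    lun_size_gb            -- provisioned size of the LUN
    lun_used_gb            -- data actually written inside the LUN
    fractional_reserve_pct -- fractional reserve on the cloned volume (assumed 100)
    """
    if not 0 <= lun_used_gb <= lun_size_gb:
        raise ValueError("lun_used_gb must be between 0 and lun_size_gb")
    overwrite_reserve_gb = lun_used_gb * fractional_reserve_pct / 100
    return lun_size_gb + overwrite_reserve_gb


# A completely full 100GB LUN: the volume needs about twice the LUN size.
full_lun = required_volume_gb(100, 100)

# A 100GB LUN with about 75% free space, like the test LUNs: far less headroom needed.
mostly_empty_lun = required_volume_gb(100, 25)

print(full_lun, mostly_empty_lun)
```

Under this model the full LUN needs roughly 200GB of volume space while the mostly empty one needs about 125GB, which fits with the test LUNs working while the full production LUN failed.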
