Sunday, September 8, 2019

PREVENTING THE ASM DISKGROUP "UNABLE TO DROP _DROPPED_* DISK" ISSUE & WORKAROUND FOR "REBALANCE NOT RUNNING" WHEN MIGRATING TO NEW STORAGE.



The basic idea of ASM failure groups is to make a diskgroup resilient to the failure of a storage-related hardware component (controller, pool, module, disk, etc.), or even of a complete storage node, by taking advantage of redundancy at the storage level.

For example, if a storage array has two controllers, two failure groups can be created for normal redundancy; when one controller fails, this does not cause downtime at the ASM/DB instance level.
If you have two storage nodes (typical for an extended/cross-data-centre RAC), two failure groups can each hold the disks of one storage node/site.
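A minimal sketch of creating such a diskgroup; the diskgroup name, failure group names and device paths here are hypothetical:

create diskgroup DATA normal redundancy
  failgroup storage_node1 disk '/dev/mapper/site1_disk01', '/dev/mapper/site1_disk02'
  failgroup storage_node2 disk '/dev/mapper/site2_disk01', '/dev/mapper/site2_disk02';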

There are other advantages as well, such as disk/site affinity for reads.
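Read affinity is typically configured through the ASM_PREFERRED_READ_FAILURE_GROUPS parameter; a sketch, reusing the hypothetical names from the diskgroup above:

alter system set asm_preferred_read_failure_groups = 'DATA.storage_node1' sid = '+ASM1';
alter system set asm_preferred_read_failure_groups = 'DATA.storage_node2' sid = '+ASM2';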

The purpose of this note is to describe an issue encountered during a storage migration from storage vendor A to storage vendor B. It is related to bug 19471572.

What happened:
In order to migrate to the new storage, a simple procedure was used (see the sketch after this list):
1. Two failure groups already exist.
2. Add one or two new failure groups from storage B.
3. Drop the old storage A failure group.
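In SQL terms, the intended sequence looked roughly like this; the device paths are hypothetical:

alter diskgroup DATAMIRROR add failgroup NEW_STORAGE1 disk '/dev/mapper/newdisk_01';
alter diskgroup DATAMIRROR add failgroup NEW_STORAGE2 disk '/dev/mapper/newdisk_02';
-- the diskgroup now has three failure groups; dropping the old one is expected to work
alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE;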

According to the Oracle documentation, "A normal redundancy disk group must contain at least two failure groups." But in this case, ASM refused to drop a failure group even though three failure groups existed, so two would still have remained after the drop.

Below is a test case:

=================================================CHECK DISK STATUS===============================


DISKGROUP_NAME  DISK_NAME                 FAILGROUP     MOUNT_S MODE_ST STATE
--------------- ------------------------- ------------- ------- ------- --------
DATAMIRROR      DATAMIRROR_0000           NEW_STORAGE2  CACHED  ONLINE  NORMAL
DATAMIRROR      DATAMIRROR_0003           OLD_STORAGE   CACHED  ONLINE  NORMAL
DATAMIRROR      DATAMIRROR_0002           NEW_STORAGE1  CACHED  ONLINE  NORMAL



SQL> alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE;
alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE
*
ERROR at line 1:
ORA-15067: command or option incompatible with diskgroup redundancy


"force" option can be used but this put the diskgroup in an inconsistent state.

SQL> alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE force;

Diskgroup altered.

SQL> @op

=================================================CHECK OPERATION STATUS===============================



   INST_ID GROUP_NUMBER OPERA PASS      STAT POWER ACTUAL  SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
---------- ------------ ----- --------- ---- ----- ------ ------ -------- -------- ----------- ---------- ------
         1            2 REBAL COMPACT   WAIT     6                                                              0
         1            2 REBAL REBALANCE WAIT     6                                                              0
         1            2 REBAL REBUILD   WAIT     6                                                              0
         1            2 REBAL RESYNC    WAIT     6                                                              0
         2            2 REBAL COMPACT   WAIT     6      6      0        0        0           0                  0
         2            2 REBAL REBALANCE WAIT     6      6      0        0        0           0                  0
         2            2 REBAL REBUILD   RUN      6      6   1854    48001   168039           0                  0
         2            2 REBAL RESYNC    DONE     6      6      0        0        0           0                  0

8 rows selected.
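(The @op script above is a local helper; a roughly equivalent query against gv$asm_operation might look like this:)

select inst_id, group_number, operation, pass, state, power, actual, sofar,
       est_work, est_rate, est_minutes, error_code, con_id
from   gv$asm_operation
order  by inst_id, pass;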


After the rebalance, the dropped disk was reported as MISSING/OFFLINE/FORCING; note also in the lsdg output below that DATAMIRROR has one offline disk and a negative Usable_file_MB.

=================================================CHECK DISK STATUS===============================

DISKGROUP_NAME  DISK_NAME                 FAILGROUP     MOUNT_S MODE_ST STATE
--------------- ------------------------- ------------- ------- ------- --------
DATAMIRROR      _DROPPED_0003_DATAMIRROR  OLD_STORAGE   MISSING OFFLINE FORCING
DATAMIRROR      DATAMIRROR_0000           NEW_STORAGE2  CACHED  ONLINE  NORMAL
DATAMIRROR      DATAMIRROR_0002           NEW_STORAGE1  CACHED  ONLINE  NORMAL

ASMCMD> lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512             512   4096  1048576    262144    91352           131072          -19860              1             N  DATAMIRROR/
MOUNTED  NORMAL  N         512             512   4096  1048576    102400    85444                0           42722              0             N  xx/
MOUNTED  EXTERN  N         512             512   4096  1048576     51200    51088                0           51088              0             N  xxxx/
MOUNTED  NORMAL  N         512             512   4096  1048576    104448   103815                0           50901              0             Y  xxxxx/
MOUNTED  NORMAL  N         512             512   4096  1048576     40960    39854            20480            9687              0             N  xxxxxxx/
ASMCMD>
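To spot leftover _DROPPED_* disks directly, a query along these lines can be used (a sketch against v$asm_disk; the underscores need escaping in LIKE):

select name, failgroup, mount_status, mode_status, state
from   v$asm_disk
where  name like '\_DROPPED\_%' escape '\';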


Workaround:

The idea, in order to prevent the situation above, is to keep a maximum of two failure groups at any particular point in time, by combining the add and the drop in a single ALTER DISKGROUP statement as below:


alter diskgroup DATAMIRROR
  add failgroup new_pool1 disk '/dev/mapper/newdisk_01_p1'
  drop disks in failgroup P1;

alter diskgroup DATAMIRROR
  add failgroup new_pool2 disk '/dev/mapper/newdisk01_01_p2'
  drop disks in failgroup P2;
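Each statement triggers a single rebalance that performs the add and the drop together; wait for it to finish before issuing the next statement. A minimal check, the rebalance is done when this returns no rows (or only DONE rows):

select inst_id, operation, pass, state, est_minutes
from   gv$asm_operation
where  group_number = (select group_number
                       from   v$asm_diskgroup
                       where  name = 'DATAMIRROR');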

MISC:
1. If you are already in this situation, you can add a quorum failgroup disk, but first consider whether your infrastructure properly supports this, i.e. whether you really have a third independent redundancy component at the storage level, like the first and second failure groups. This principle should work, but it has not been tested.
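As a hypothetical illustration of that idea (untested, as noted above; the failgroup name and device path are made up):

alter diskgroup DATAMIRROR add quorum failgroup quorum_fg disk '/dev/mapper/quorum_disk_p1';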

2. You can estimate the work using the V$ASM_ESTIMATE dynamic performance view, but I am disappointed that the view does not show a time-based value, only the number of allocation units.


SQL> explain work set statement_id='storage_mig_task1' for
alter diskgroup DATAMIRROR add failgroup new_pool2 disk '/dev/mapper/newdisk01_01_p2'
drop disks in failgroup P2;

Explained.


SQL> select est_work, group_number from v$asm_estimate where statement_id='storage_mig_task1';

  EST_WORK GROUP_NUMBER
---------- ------------
       169            2
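Since EST_WORK is expressed in allocation units, a rough data-volume estimate can be derived by joining to v$asm_diskgroup; a sketch, reusing the same statement_id:

select e.est_work,
       e.est_work * g.allocation_unit_size / 1024 / 1024 as est_mb
from   v$asm_estimate e
       join v$asm_diskgroup g
         on g.group_number = e.group_number
where  e.statement_id = 'storage_mig_task1';

With the 1 MB allocation unit shown in the lsdg output above, 169 allocation units is roughly 169 MB.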