Sunday, September 8, 2019

PREVENTING THE ASM DISKGROUP "UNABLE TO DROP _DROPPED_* DISK" SITUATION & A WORKAROUND FOR "REBALANCE NOT RUNNING" WHEN MIGRATING TO NEW STORAGE.



The basic idea of an ASM failure group is to make a diskgroup resilient to the failure of a storage-related hardware component (controller, pool, module, disk, etc.), or even to a complete storage node failure, by taking advantage of the redundancy at the storage level.

For example, if you have two controllers in a storage array, two failure groups can be created for normal redundancy; when one controller fails, this does not cause downtime at the ASM/DB instance level.
If you have two storage nodes (typical for an extended/cross data centre RAC), two failure groups can hold each storage node's/site's disks.
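As an illustration of this two-failure-group layout, a minimal sketch (the diskgroup name and disk paths here are hypothetical):

create diskgroup DATA normal redundancy
  failgroup storage_node1 disk '/dev/mapper/siteA_disk01', '/dev/mapper/siteA_disk02'
  failgroup storage_node2 disk '/dev/mapper/siteB_disk01', '/dev/mapper/siteB_disk02';

With normal redundancy, every extent is mirrored across the two failure groups, so losing all disks behind one controller or one site still leaves a complete copy of the data.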

There are other advantages, such as disk site affinity.

The purpose of this note is to describe an issue encountered during a storage migration from storage vendor A to storage vendor B. It is related to bug 19471572.

What happened:
In order to migrate to the new storage, a simple procedure was used:
1. Two failure groups already exist (on storage A).
2. Add one or two new failure groups from storage B.
3. Drop the old storage A failure group.

According to the Oracle documentation, "A normal redundancy disk group must contain at least two failure groups." But in this case, ASM refused to drop a failure group even though the number of failure groups was 3.

Below is a test case:

=================================================CHECK DISK STATUS===============================

DISKGROUP_NAME                 DISK_NAME                      FAILGROUP                      MOUNT_S MODE_ST STATE
------------------------------ ------------------------------ ------------------------------ ------- ------- --------
DATAMIRROR                     DATAMIRROR_0000                NEW_STORAGE2                   CACHED  ONLINE  NORMAL
DATAMIRROR                     DATAMIRROR_0003                OLD_STORAGE                    CACHED  ONLINE  NORMAL
DATAMIRROR                     DATAMIRROR_0002                NEW_STORAGE1                   CACHED  ONLINE  NORMAL
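The script behind the "CHECK DISK STATUS" output is not shown here; a query roughly along these lines returns the same columns:

select g.name diskgroup_name, d.name disk_name, d.failgroup,
       d.mount_status, d.mode_status, d.state
from   v$asm_disk d, v$asm_diskgroup g
where  d.group_number = g.group_number
and    g.name = 'DATAMIRROR';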



SQL> alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE;
alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE
*
ERROR at line 1:
ORA-15067: command or option incompatible with diskgroup redundancy


The "force" option can be used, but this puts the diskgroup in an inconsistent state.

SQL>  alter diskgroup DATAMIRROR drop disks in failgroup OLD_STORAGE force;

Diskgroup altered.

SQL> @op

=================================================CHECK OPERATION STATUS===============================

   INST_ID GROUP_NUMBER OPERA PASS      STAT POWER ACTUAL  SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
---------- ------------ ----- --------- ---- ----- ------ ------ -------- -------- ----------- ---------- ------
         1            2 REBAL COMPACT   WAIT     6                                                              0
         1            2 REBAL REBALANCE WAIT     6                                                              0
         1            2 REBAL REBUILD   WAIT     6                                                              0
         1            2 REBAL RESYNC    WAIT     6                                                              0
         2            2 REBAL COMPACT   WAIT     6      6      0        0        0           0                  0
         2            2 REBAL REBALANCE WAIT     6      6      0        0        0           0                  0
         2            2 REBAL REBUILD   RUN      6      6   1854    48001   168039           0                  0
         2            2 REBAL RESYNC    DONE     6      6      0        0        0           0                  0

8 rows selected.
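The @op script used above is essentially a query against GV$ASM_OPERATION; a rough sketch (not the exact script used):

select inst_id, group_number, operation, pass, state, power,
       actual, sofar, est_work, est_rate, est_minutes, error_code, con_id
from   gv$asm_operation
order  by inst_id, pass;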


After the rebalance, the dropped disk's status is reported as MISSING/FORCING:

=================================================CHECK DISK STATUS===============================

DISKGROUP_NAME       DISK_NAME      FAILGROUP      MOUNT_S MODE_ST STATE
------------------------------ ------------------------------ ------------------------------ ------- ------- --------
DATAMIRROR        _DROPPED_0003_DATAMIRROR      OLD_STORAGE     MISSING OFFLINE FORCING
DATAMIRROR        DATAMIRROR_0000      NEW_STORAGE2     CACHED  ONLINE  NORMAL
DATAMIRROR        DATAMIRROR_0002      NEW_STORAGE1     CACHED  ONLINE  NORMAL

ASMCMD> lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512             512   4096  1048576    262144    91352           131072          -19860              1             N  DATAMIRROR/
MOUNTED  NORMAL  N         512             512   4096  1048576    102400    85444                0           42722              0             N  xx/
MOUNTED  EXTERN  N         512             512   4096  1048576     51200    51088                0           51088              0             N  xxxx/
MOUNTED  NORMAL  N         512             512   4096  1048576    104448   103815                0           50901              0             Y  xxxxx/
MOUNTED  NORMAL  N         512             512   4096  1048576     40960    39854            20480            9687              0             N  xxxxxxx/
ASMCMD>


Workaround:

The idea, to prevent the above situation, is to keep a maximum of two failure groups at any point in time, by using combined add/drop ALTER DISKGROUP statements as below:


alter diskgroup DATAMIRROR
  add failgroup new_pool1 disk '/dev/mapper/newdisk_01_p1'
  drop disks in failgroup P1;

alter diskgroup DATAMIRROR
  add failgroup new_pool2 disk '/dev/mapper/newdisk01_01_p2'
  drop disks in failgroup P2;

MISC:
1. If you are already in this situation, you can add a quorum failgroup disk, but first verify that your infrastructure supports this properly, i.e. that you really have a third independent redundancy component at the storage level, just like the first and second failure groups. This principle should work but has not been tested.
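For illustration only (the failgroup name and disk path are hypothetical, and as noted this has not been tested here), adding a quorum failgroup would look roughly like:

alter diskgroup DATAMIRROR
  add quorum failgroup quorum_fg disk '/dev/mapper/quorum_disk_p1';

A quorum failgroup holds only ASM metadata (voting files), not user data, so a small disk is sufficient.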

2. You can estimate the work using the dynamic performance view V$ASM_ESTIMATE, although I am disappointed that the view does not show a time-based value, only the number of allocation units.


SQL> explain work set statement_id='storage_mig_task1' for
alter diskgroup DATAMIRROR add failgroup new_pool2 disk '/dev/mapper/newdisk01_01_p2'
drop disks in failgroup P2;

Explained.

SQL> select est_work, GROUP_NUMBER from v$asm_estimate where statement_id='storage_mig_task1';

  EST_WORK
----------
       169

Saturday, August 31, 2019

Running SLOB On Oracle Pluggable Database.






Why this note: as of version 2.4.2.1, SLOB does not support execution against an Oracle PDB. The error below is generated when you try to execute SLOB against a pluggable database.



NOTIFY  : 2019.08.09-20:20:31 : Testing Admin connect using "sqlplus -L system/orcl@pdb"
FATAL   : 2019.08.09-20:20:36 : 
FATAL   : 2019.08.09-20:20:36 : This version of SLOB does not support Oracle Multitenant Option
FATAL   : 2019.08.09-20:20:36 : 



What is the workaround?


Edit setup.sh and comment out the "exit 1" as shown below.

if ( ! f_check_mto "$ADMIN_CONNECT_STRING" 2>&1 >> $LOG)
then
        f_msg FATAL ""
        f_msg FATAL "This version of SLOB does not support Oracle Multitenant Option"
        f_msg FATAL
        #remove Multitenant check
        #exit 1
fi



Problem 2: No AWR report after SLOB is executed.



AWR reports are not captured at the root container level, only in the PDB (obviously, since *_SERVICE_BASE points to the PDB). SLOB is designed to run AWR against a non-multitenant database and queries DBA_HIST_SNAPSHOT rather than AWR_PDB_SNAPSHOT. The AWR snapshots are created at the PDB level, but SLOB generates the AWR report at the CDB level.
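A quick sanity check (a sketch; run the first query from CDB$ROOT and the second while connected to the PDB) shows where the snapshots actually land:

-- from CDB$ROOT: CDB-level AWR snapshots
select max(snap_id) from dba_hist_snapshot;

-- from the PDB: snapshots stored in the PDB itself
select max(snap_id) from awr_pdb_snapshot;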

Workaround: this solution enables automatic collection of AWR at the CDB level, and you can manually collect a PDB AWR report for the same time frame (see the sketch after the run steps below).
There are advantages to collecting at the CDB level, e.g. having overall performance data; contention between PDBs for resources can be spotted in the CDB AWR report. Also, I noticed that the "Write IO (MB):" information is missing from the Load Profile in the PDB's AWR report, although the write I/O information is present in the "IOStat" section.

Using the external script hook ($SLOB/misc/external_script.sh, saved here as cdb_external_script.sh and referenced via EXTERNAL_SCRIPT in slob.conf), edit it as follows:


#!/bin/bash
# Called by SLOB through EXTERNAL_SCRIPT: takes a CDB-level AWR snapshot
# before (pre) and after (post) the SLOB run.

case "$1" in
        pre)
sqlplus SYSTEM/orcl@orcl19xxx <<EOF
spool cdb_external_script.log
SET TIME ON
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
EXIT
EOF
        ;;
        post)
sqlplus SYSTEM/orcl@orcl19xxx <<EOF
spool cdb_external_script.log
SET TIME ON
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
EXIT
EOF
        ;;
        end) : ;;
esac

exit 0


Without the above mechanism, the error below will be generated in slob_debug.out. Other methods would involve modifying more scripts!


declare
*
ERROR at line 1:
ORA-20200: End Snapshot Id 87 must be greater than Begin Snapshot Id 87
ORA-06512: at line 38


Executing SLOB (summarized); for details, check Kevin Closson's blog.


1. Edit slob.conf:


UPDATE_PCT=20
SCAN_PCT=0
RUN_TIME=300
WORK_LOOP=0
SCALE=80M
SCAN_TABLE_SZ=1M
WORK_UNIT=64
REDO_STRESS=LITE
LOAD_PARALLEL_DEGREE=2
THREADS_PER_SCHEMA=1
DATABASE_STATISTICS_TYPE=awr # Permitted values: [statspack|awr]
ADMIN_SQLNET_SERVICE="pdb"
SQLNET_SERVICE_BASE="pdb"
DBA_PRIV_USER="system"              
SYSDBA_PASSWD="orcl"
EXTERNAL_SCRIPT='/home/oracle/SLOB/misc/cdb_external_script.sh'
DO_HOTSPOT=FALSE
HOTSPOT_MB=8
HOTSPOT_OFFSET_MB=16
HOTSPOT_FREQUENCY=3
HOT_SCHEMA_FREQUENCY=0
THINK_TM_FREQUENCY=0
THINK_TM_MIN=.1
THINK_TM_MAX=.5

2. In the PDB, create the tablespace for the SLOB data:
SQL> create tablespace test_io_nolog datafile '+DATA' size 1G NOLOGGING;

3. cd to the SLOB directory and run the setup command to load the data:
./setup.sh test_io_nolog 2

4. Run the SLOB workload:
$ ./runit.sh -s 2 -t 8
  # ./runit.sh -s <number-of-slob-schemas-to-test> -t <SLOB-threads-per-schema>
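To manually collect the matching PDB AWR report for the same time frame (as mentioned in the workaround above), the standard AWR report script can be run while connected to the PDB; the exact prompts vary by version, but roughly:

$ sqlplus system/orcl@pdb
SQL> @?/rdbms/admin/awrrpt.sql
-- in 12.2+ choose the PDB (AWR_PDB) as the AWR data location when prompted,
-- then pick the begin/end snapshot ids that bracket the SLOB run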


Some screenshots (AWR report excerpts, not reproduced here): the PDB AWR report does not contain write information in the "Load Profile" section.





Saturday, January 5, 2019

ORA-29740: evicted by instance number | Linux rp_filter

Issue encountered on Redhat 7.5 and Oracle 12.2 RAC.
Red Hat Enterprise Linux Server release 7.5 (Maipo)

After installation and creation of the Oracle database, only one instance could be started at a time.

Red Hat's default protection against IP spoofing / distributed denial-of-service (DDoS) attacks, strict reverse path filtering (rp_filter=1), was blocking KSXPPING (LMS cross-instance ping activity) on the private interconnect.

Errors received in the alert log, as below:

KSXPPING: KSXP selected for Ping
2019-01-03T17:58:11.401836+01:00
Errors in file /u02/base/diag/rdbms/ikeja/ikeja1/trace/ikeja1_lmon_365438.trc  (incident=57777) (PDBNAME=CDB$ROOT):
ORA-29740: evicted by instance number 2, group incarnation 12
Incident details in: /u02/base/diag/rdbms/ikeja/ikeja1/incident/incdir_57777/ikeja1_lmon_365438_i57777.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2019-01-03T17:58:11.955366+01:00
Errors in file /u02/base/diag/rdbms/ikeja/ikeja1/trace/ikeja1_lmon_365438.trc:
ORA-29740: evicted by instance number 2, group incarnation 12
Errors in file /u02/base/diag/rdbms/ikeja/ikeja1/trace/ikeja1_lmon_365438.trc  (incident=57778) (PDBNAME=CDB$ROOT):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u02/base/diag/rdbms/ikeja/ikeja1/incident/incdir_57778/ikeja1_lmon_365438_i57778.trc
2019-01-03T17:58:12.007832+01:00
IPC Send timeout to 2.3 inc 10 for msg type 65522 from opid 26
2019-01-03T17:58:12.011091+01:00
IPC Send timeout to 2.4 inc 10 for msg type 65522 from opid 27
2019-01-03T17:58:12.026169+01:00
IPC Send timeout to 2.6 inc 10 for msg type 65522 from opid 29
2019-01-03T17:58:12.031554+01:00
IPC Send timeout to 2.2 inc 10 for msg type 65522 from opid 25
2019-01-03T17:58:12.178083+01:00
LCK0 (ospid: 366513): terminating the instance due to error 481
2019-01-03T17:58:23.195079+01:00
Instance terminated by LCK0, pid = 366513

ikea1[](/u02/base/diag/rdbms/ikeja/ikeja1/trace)$ oerr ora 29740
29740, 00000, "evicted by instance number %s, group incarnation %s"
// *Cause: This instance was evicted from the group by another instance of the
//         cluster database group for one of several reasons, which may
//         include a communications error in the cluster and failure to issue
//         a heartbeat to the control file.
// *Action: Check the trace files of other active instances in the cluster
//          group for indications of errors that caused a reconfiguration.
ikea1[](/u02/base/diag/rdbms/ikeja/ikeja1/trace)$


Solution: reduce the level of filtering by setting rp_filter=2 (loose mode) on the HAIP/interconnect interfaces.
The current values are:
$ sysctl -a|grep rp_filter|grep 'haip'
..
net.ipv4.conf.eth-haip_70.arp_filter = 0
net.ipv4.conf.eth-haip_70.rp_filter = 1
net.ipv4.conf.eth-haip_71.arp_filter = 0
net.ipv4.conf.eth-haip_71.rp_filter = 1

Change the values online with the -w option of sysctl:
sudo sysctl -w net.ipv4.conf.eth-haip_70.rp_filter=2
sudo sysctl -w net.ipv4.conf.eth-haip_71.rp_filter=2

To make the change permanent, update it via tuned, a file under /etc/sysctl.d, or /etc/sysctl.conf, for example as sketched below.
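A drop-in file under /etc/sysctl.d (the file name is just an illustration), applied with sysctl --system:

# /etc/sysctl.d/98-rac-interconnect.conf
net.ipv4.conf.eth-haip_70.rp_filter = 2
net.ipv4.conf.eth-haip_71.rp_filter = 2

$ sudo sysctl --system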

HTH