Thursday, September 19, 2013

Oracle RAC cluster NIC bonding.

You need to do the following steps to start using NIC bonding with Oracle RAC (this adds bond interfaces for both the public network and the private interconnect):

NOTE! This operation requires full downtime for the cluster databases, and depending on your environment you might need to reboot the servers while making these changes.

1. Add the new bond0 and bond1 interfaces for the RAC cluster globally via oifcfg. The public interface is bond0 and the cluster_interconnect interface is bond1:



Get the current network interface configuration used by the cluster, as the oracle user:
oifcfg getif




Set the new bond interfaces. This updates the OCR (-global makes the change on all nodes in the cluster). Run these on one node as the oracle user (the subnets can stay the same as before; only the interface names change):
oifcfg setif -global bond0/10.77.5.0:public
oifcfg setif -global bond1/10.37.24.0:cluster_interconnect

oifcfg getif
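
For reference, after the setif commands the getif output should show both the old and the new interfaces, roughly like the illustrative lines below (interface names and subnets depend on your environment; the old eth0/eth1 entries disappear after step 5):

eth0   10.77.5.0   global  public
eth1   10.37.24.0  global  cluster_interconnect
bond0  10.77.5.0   global  public
bond1  10.37.24.0  global  cluster_interconnect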



2. Stop the databases and disable + stop CRS on all nodes:


Do this on one node as the oracle user, for every database in the cluster (a note on listing the registered databases follows these commands):
srvctl status database -d <database_name>  
srvctl stop database -d <database_name>
srvctl status database -d <database_name>  
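
If you are not sure which databases are registered in the cluster, you can list their names first (run as the oracle user on one node):
srvctl config database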

Do these on both nodes as the root user:
crsctl disable crs
crsctl check crs
crsctl stop crs
crsctl check crs
 


3. Change the OS network interfaces to use NIC bonding as described in the link below (that guide only adds bond0, but it is better to create bond0 (eth0 and eth1) for the public network and bond1 (eth2 and eth3) for the interconnect; you don't need to create the alias interfaces). A configuration sketch follows the link:

http://www.oracle-base.com/articles/linux/nic-channel-bonding.php
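
As a quick reference, an active-backup bond0 configuration on RHEL/OEL typically looks like the sketch below. The file locations follow the RHEL 5/6 conventions, and the IP address, netmask and bonding options are examples only; adjust them to your environment and repeat the same pattern for bond1 with eth2 and eth3:

# /etc/modprobe.d/bonding.conf (or /etc/modprobe.conf on older releases)
alias bond0 bonding

# /etc/sysconfig/network-scripts/ifcfg-bond0
# (example public IP for this node)
DEVICE=bond0
IPADDR=10.77.5.101
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no

After restarting the network service you can verify the bond state with:
cat /proc/net/bonding/bond0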



If your environment requires a node reboot after these changes, do it now.








4. Enable and start CRS on all nodes:

Do these on both nodes as the root user:
crsctl enable crs
crsctl check crs
crsctl start crs
crsctl check crs

Check that CRS started on both nodes. Run this as the root user on one node:
crsctl stat res -t

 

5. Remove the old interfaces from the RAC cluster via oifcfg:


Get the current cluster interface configuration as the oracle user on one node:
oifcfg getif
 
Delete the old eth0 and eth1 interfaces as the oracle user on one node:
oifcfg delif -global eth0/10.77.5.0
oifcfg delif -global eth1/10.37.24.0

Check that only the new bond interfaces are listed, as the oracle user on one node:
oifcfg getif


6. Repair the SCAN and VIP addresses to use bond0 instead of eth0:

Check the cluster's current network status as the oracle user:
srvctl status nodeapps

Check the current VIP settings. Run this on all nodes as the oracle user:
srvctl config vip -n <node_name>

Check the current SCAN settings. Run this on one node as the oracle user:
srvctl config scan

Stop the cluster nodeapps as the oracle user:
srvctl stop nodeapps

Change the SCAN/VIP network settings. Run this on all nodes as the root user (set the correct VIP address for each node):
srvctl modify nodeapps -n <node_name> -A 10.77.5.129/255.255.255.0/bond0

Check the current VIP settings. Run this on all nodes as the oracle user:
srvctl config vip -n <node_name>

Start the cluster nodeapps as the oracle user:
srvctl start nodeapps

Check the cluster's current network status as the oracle user:
srvctl status nodeapps

Check the current SCAN settings. Run this on one node as the oracle user:
srvctl config scan


 

7. Restart CRS to verify that it starts correctly:


Do these on both nodes as the root user:

crsctl check crs
crsctl stop crs
crsctl check crs
crsctl start crs
crsctl check crs


8. Restart databases:

Do this on one node as the oracle user (for all databases in the cluster):
srvctl status database -d <database_name> 
srvctl start database -d <database_name> 
srvctl status database -d <database_name>


9. Test that you can connect to the databases via all SCAN/VIP addresses. You can do this, for example, with sqlplus (a connection sketch follows below).
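
A minimal sqlplus test could look like the line below; the user, SCAN name, port and service name are placeholders that you need to replace with your own values, and the same kind of test should be repeated against each VIP address:

sqlplus system@//<scan_name>:1521/<service_name>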

 

Tuesday, September 10, 2013

Oracle 11.2.0.3.0 DATABASE CRASHED DUE TO ORA-240 AND ORA-15064

There is a bug in Oracle 11.2.0.3.0 which can make your database instance restart itself.
If you see the following errors in the database alert.log, you know this bug is affecting your database (a grep sketch for checking the alert log follows the error list):
ORA-00240: control file enqueue held for more than 120 seconds
ORA-29770: global enqueue process LCK0 (OSID 12329) is hung for more than 70 seconds

ORA-15064: communication failure with ASM instance
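
A quick way to check whether an instance has already hit these errors is to grep its alert log; the path below assumes the 11.2 default diagnostic destination and uses placeholder names:

grep -E "ORA-00240|ORA-29770|ORA-15064" $ORACLE_BASE/diag/rdbms/<db_name>/<instance_name>/trace/alert_<instance_name>.log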

There is a bugfix for this problem; you can download it from My Oracle Support (MOS) as patch 13914613.
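
A rough outline of applying the one-off patch with OPatch is sketched below. This assumes the patch zip has already been downloaded and unzipped, and it is no substitute for the patch README, which describes the exact steps for RAC and rolling/non-rolling installation:

# as the oracle user, with the instances on this ORACLE_HOME stopped
cd <unzipped_patch_directory>/13914613
$ORACLE_HOME/OPatch/opatch apply
$ORACLE_HOME/OPatch/opatch lsinventory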

Another way to fix this is to upgrade your database to a newer version where this bug is also fixed.

More info about this can be found in the following MOS documents:
Database Instance Crashes with ORA-15064 ORA-03135 ORA-00240 on 11.2 (Doc ID 1487108.1)

Bug 13914613 - Excessive time holding shared pool latch in kghfrunp with auto memory management (Doc ID 13914613.8)