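A two-node 10.2.0.2 RAC system running on SunOS recently raised the internal error ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []. Shortly before the error, an operator had mistakenly used the hostname command to change the hostname of node 1. The ORA-00600 error then kept recurring, and the operating system log showed that the RAC CSS daemon had terminated unexpectedly. The relevant logs follow: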
================== OS Message=====================
Jan 10 11:15:10 cupd25k-a root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Duplicate Oracle CLSMON found. Killing and restarting it.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CSS daemon failed to start up. Check CRS logs for diagnostics.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CLSMON terminated with unexpected status 137. Respawning
/* "Duplicate Oracle CLSMON found" here most likely refers to the OCLSOMON process:
"In Oracle 10.2.0.2 and above there is an additional process called OCLSOMON
which monitors the CSS daemon for hangs or scheduling issues and can reboot a
node if there is a perceived hang. OCLSOMON is spawned in init.cssd and runs
as the Oracle user."
In short, this process (logged here as oclsmon) was introduced in 10.2.0.2 to monitor the CSS daemon;
if CSS hangs or the operating system has scheduling problems, it may reboot the node,
and it is spawned by the init.cssd script. */
==================oclsmon.log======================
2011-01-10 11:15:11.376
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:11.479
Internal Error Information:
Category: 8
Operation: skgxnreg: the member number is i
Location: skgxnreg_7
Other:
Dep: 1
2011-01-10 11:15:11.737
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:11.751
Internal Error Information:
Category: 8
Operation: skgxnreg: the member number is i
Location: skgxnreg_7
Other:
Dep: 1
2011-01-10 11:15:12.006
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:12.023
Internal Error Information:
Category: 8
Operation: skgxnreg: the member number is i
Location: skgxnreg_7
Other:
Dep: 1
2011-01-10 11:15:12.278
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:12.293
Internal Error Information:
Category: 8
Operation: skgxnreg: the member number is i
Location: skgxnreg_7
Other:
Dep: 1
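/* skgxn is the interface Oracle Clusterware uses to track skgxn events, that is, events involving third-party clusterware (this site presumably runs Sun Cluster).
It appears that changing the hostname caused a fatal error in Oracle CSS and led to more than one OCLSMON process being started ("Duplicate Oracle CLSMON found"),
ending with "Oracle CSS daemon failed to start up. Check CRS logs for diagnostics".
Because Oracle instances were running when the CSS daemon on node 25k-a terminated unexpectedly,
LMD (Global Enqueue Service Daemon) and LMON on every instance of that node may have stopped working properly, causing the instances to hang. */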
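The oclsmon.log excerpt above repeats the same two records every few hundred milliseconds, which makes it hard to see how many distinct problems there are. A minimal sketch of collapsing such a log is shown below. It assumes each record starts with a timestamp line in the format seen above; the function names `split_records` and `summarize` are illustrative and not part of any Oracle tool.

```python
import re
from collections import Counter

# A record is assumed to start with a line like "2011-01-10 11:15:11.376".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")


def split_records(text):
    """Split log text into (timestamp, body_lines) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            records.append((line, []))
        elif records:
            # Lines before the first timestamp are ignored.
            records[-1][1].append(line)
    return records


def summarize(text):
    """Count records by their first body line, ignoring timestamps."""
    return Counter(body[0] for _, body in split_records(text) if body)


# Sample taken from the log excerpt above.
SAMPLE = """\
2011-01-10 11:15:11.376
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:11.479
Internal Error Information:
Category: 8
Operation: skgxnreg: the member number is i
Location: skgxnreg_7
Other:
Dep: 1
2011-01-10 11:15:11.737
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
"""

if __name__ == "__main__":
    for message, count in summarize(SAMPLE).most_common():
        print(count, message)
```

Grouping by the first body line is a deliberately crude key; grouping by the `Location:` field would be a reasonable alternative for the internal error records.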
==========================alert.log====================
Errors in file /oracle/oracle/admin/BOCPCS/udump/bocpcs1_ora_12320.trc:
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []
=========================part of trace file===============
*** 2011-01-10 11:11:02.957
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+716 CALL ksedst() FFFFFFFF7FFF9D40 ?
000000000 ? 0FFFFFFFF ?
FFFFFFFF7FFF8EE8 ?
FFFFFFFF7FFFA640 ?
000000008 ?
kgerinv()+200 PTR_CALL 0000000000000000 000000002 ? 10638A1CC ?
000000001 ? 000000000 ?
10638A000 ? 10638A1CC ?
kgeasnmierr()+28 CALL kgerinv() 106384B98 ? 000000000 ?
105D3B940 ? 000000002 ?
FFFFFFFF7FFFDFF0 ?
000001430 ?
keltnfy()+784 CALL kgeasnmierr() 106384B98 ? 1064DCBF0 ?
105D3B940 ? 000000002 ?
000000000 ? 00000002E ?
kscnfy()+552 PTR_CALL 0000000000000000 10639B498 ? 38001E7A8 ?
1055AC5D0 ? 10639B498 ?
000102C00 ? 10638A1C0 ?
ksucrp()+2436 CALL kscnfy() 000008000 ? 000808214 ?
100C4C220 ? 1055C6680 ?
00000000F ? 000000001 ?
opiino()+2056 CALL ksucrp() 000106387 ? 380007608 ?
000000000 ? 000380000 ?
000106000 ? 106387618 ?
opiodr()+1488 PTR_CALL 0000000000000000 10555A000 ?
FFFFFFFF7FFFF1C8 ?
00010555A ? 000106000 ?
105C83000 ? 000000001 ?
opidrv()+828 CALL opiodr() 106391000 ? 000000000 ?
106390DD8 ? 106390000 ?
106391BD0 ? 000106000 ?
sou2o()+80 CALL opidrv() 106394358 ? 000000001 ?
00000003C ? 000000000 ?
00000003C ? 000106000 ?
opimai_real()+124 CALL sou2o() FFFFFFFF7FFFF788 ?
00000003C ? 000000004 ?
FFFFFFFF7FFFF7B0 ?
105C82000 ? 000105C82 ?
main()+152 CALL opimai_real() 000000002 ?
FFFFFFFF7FFFF888 ?
103F1BBCC ? 10632DB10 ?
002411E44 ? 000014400 ?
_start()+380 CALL main() 000000002 ? 000000008 ?
000000000 ?
FFFFFFFF7FFFF898 ?
FFFFFFFF7FFFF9A8 ?
FFFFFFFF7C700200 ?
/* The trace file above reports "no session":
the keltnfy-ldmInit internal error was hit while the server process was still starting up. */
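The MetaLink note "Startup Database Produces Ora-00600: [Keltnfy-Ldminit]" [ID 336447.1]
explains that this internal error is usually caused by improper network configuration on the host. Clearly, using the hostname command to set
a hostname that cannot be resolved can trigger the ORA-00600 [keltnfy-ldmInit] internal error. The note is excerpted below: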
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3 - Release: 10.2 to 10.2
Information in this document applies to any platform.
***Checked for relevance on 09-Jun-2010***
Symptoms
A startup nomount on an Oracle 10g Release 2 database produces the following exception in the alert log:
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
Errors in file /opt/oracle/10.2/admin/ORCL/udump/ORCL_ora_535.trc:
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []
USER: terminating instance due to error 600
Instance terminated by USER, pid = 535
Cause
The problem is related to getting host information.
In this case, ldmInit()/sldmInit() is failing with error 46 : LDMERR_HOST_NOT_FOUND
The following exception may also occur :
LDMERR_SOSD_INIT : OSD init failed to be specific in these OSD failures
LDMERR_BAD_ADDR : bad address when system call gethostname failed
LDMERR_HOST_NOT_FOUND : gethostbyname system call fails
LDMERR_NO_SUPPORT : when specific address type is not supported
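As an aside that is not part of the MetaLink note, the mapping from the ORA-00600 arguments to the ldmInit error can be sketched as follows. The sketch assumes that the second bracketed argument is the ldmInit error code, which is inferred from "failing with error 46" above. Only code 46 is mapped, because the note does not give numeric values for the other LDMERR_* names. The names `parse_ora600` and `describe` are illustrative.

```python
import re

# Only 46 is documented in the note quoted above; the numeric values of the
# other LDMERR_* names are not given there, so they are intentionally absent.
LDM_ERRORS = {46: "LDMERR_HOST_NOT_FOUND"}

ORA600 = re.compile(r"ORA-00600: internal error code, arguments: (.*)")


def parse_ora600(line):
    """Return the bracketed arguments of an ORA-00600 line, or None."""
    match = ORA600.search(line)
    if match is None:
        return None
    return re.findall(r"\[([^\]]*)\]", match.group(1))


def describe(line):
    """Explain a keltnfy-ldmInit ORA-00600; return None for any other line."""
    args = parse_ora600(line)
    if not args or args[0].lower() != "keltnfy-ldminit":
        return None
    # Assumption: the second argument carries the ldmInit error code.
    if len(args) < 2 or not args[1].isdigit():
        return "ldmInit failed (no numeric error code found)"
    code = int(args[1])
    name = LDM_ERRORS.get(code, "unknown ldmInit error")
    return f"ldmInit failed with error {code}: {name}"


if __name__ == "__main__":
    alert_line = ("ORA-00600: internal error code, arguments: "
                  "[keltnfy-ldmInit], [46], [1], [], [], [], [], []")
    print(describe(alert_line))
```

Run against an alert.log, a filter like this would separate host-lookup failures from unrelated ORA-00600 errors before any network checks are started.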
Development has fixed two bugs so far regarding this issue
Bug:5438154 - Abstract: ORA-600[KELTNFY-LDMINIT] STARTING THE DB
Release Notes:
ldmInit returned LDMERR_HOST_NOT_FOUND for a machine with a huge alias list/address list
Workaround:
reduce the alias list of the machine
Bug:5486074 - Abstract: ORA-600 [KELTNFY-LDMINIT] WHEN DNS IS NOT AVAILABLE
Release Notes:
Internal error is raised by the Server Generated Alert subsystem when it can not determine Host Name or
Network Address. This can be caused by the DNS server being unavailable.
Solution
The fix for 5486074 will not fix any underlying error from gethostbyname(); it just changes the internal error to a warning message:
"Warning: keltnfy call to ldmInit failed with error 46"
You will still need to fix the network config issue.
These are the checks you can do to verify the host information:
Check permission on /etc/hosts
$ ls -l /etc/hosts
-rw-r--r-- 2 root root 194 Oct 17 2006 /etc/hosts
Check if /etc/hosts file is correctly configured
( all of this on one line ).
Check the hostname:
$ hostname
$ ping `hostname`
Make sure you are able to ping the hostname
Check if /etc/nodename is correctly configured
If you have DNS set up, ping is not a tool for diagnosing DNS problems. A better tool to use is nslookup, dnsquery, or dig.
$ nslookup
$ nslookup
$ nslookup
The forward and reverse lookup should succeed and return consistent address/info.
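As an illustration that is not part of the MetaLink note, the forward and reverse consistency check can be sketched with the standard Python `socket` module. The function name `check_forward_reverse` is hypothetical, and the rule that a match on the short host name counts as consistent is an assumption of this sketch. The resolver functions are passed in as parameters so the logic can be tried without touching the real network.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an address and back; report whether they agree.

    Returns a tuple (ok, detail).
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname!r} failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {address} failed: {exc}"

    # Compare case-insensitively against the primary name and all aliases.
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    wanted = hostname.lower()
    # Assumption: matching on the short name (before the first dot) is enough.
    short_names = {name.split(".")[0] for name in names}
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} <-> {address}"
    return False, f"{hostname} -> {address} -> {primary} (mismatch)"
```

On the affected node, `check_forward_reverse(socket.gethostname())` would be expected to fail at the forward lookup when the host name has been changed to one that cannot be resolved.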
Check nsswitch.conf
$ more nsswitch.conf
hosts: files dns
Make sure host lookup is also done through the /etc/hosts file and not just dns. It is recommended that FILES come first before DNS.
Also, check the resolv.conf. This makes sure that the DNS is working properly.
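Again as an aside outside the MetaLink note, the recommendation that FILES come before DNS can be checked mechanically. The sketch below parses the text of nsswitch.conf and inspects the `hosts:` line; the function names `hosts_sources` and `files_before_dns` are illustrative, and bracketed action items such as `[NOTFOUND=return]` are simply ignored.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts:' line, in order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("hosts:"):
            rest = line.split(":", 1)[1]
            # Remove bracketed action items, e.g. [NOTFOUND=return].
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


def files_before_dns(nsswitch_text):
    """True when 'files' is listed and precedes 'dns' (or dns is absent)."""
    sources = hosts_sources(nsswitch_text)
    if "files" not in sources:
        return False
    return "dns" not in sources or sources.index("files") < sources.index("dns")


if __name__ == "__main__":
    print(files_before_dns("hosts: files dns\n"))
```

On a real host the input would typically come from reading /etc/nsswitch.conf; the check says nothing about whether /etc/hosts itself contains a correct entry for the node.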
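Clearly, running the hostname command on a production host is dangerous: you can hardly guarantee that a colleague's slap on the back will not make you mistype. Some people say the rm command should be disabled in production, and the same special treatment applies to hostname. So what can we use instead of hostname to check the host name? There are many options; here is the one I recommend: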
-bash-3.00$ oslevel -r
5300-07
-bash-3.00$ hostname
oracledatabase12g.com
-bash-3.00$ uname -n
oracledatabase12g.com
/* uname -n fully meets the need! */
That's great!