I. Background
During a routine inspection over the National Day holiday, messages indicating an unexpected instance restart were found in the database alert log. We immediately began analyzing the error.
II. Troubleshooting Process
(1) Analysis of the database alert log
node1 alert:
Sat Oct 05 13:05:14 2024
Thread 1 advanced to log sequence 6981 (LGWR switch)
  Current log# 11 seq# 6981 mem# 0: DATA/ybqddb/onlinelog/group_11.302.1144593261
Sat Oct 05 13:05:15 2024
Archived Log entry 12130 added for thread 1 sequence 6980 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:48 2024
Reconfiguration started (old inc 27, new inc 29)
List of instances: 1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Oct 05 14:50:48 2024
Sat Oct 05 14:50:48 2024
LMS 3: 1 GCS shadows cancelled, 1 closed, 0 Xw survived
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Oct 05 14:50:48 2024
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 76 KB redo, 16 data blocks need recovery
Started redo application at
 Thread 2: logseq 5168, block 86069
Sat Oct 05 14:50:53 2024
Setting Resource Manager plan SCHEDULER[0x32DE]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Recovery of Online Redo Log: Thread 2 Group 10 Seq 5168 Reading mem 0
  Mem# 0: DATA/ybqddb/onlinelog/group_10.300.1144593255
Completed redo application of 0.01MB
Completed instance recovery at
 Thread 2: logseq 5168, block 86222, scn 245107035
 16 data blocks read, 16 data blocks written, 76 redo k-bytes read
Sat Oct 05 14:50:53 2024
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:29 new-inc#:29
Thread 2 advanced to log sequence 5169 (thread recovery)
Redo thread 2 internally disabled at seq 5169 (SMON)
Sat Oct 05 14:50:54 2024
Archived Log entry 12131 added for thread 2 sequence 5168 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:54 2024
ARC2: Archiving disabled thread 2 sequence 5169
Archived Log entry 12132 added for thread 2 sequence 5169 ID 0x8d497377 dest 1:
minact-scn: master continuing after IR
minact-scn: Master considers inst:2 dead
Sat Oct 05 14:51:49 2024
Decreasing number of real time LMS from 4 to 0
Sat Oct 05 14:52:11 2024
Reconfiguration started (old inc 29, new inc 31)
List of instances: 1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Oct 05 14:52:11 2024
LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Sat Oct 05 14:52:11 2024
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:31 errcnt:0
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Sat Oct 05 14:53:26 2024
Increasing number of real time LMS from 0 to 4
Sat Oct 05 17:05:20 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 17:05:21 2024
Thread 1 advanced to log sequence 6982 (LGWR switch)
  Current log# 5 seq# 6982 mem# 0: DATA/ybqddb/onlinelog/group_5.290.1144593225
Sat Oct 05 17:05:21 2024
Archived Log entry 12134 added for thread 1 sequence 6981 ID 0x8d497377 dest 1:
Sat Oct 05 21:05:22 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 21:05:22 2024
Thread 1 advanced to log sequence 6983 (LGWR switch)
  Current log# 7 seq# 6983 mem# 0: DATA/ybqddb/onlinelog/group_7.294.1144593235
Sat Oct 05 21:05:23 2024
Archived Log entry 12135 added for thread 1 sequence 6982 ID 0x8d497377 dest 1:
Sun Oct 06 01:08:47 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 01:08:49 2024
Thread 1 advanced to log sequence 6984 (LGWR switch)
  Current log# 9 seq# 6984 mem# 0: DATA/ybqddb/onlinelog/group_9.298.1144593249
Sun Oct 06 01:08:49 2024
Archived Log entry 12138 added for thread 1 sequence 6983 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:18 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 05:05:18 2024
Thread 1 advanced to log sequence 6985 (LGWR switch)
  Current log# 11 seq# 6985 mem# 0: DATA/ybqddb/onlinelog/group_11.302.1144593261
Archived Log entry 12139 added for thread 2 sequence 5173 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:19 2024
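Scanning a long RAC alert log for the reconfiguration evidence by eye is tedious. A minimal sketch of how the relevant markers can be pulled out programmatically (the marker strings are standard alert-log phrases; the embedded excerpt stands in for reading the real alert_&lt;SID&gt;.log file, whose path is site-specific):

```python
# Minimal sketch: extract RAC reconfiguration markers from alert-log text.
# In practice the text would come from the real alert log file.
ALERT_EXCERPT = """\
Sat Oct 05 14:50:48 2024
Reconfiguration started (old inc 27, new inc 29)
List of instances: 1 (myinst: 1)
* dead instance detected - domain 0 invalid = TRUE
Beginning instance recovery of 1 threads
Reconfiguration complete
"""

MARKERS = ("Reconfiguration started", "dead instance detected",
           "Beginning instance recovery", "Reconfiguration complete")

def extract_markers(text):
    """Return alert-log lines that signal a cluster reconfiguration."""
    return [line for line in text.splitlines()
            if any(m in line for m in MARKERS)]

for line in extract_markers(ALERT_EXCERPT):
    print(line)
```

Running this against node 1's full alert log surfaces the same sequence shown above: reconfiguration at 14:50:48, a dead instance detected, and instance recovery of node 2's thread.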
node2 alert:
Sat Oct 05 13:05:14 2024
Archived Log entry 12129 added for thread 2 sequence 5167 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:47 2024
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2109 Serial number: 3
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2109 Serial number: 3
ASMB (ospid: 15097): terminating the instance due to error 15064
Instance terminated by ASMB, pid 15097
Sat Oct 05 14:51:59 2024
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
The node 2 log shows that at 14:50:47 a communication failure with the ASM instance was reported, and at 14:51:59 the instance on this node began to start back up.
(2) Review of the CRS and ocssd logs
node1 crs log:
2024-07-26 03:02:59.119: [client(240419)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-4657 : Name resolution setup check for nxybj-scan (IP address: 172.31.122.38) failed
PRVF-4664 : Found inconsistent name resolution entries for SCAN name nxybj-scan
2024-10-05 14:50:31.281: [cssd(10595)]CRS-1612:Network communication with node nxybj2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.200 seconds
2024-10-05 14:50:38.282: [cssd(10595)]CRS-1611:Network communication with node nxybj2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.200 seconds
2024-10-05 14:50:43.283: [cssd(10595)]CRS-1610:Network communication with node nxybj2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.200 seconds
2024-10-05 14:50:45.484: [cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: [cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down
2024-10-05 14:50:47.999: [cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 .
2024-10-05 14:50:48.000: [crsd(14356)]CRS-5504:Node down event reported for node nxybj2.
2024-10-05 14:50:51.248: [crsd(14356)]CRS-2773:Server nxybj2 has been removed from pool Generic.
2024-10-05 14:50:51.249: [crsd(14356)]CRS-2773:Server nxybj2 has been removed from pool ora.orcl.
2024-10-05 14:50:51.249: [crsd(14356)]CRS-2773:Server nxybj2 has been removed from pool ora.ybqddb.
2024-10-05 14:50:51.249: [crsd(14356)]CRS-2773:Server nxybj2 has been removed from pool ora.ybzxddb.
2024-10-05 14:51:13.084: [cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:42.117: [crsd(14356)]CRS-2772:Server nxybj2 has been assigned to pool Generic.
2024-10-05 14:51:42.118: [crsd(14356)]CRS-2772:Server nxybj2 has been assigned to pool ora.ybzxddb.
2024-10-05 14:51:42.118: [crsd(14356)]CRS-2772:Server nxybj2 has been assigned to pool ora.ybqddb.
2024-10-05 14:51:42.118: [crsd(14356)]CRS-2772:Server nxybj2 has been assigned to pool ora.orcl.
node1 ocssd log:
2024-10-05 14:50:28.553: [ CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:31.281: [ CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) at 50% heartbeat fatal, removal in 14.200 seconds
2024-10-05 14:50:31.281: [ CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) is impending reconfig, flag 2229260, misstime 15800
2024-10-05 14:50:31.281: [ CSSD][2948028160]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.282: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693547, LATS 2409222280, lastSeqNo 108691578, uniqueness 1708519428, timestamp 1728111030/2407806020
2024-10-05 14:50:31.282: [ CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693549, LATS 2409222280, lastSeqNo 108693378, uniqueness 1708519428, timestamp 1728111030/2407806140
2024-10-05 14:50:31.691: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222690
2024-10-05 14:50:31.781: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222780
2024-10-05 14:50:31.784: [ CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693551, LATS 2409222790, lastSeqNo 108691579, uniqueness 1708519428, timestamp 1728111031/2407807110
2024-10-05 14:50:31.811: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222810
2024-10-05 14:50:32.193: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223200
2024-10-05 14:50:32.282: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693553, LATS 2409223280, lastSeqNo 108693547, uniqueness 1708519428, timestamp 1728111032/2407807530
2024-10-05 14:50:32.291: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223290
2024-10-05 14:50:32.321: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223320
2024-10-05 14:50:32.701: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223700
2024-10-05 14:50:32.783: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693556, LATS 2409223790, lastSeqNo 108693553, uniqueness 1708519428, timestamp 1728111032/2407808040
2024-10-05 14:50:32.793: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223790
2024-10-05 14:50:32.823: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223820
2024-10-05 14:50:33.203: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224200
2024-10-05 14:50:33.282: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693559, LATS 2409224280, lastSeqNo 108693556, uniqueness 1708519428, timestamp 1728111033/2407808540
2024-10-05 14:50:33.301: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224300
2024-10-05 14:50:33.331: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224330
2024-10-05 14:50:33.554: [ CSSD][2946451200]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:33.554: [ CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:33.711: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224710
2024-10-05 14:50:33.784: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693562, LATS 2409224790, lastSeqNo 108693559, uniqueness 1708519428, timestamp 1728111033/2407809050
2024-10-05 14:50:33.803: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224800
2024-10-05 14:50:33.833: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224840
2024-10-05 14:50:34.213: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225210
2024-10-05 14:50:34.282: [ CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693563, LATS 2409225280, lastSeqNo 108693551, uniqueness 1708519428, timestamp 1728111033/2407809130
2024-10-05 14:50:34.282: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693565, LATS 2409225280, lastSeqNo 108693562, uniqueness 1708519428, timestamp 1728111034/2407809550
2024-10-05 14:50:34.311: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225310
2024-10-05 14:50:34.341: [ CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225340
2024-10-05 14:50:34.721: [ CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225720
2024-10-05 14:50:34.784: [ CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693567, LATS 2409225790, lastSeqNo 108693549, uniqueness 1708519428, timestamp 1728111034/2407809680
2024-10-05 14:50:34.784: [ CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693568, LATS 2409225790, lastSeqNo 108693565, uniqueness 1708519428, timestamp 1728111034/2407810050
2024-10-05 14:50:34.813: [ CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225810
node2 crs log:
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-03 05:57:32.243: [ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-05 14:50:31.036: [cssd(10538)]CRS-1612:Network communication with node nxybj1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.520 seconds
2024-10-05 14:50:39.038: [cssd(10538)]CRS-1611:Network communication with node nxybj1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.520 seconds
2024-10-05 14:50:43.039: [cssd(10538)]CRS-1610:Network communication with node nxybj1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.520 seconds
2024-10-05 14:50:45.560: [cssd(10538)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: [cssd(10538)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log
2024-10-05 14:50:45.560: [cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: [cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.577: [cssd(10538)]CRS-1652:Starting clean up of CRSD resources.
2024-10-05 14:50:46.038: [cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:46.803: [/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process /u01/app/11.2.0/grid/opmn/bin/onsctli spawned by agent /u01/app/11.2.0/grid/bin/oraagent.bin for action check failed: details at (:CLSN00010:) in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log
2024-10-05 14:50:47.404: [/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process /u01/app/11.2.0/grid/bin/lsnrctl spawned by agent /u01/app/11.2.0/grid/bin/oraagent.bin for action check failed: details at (:CLSN00010:) in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log
2024-10-05 14:50:47.405: [cssd(10538)]CRS-1654:Clean up of CRSD resources finished successfully.
2024-10-05 14:50:47.406: [cssd(10538)]CRS-1655:CSSD on node nxybj2 detected a problem and started to shutdown.
2024-10-05 14:50:47.434: [/u01/app/11.2.0/grid/bin/oraagent.bin(14698)]CRS-5822:Agent /u01/app/11.2.0/grid/bin/oraagent_oracle disconnected from server. Details at (:CRSAGF00117:) {0:7:28777} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2024-10-05 14:50:47.436: [ohasd(10322)]CRS-2765:Resource ora.crsd has failed on server nxybj2.
2024-10-05 14:50:47.437: [/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5822:Agent /u01/app/11.2.0/grid/bin/oraagent_grid disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log.
2024-10-05 14:50:47.437: [/u01/app/11.2.0/grid/bin/orarootagent.bin(14460)]CRS-5822:Agent /u01/app/11.2.0/grid/bin/orarootagent_root disconnected from server. Details at (:CRSAGF00117:) {0:3:2522} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/orarootagent_root/orarootagent_root.log.
2024-10-05 14:50:47.691: [cssd(10538)]CRS-1660:The CSS daemon shutdown has completed
2024-10-05 14:50:47.846: [ohasd(10322)]CRS-2765:Resource ora.evmd has failed on server nxybj2.
2024-10-05 14:50:47.848: [ohasd(10322)]CRS-2765:Resource ora.ctssd has failed on server nxybj2.
2024-10-05 14:50:48.452: [crsd(345044)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/nxybj2/crsd/crsd.log.
2024-10-05 14:50:48.598: [ohasd(10322)]CRS-2765:Resource ora.cssdmonitor has failed on server nxybj2.
2024-10-05 14:50:48.603: [ohasd(10322)]CRS-2765:Resource ora.crsd has failed on server nxybj2.
2024-10-05 14:50:48.861: [ctssd(348486)]CRS-2402:The Cluster Time Synchronization Service aborted on host nxybj2. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/nxybj2/ctssd/octssd.log.
2024-10-05 14:50:49.102: [ohasd(10322)]CRS-2765:Resource ora.cssd has failed on server nxybj2.
2024-10-05 14:50:49.105: [/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource ASM failed: details at (:CLSN00006:) in /u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log
2024-10-05 14:50:49.271: [/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource ASM failed: details at (:CLSN00006:) in /u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log
2024-10-05 14:50:49.272: [ohasd(10322)]CRS-2765:Resource ora.asm has failed on server nxybj2.
2024-10-05 14:50:51.030: [cssd(349990)]CRS-1713:CSSD daemon is started in clustered mode
2024-10-05 14:50:52.863: [ohasd(10322)]CRS-2878:Failed to restart resource ora.ctssd
2024-10-05 14:50:52.863: [ohasd(10322)]CRS-2769:Unable to failover resource ora.ctssd.
2024-10-05 14:51:06.688: [cssd(349990)]CRS-1707:Lease acquisition for node nxybj2 number 2 completed
2024-10-05 14:51:07.344: [ohasd(10322)]CRS-2765:Resource ora.cluster_interconnect.haip has failed on server nxybj2.
2024-10-05 14:51:07.954: [cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr02; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.961: [cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr03; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.968: [cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr01; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:13.083: [cssd(349990)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:15.026: [ctssd(351425)]CRS-2403:The Cluster Time Synchronization Service on host nxybj2 is in observer mode.
2024-10-05 14:51:15.261: [ctssd(351425)]CRS-2407:The new Cluster Time Synchronization Service reference node is host nxybj1.
2024-10-05 14:51:15.261: [ctssd(351425)]CRS-2401:The Cluster Time Synchronization Service started on host nxybj2.
2024-10-05 14:51:17.018: [ohasd(10322)]CRS-2767:Resource state recovery not attempted for ora.diskmon as its target state is OFFLINE
2024-10-05 14:51:17.018: [ohasd(10322)]CRS-2769:Unable to failover resource ora.diskmon.
2024-10-05 14:51:22.920: [ohasd(10322)]CRS-2878:Failed to restart resource ora.cluster_interconnect.haip
2024-10-05 14:51:22.920: [/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource ASM failed: details at (:CLSN00006:) in /u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log
2024-10-05 14:51:23.116: [/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource ASM failed: details at (:CLSN00006:) in /u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log
2024-10-05 14:51:30.099: [ohasd(10322)]CRS-2878:Failed to restart resource ora.asm
2024-10-05 14:51:31.109: [ohasd(10322)]CRS-2878:Failed to restart resource ora.crsd
2024-10-05 14:51:36.711: [crsd(353461)]CRS-1012:The OCR service started on node nxybj2.
2024-10-05 14:51:36.717: [evmd(348478)]CRS-1401:EVMD started on node nxybj2.
2024-10-05 14:51:37.724: [crsd(353461)]CRS-1201:CRSD started on node nxybj2.
2024-10-05 14:51:39.350:
node2 ocssd log:
2024-10-05 14:50:31.036: [ CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) at 50% heartbeat fatal, removal in 14.520 seconds
2024-10-05 14:50:31.036: [ CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) is impending reconfig, flag 2491406, misstime 15480
2024-10-05 14:50:31.036: [ CSSD][1568802560]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.037: [ CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592072, LATS 2407806370, lastSeqNo 108585015, uniqueness 1708518327, timestamp 1728111030/2409221770
2024-10-05 14:50:31.037: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592073, LATS 2407806370, lastSeqNo 108589502, uniqueness 1708518327, timestamp 1728111030/2409221810
2024-10-05 14:50:31.687: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807020
2024-10-05 14:50:31.777: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807110
2024-10-05 14:50:31.817: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807150
2024-10-05 14:50:32.037: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592076, LATS 2407807370, lastSeqNo 108592073, uniqueness 1708518327, timestamp 1728111031/2409222810
2024-10-05 14:50:32.197: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807530
2024-10-05 14:50:32.287: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807620
2024-10-05 14:50:32.327: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807660
2024-10-05 14:50:32.480: [ CSSD][1567225600]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:32.480: [ CSSD][1567225600]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:32.539: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592079, LATS 2407807880, lastSeqNo 108592076, uniqueness 1708518327, timestamp 1728111032/2409223320
2024-10-05 14:50:32.698: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808040
2024-10-05 14:50:32.789: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808130
2024-10-05 14:50:32.828: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808170
2024-10-05 14:50:33.038: [ CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592080, LATS 2407808370, lastSeqNo 108591299, uniqueness 1708518327, timestamp 1728111032/2409223700
2024-10-05 14:50:33.038: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592082, LATS 2407808370, lastSeqNo 108592079, uniqueness 1708518327, timestamp 1728111032/2409223820
2024-10-05 14:50:33.207: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808540
2024-10-05 14:50:33.297: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808630
2024-10-05 14:50:33.337: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808670
2024-10-05 14:50:33.540: [ CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592084, LATS 2407808880, lastSeqNo 108592072, uniqueness 1708518327, timestamp 1728111033/2409224300
2024-10-05 14:50:33.546: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592085, LATS 2407808880, lastSeqNo 108592082, uniqueness 1708518327, timestamp 1728111033/2409224330
2024-10-05 14:50:33.709: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809050
2024-10-05 14:50:33.799: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809130
2024-10-05 14:50:33.838: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809170
2024-10-05 14:50:34.038: [ CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592086, LATS 2407809370, lastSeqNo 108592080, uniqueness 1708518327, timestamp 1728111033/2409224710
2024-10-05 14:50:34.038: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592088, LATS 2407809370, lastSeqNo 108592085, uniqueness 1708518327, timestamp 1728111033/2409224840
2024-10-05 14:50:34.217: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809550
2024-10-05 14:50:34.307: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809640
2024-10-05 14:50:34.347: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809680
2024-10-05 14:50:34.540: [ CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592089, LATS 2407809880, lastSeqNo 108592086, uniqueness 1708518327, timestamp 1728111034/2409225210
2024-10-05 14:50:34.540: [ CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592091, LATS 2407809880, lastSeqNo 108592088, uniqueness 1708518327, timestamp 1728111034/2409225340
2024-10-05 14:50:34.718: [ CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810050
2024-10-05 14:50:34.809: [ CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810150
2024-10-05 14:50:34.849: [ CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810180
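The CRS-1612/1611/1610 countdown in the ocssd logs follows directly from the CSS misscount: the "removal in N seconds" value is misscount minus the time heartbeats have already been missing. A small worked check, assuming the 11.2 default misscount of 30 seconds (an assumption here; verify on the cluster with `crsctl get css misscount`):

```python
# Assumed CSS misscount: 30 s (11.2 default; confirm with
# `crsctl get css misscount` on the actual cluster).
MISSCOUNT_MS = 30_000

def removal_countdown(misstime_ms):
    """Seconds until node removal, given how long heartbeats have been missing."""
    return (MISSCOUNT_MS - misstime_ms) / 1000.0

def pct_of_timeout(misstime_ms):
    """Missing-heartbeat time as a percentage of the misscount interval."""
    return 100.0 * misstime_ms / MISSCOUNT_MS

# node 1 logged: misstime 15800 -> "50% ... removal in 14.200 seconds"
print(pct_of_timeout(15_800), removal_countdown(15_800))
# node 2 logged: misstime 15480 -> "50% ... removal in 14.520 seconds"
print(pct_of_timeout(15_480), removal_countdown(15_480))
```

Both computed countdowns (14.2 s and 14.52 s) match the CRS-1612 messages quoted above exactly, which confirms the two nodes lost network heartbeats with each other at essentially the same moment while disk heartbeats kept working.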
The CRS logs show that communication failures between the nodes began at 14:50:31, after which node 2 restarted.
(3) Review of the node 2 system log
node2 system log:
[root@nxybj2 log]# more messages-20241006
Sep 29 03:50:01 nxybj2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7519" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 29 03:50:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Sep 30 03:28:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 1 04:16:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 2 03:24:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 3 03:23:03 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 4 03:27:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 5 03:34:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Oct 5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1, 172.31.122.37#123, interface stats: received=0, sent=0, dropped=0, active_time=6191372 secs
Oct 5 14:51:15 nxybj2 ntpd[394054]: Deleting interface #6 bond1:1, 169.254.209.169#123, interface stats: received=0, sent=0, dropped=0, active_time=6191401 secs
Oct 5 14:51:23 nxybj2 ntpd[394054]: Listen normally on 17 bond1:1 169.254.209.169 UDP 123
Oct 5 14:51:46 nxybj2 ntpd[394054]: Listen normally on 18 bond0:1 172.31.122.37 UDP 123
The system log contains no abnormal errors.
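When the system log is large, it helps to narrow it to the incident window first. A minimal sketch of that filtering step (the sample lines are from the excerpt above; syslog timestamps carry no year, so we pin it to 2024 from the file name messages-20241006):

```python
from datetime import datetime

# Sketch: keep /var/log/messages-style lines inside the incident window.
LOG_LINES = [
    "Oct  5 03:34:02 nxybj2 rhsmd: system must be registered",
    "Oct  5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1",
    "Oct  5 14:51:23 nxybj2 ntpd[394054]: Listen normally on 17 bond1:1",
]

def in_window(line, start, end, year=2024):
    """True if the syslog line's timestamp falls inside [start, end]."""
    stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(year=year)
    return start <= stamp <= end

start = datetime(2024, 10, 5, 14, 45)
end = datetime(2024, 10, 5, 14, 55)
hits = [line for line in LOG_LINES if in_window(line, start, end)]
print(hits)
```

Applied to the real messages-20241006 file, this leaves only the ntpd interface messages around 14:50-14:51, consistent with the conclusion that the OS itself reported nothing abnormal.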
III. Root Cause Analysis
The logs show that node 2 restarted because heartbeat communication between the nodes failed. The key messages are:
2024-10-05 14:50:45.484: [cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: [cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down
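Before searching MOS, it is convenient to collect the distinct error codes from the evidence. A small sketch that pulls the CRS-/ORA- codes out of a log excerpt (the excerpt simply repeats the key messages quoted above):

```python
import re

# Sketch: collect the distinct CRS-/ORA- message codes from a log excerpt
# so each can be searched on My Oracle Support.
EXCERPT = (
    "2024-10-05 14:50:45.484: [cssd(10595)]CRS-1607:Node nxybj2 is being "
    "evicted in cluster incarnation 587229053; "
    "2024-10-05 14:50:47.994: [cssd(10595)]CRS-1625:Node nxybj2, number 2, "
    "was manually shut down "
    "ORA-15064: communication failure with ASM instance"
)

def error_codes(text):
    """Return sorted unique Oracle error codes found in the text."""
    return sorted(set(re.findall(r"\b(?:CRS|ORA)-\d+", text)))

print(error_codes(EXCERPT))
```

For this incident the list is CRS-1607, CRS-1625, and ORA-15064, which are the codes that led to the MOS match described next.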
Searching My Oracle Support (MOS) for this error and analyzing the results showed that it corresponds to a known bug. The affected version listed is 11.2.0.3, and the current production environment runs 11.2.0.4; on this basis we concluded that this node 2 restart was caused by the bug.
A further query identified the versions in which this bug has been fixed. Based on those results, resolving the bug requires applying the patch for the corresponding version.