PM64875: ME FAILOVER MAY NOT SUCCEED IF CONNECTION TO DB2 IS DETERMINED BAD.

Fixes are available

APAR status

Closed as program error.

Error description

The WebSphere Application Server that was running the Messaging
Engine (ME) was being brought down.  As expected, that caused
the ME to fail over to another cluster member on a different
LPAR.

However, the adjunct on the second LPAR received the errors
below and was terminated.  To recover, the application server
had to be manually restarted.

J2CA0206W: A connection error occurred.  To help determine the
problem, enable the Diagnose Connection Usage option on the
Connection Factory or Data Source.

J2CA0056I: The Connection Manager received a fatal connection
error from the Resource Adapter for resource
jdbc/<<<resourceName>>>. The exception is:
com.ibm.db2.jcc.am.ClientRerouteException:
[jcc][t4][2027][11212][3.59.83] A connection failed but has been
re-established. The host name or IP address is
"abc.ibm.com" and the service name or port number is 1,234.
Special registers may or may not be re-attempted (Reason code =
1). ERRORCODE=-30108, SQLSTATE=08506

Followed by FFDC error:

[jcc][t4][2027][11212][3.59.83] A connection failed but has been
re-established. The host name or IP address is
"abc.ibm.com" and the service name or port number is 1,234.
Special registers may or may not be re-attempted (Reason code =
1). ERRORCODE=-30108, SQLSTATE=08506
at com.ibm.db2.jcc.am.dd.a(dd.java:304)
at com.ibm.db2.jcc.am.dd.a(dd.java:356)
at com.ibm.db2.jcc.t4.a.a(a.java:473)
at com.ibm.db2.jcc.t4.a.L(a.java:1024)
at com.ibm.db2.jcc.t4.b.a(b.java:4885)
at com.ibm.db2.jcc.t4.l.bc(l.java:124)
at com.ibm.db2.jcc.am.cn.executeQuery(cn.java:652)
at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.pmiExecuteQuery
at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.executeQuery
at com.ibm.ws.sib.msgstore.persistence.impl.MEInnerOwnerTable.readOwningME
at com.ibm.ws.sib.msgstore.persistence.lock.DBLockingThread.waitAndRefreshLock
at com.ibm.ws.sib.msgstore.persistence.lock.DBLockingThread.run

The error above is a result of this query:
SELECT ME_UUID,INC_UUID,VERSION,MIGRATION_VERSION FROM
SIBSYS01.SIBOWNER
1003 1007 0 0 2

Finally, the HA Manager killed the JVM, bringing the adjunct down:

HMGR0130I: The local member of group <<< group name>>>
has indicated that is it not alive. The JVM will be terminated.
at java.lang.Thread.dumpStack(Thread.java:417)
at com.ibm.ws.hamanager.proxy.DispatchHAGroupCallbackImpl.isAlive(DispatchHAGroupCallbackImpl.java:193)
...
Panic:component requested panic from isAlive


In this case, the second WebSphere Application Server had
created a connection to a DB2 data sharing member that was also
being brought down.  DB2 therefore returned the
ClientRerouteException shown above, which reports that the
connection was lost but was successfully re-established to a
different data sharing member.  However, when this property is
defined:
sib.msgstore.jdbcFailoverOnDBConnectionLoss=true
a single failure on the DB2 connection is not retried, and the
ME is brought down.

This APAR will provide a property that makes it configurable
whether the connection is retried (and how many times) before
the ME is brought down.
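
The configurable retry described above can be pictured with a
minimal Java sketch.  It is an illustration only, not the
Messaging Engine implementation: the class BoundedRetry, the
method run, and the parameter maxRetries are hypothetical names,
and this APAR text does not name the new property.  The sketch
assumes that only the reroute error code seen in the log above
(-30108) is worth retrying.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retry a JDBC operation a bounded number of times. */
public final class BoundedRetry {

    // Error code reported in the ClientRerouteException shown in this APAR.
    private static final int REROUTED = -30108;

    private BoundedRetry() {}

    /**
     * Runs op once, then up to maxRetries more times while it keeps
     * failing with the reroute error code. Any other failure is rethrown
     * immediately; the last reroute failure is rethrown when retries run out.
     */
    public static <T> T run(Callable<T> op, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        SQLException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                if (e.getErrorCode() != REROUTED) {
                    throw e; // not a reroute: do not retry
                }
                last = e;    // connection was re-established: try again
            }
        }
        throw last;
    }
}
```

With maxRetries set to 0 the helper behaves like the original
code path: the first failure is final.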

Local fix

Configure sib.msgstore.jdbcFailoverOnDBConnectionLoss=false
See details on this property here:
http://pic.dhe.ibm.com/infocenter/wasinfo/v7r0/topic/com.ibm.websphere.zseries.doc/info/zseries/ae/tjm_dsconnloss.html

Problem summary

****************************************************************
* USERS AFFECTED:  Users of the default messaging provider for *
*                  IBM WebSphere Application Server versions   *
*                  7.0, 8.0, and 8.5                           *
****************************************************************
* PROBLEM DESCRIPTION: In z/OS LPARs, if the Messaging         *
*                      Engine is configured to be running in   *
*                      high availability mode and the DB2      *
*                      which is used as a datastore is also    *
*                      configured to be a clustered setup,     *
*                      when one LPAR is brought down the       *
*                      Messaging Engine on the LPAR would      *
*                      failover onto the other LPAR. But if    *
*                      the connection pool returns a           *
*                      connection that is pointing to a DB2    *
*                      instance running on the LPAR which      *
*                      was brought down then the Messaging     *
*                      Engine would initiate a local error.    *
*                      If there are only 2 LPARS then the      *
*                      system would be rendered without        *
*                      any Messaging Engine.                   *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
Consider a setup in which WebSphere Application Server runs in
an active/passive topology across z/OS LPARs and is configured
for high availability.  A bus has a Messaging Engine that is
configured for high availability on those LPARs, and the
database (DB2) is configured for high availability on the LPARs
in a similar way.  If one of the active LPARs is brought down,
the Messaging Engine on that LPAR fails over to the other LPAR.
When the Messaging Engine first starts there, it attempts to
obtain a connection from the connection pool.  The connection
pool can return a connection that points to the DB2 instance
running on the previous LPAR.  When the Messaging Engine
attempts to use that connection, the DB2 driver issues a
"ClientRerouteException".  By default,
"ClientRerouteException" is mapped to a
StaleConnectionException.  On a StaleConnectionException, the
Messaging Engine does not reattempt the previous operation,
because the connection is not guaranteed, and instead initiates
a failover.  Because the active LPAR was already brought down,
the system is left without any Messaging Engine.

Problem conclusion

After collaborating with the DB2 team, we understand that some
of the error codes in the "ClientRerouteException" mean that an
instance of DB2 is already up and running elsewhere, and that a
retry is guaranteed to connect to a running database.  The
Messaging Engine therefore checks for the error codes -30108,
-4499, and -4498, and when it finds one of them it retries the
operation instead of causing a failover.
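
The error-code check described above can be sketched in Java as
follows.  This is a hedged illustration rather than the shipped
fix: the class RerouteErrorCodes and the method isRetryable are
hypothetical names, and walking the chain of linked exceptions
is an assumption.  Only the three error codes come from this
APAR.

```java
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical classifier for the error codes listed in this APAR. */
public final class RerouteErrorCodes {

    // Codes for which, per this APAR, a retry reaches a running database.
    private static final Set<Integer> RETRYABLE = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(-30108, -4499, -4498)));

    private RerouteErrorCodes() {}

    /**
     * Returns true if the exception, or any exception linked to it via
     * getNextException(), carries one of the retryable error codes.
     */
    public static boolean isRetryable(SQLException e) {
        for (SQLException cur = e; cur != null; cur = cur.getNextException()) {
            if (RETRYABLE.contains(cur.getErrorCode())) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would retry the failed operation when isRetryable
returns true, and fall back to the existing failover path
otherwise.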

The fix for this APAR is currently targeted for inclusion in
fix packs 7.0.0.27, 8.0.0.6, and 8.5.0.2.  Please refer to the
Recommended Updates page for delivery information:
http://www.ibm.com/support/docview.wss?rs=180&uid=swg27004980

Temporary fix

Comments

APAR Information

APAR number:             PM64875
Reported component name: WAS SIB & SIBWS
Reported component ID:   620800101
Reported release:        300
Status:                  CLOSED PER
PE:                      NoPE
HIPER:                   NoHIPER
Special Attention:       NoSpecatt
Submitted date:          2012-05-17
Closed date:             2012-10-04
Last modified date:      2012-10-04

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

PM93758

Fix information

Fixed component name: WAS SIB & SIBWS
Fixed component ID:   620800101

Applicable component levels

R300 PSY
UP
R800 PSY
UP

Document Information

Modified date:
28 October 2021
