IBM Support

PM74855: HUNG ENDPOINT SERVER CAUSES WSGRID FUNCTION IN COMPUTE GRID JOB SCHEDULER SERVER TO STOP SENDING OUTPUT.

APAR status

  • Closed as program error.

Error description

  • Customer was running Compute Grid 8.0 in their production
    environment. The jobs in this environment are triggered via
    WSGrid, and they observed that several jobs had active WSGrid
    sessions that were not reflected as jobs in the JMC. Only one
    job was in the executing state in the JMC, but its joblog
    indicated that it should have been in a different state.
    Once they cycled the endpoint appserver that this job had run
    on, the normal flow of jobs through the environment via WSGrid
    resumed.
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  All users of WebSphere Extended Deployment  *
    *                  Compute Grid Version 8.                     *
    ****************************************************************
    * PROBLEM DESCRIPTION: Job log streaming for jobs submitted    *
    *                      via WSGrid (e.g. via an external        *
    *                      scheduler) appears to stop due to a     *
    *                      hang or a slowdown on an endpoint       *
    *                      server executing already-submitted      *
    *                      (via WSGrid) jobs.                      *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    The problem can happen when an endpoint server executing
    jobs dispatched via the WSGrid interface experiences a
    slowdown, e.g. because it is thrashing en route to running out
    of memory.
    If an endpoint slows down enough, it might not respond to
    the scheduler's requests to receive job log updates and status
    updates and send them back to the WSGrid client (e.g. the
    external scheduler).
    The scheduler threads may hang for a long time, waiting for the
    endpoint's response. Since there is only a single thread pool
    in the scheduler used to stream the output from all endpoints,
    this can lead to the situation where no output is received
    over the WSGrid interface at all, since all the relevant
    scheduler threads are hung waiting for output from a single
    bad server.
    However, the jobs submitted to the other (good) endpoints
    should still have executed normally in this scenario, even
    though their output is not handled properly and is not sent
    back to the WSGrid client.
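
The starvation described above can be reproduced in miniature. The following is a minimal sketch, not the Compute Grid scheduler code: the class name, method name, and the use of a latch to stand in for an unresponsive endpoint are all assumptions made for illustration. It shows how a fixed-size pool whose threads all block on one endpoint leaves work for a healthy endpoint waiting in the queue.

```java
import java.util.concurrent.*;

// Hypothetical illustration only; none of these names come from the product.
class PoolStarvationSketch {

    /** Returns true if output for a healthy endpoint arrives within waitMillis. */
    static boolean healthyOutputArrives(int poolSize, long waitMillis) {
        // One shared pool streams output for every endpoint.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Stands in for an endpoint that never answers while we are waiting.
        CountDownLatch hungEndpoint = new CountDownLatch(1);
        try {
            // Every pool thread blocks, without a timeout, on the same endpoint.
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    hungEndpoint.await();
                    return "late output";
                });
            }
            // Work for a healthy endpoint is queued behind the blocked threads.
            Future<String> healthy = pool.submit(() -> "job log line");
            try {
                healthy.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no thread was free to stream the healthy output
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            hungEndpoint.countDown(); // release the blocked tasks
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(healthyOutputArrives(2, 200));
    }
}
```

In this sketch the healthy task itself would finish instantly; it is delayed only because no pool thread is available to run it, which mirrors the symptom of jobs running normally while their output never reaches the client.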
    

Problem conclusion

  • The scheduler threads streaming output from the endpoint server
    for WSGrid-submitted jobs back to the WSGrid client will now
    time out rather than hang indefinitely. A single bad endpoint
    can therefore still slow down output streaming, but only in
    proportion to the number of jobs on that endpoint compared to
    the total jobs managed by this scheduler, rather than
    preventing streaming of all WSGrid output.
    The fix for this APAR is currently targeted for inclusion in
    fixpack 8.0.0.3. Please refer to the Recommended Updates page
    for delivery information:
    http://www.ibm.com/support/docview.wss?uid=swg27022998
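
The general idea of bounding each wait can be sketched as follows. This is an assumption-laden illustration, not the shipped fix: the class and method names, the use of a `BlockingQueue` to model an endpoint, and the timeout values are all hypothetical. Each pool thread gives up on an unresponsive endpoint after a bounded wait, so the thread returns to the pool and can serve other endpoints.

```java
import java.util.concurrent.*;

// Hypothetical illustration only; none of these names come from the product.
class TimedStreamingSketch {

    /** Polls one endpoint for a log line, giving up after timeoutMillis. */
    static String fetchLine(BlockingQueue<String> endpoint, long timeoutMillis) {
        try {
            String line = endpoint.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return line != null ? line : "<timed out>";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "<interrupted>";
        }
    }

    /** Returns the healthy endpoint's output even though the pool was first saturated. */
    static String healthyOutput(int poolSize, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        BlockingQueue<String> hung = new LinkedBlockingQueue<>();    // never produces output
        BlockingQueue<String> healthy = new LinkedBlockingQueue<>();
        healthy.add("job log line");
        try {
            // Saturate the pool with requests to the unresponsive endpoint.
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> fetchLine(hung, timeoutMillis));
            }
            // This request is delayed by at most one timeout period, not forever.
            Future<String> result = pool.submit(() -> fetchLine(healthy, timeoutMillis));
            return result.get(timeoutMillis * 10, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(healthyOutput(2, 100));
    }
}
```

The design trade-off matches the one described above: the bad endpoint still costs the pool one timeout period per request sent to it, but it can no longer hold every thread indefinitely.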
    

Temporary fix

Comments

APAR Information

  • APAR number

    PM74855

  • Reported component name

    WXD COMPUTE GRI

  • Reported component ID

    5725C9301

  • Reported release

    800

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2012-10-11

  • Closed date

    2013-01-02

  • Last modified date

    2013-04-19

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    PM75190

Fix information

  • Fixed component name

    WXD COMPUTE GRI

  • Fixed component ID

    5725C9301

Applicable component levels

  • R800 PSY

       UP

Document Information

Modified date:
29 October 2021