Sterling OMS Health Monitor

In this post we look at how Sterling OMS uses its health monitor process.

Database table name	YFS_HEARTBEAT
Command to start the health monitor	startHealthMonitor.cmd

Many tools are available on the market for application health monitoring and reporting. Here are a few of the commonly used ones.

Anturis
Dynatrace
AppDynamics
TraceView
Boundary

The next question that comes to mind: why do we need a health monitor in OMS when so many tools are available on the market?

The Sterling OMS health monitor process is mainly implemented for communication between the OMS servers (Application/Agent/Integration). When a user clears the cache with the Clear Cache button (System > Launch System Console > Clear Cache), the system needs to find all the running/active servers and inform them about the cache clear.

Before getting into the details, let's first understand the columns and use of the YFS_HEARTBEAT table.

When any server (Application/Agent/Integration) starts, an entry is made in YFS_HEARTBEAT with status code "00" (Running).

Column Name	Data Type	Description
HEARTBEAT_KEY	Char (24)	The primary key for the YFS_HEARTBEAT table.
LAST_HEARTBEAT	DateTime	The timestamp of the last heartbeat.
SERVICE_NAME	Varchar2 (100)	The service, agent or component that collects and stores the statistics.
SERVER_NAME	Varchar2 (100)	A unique name to identify a server
SERVER_TYPE	Varchar2 (40)	The type of the server. For example, the server type can be AGENT, INTEGRATION or APPSERVER.
SERVER_ID	Varchar2 (100)	The identifier associated with the server.
STATUS	Varchar2 (40)	The status associated with this server. The valid values are: • 00: RUNNING • 01: STOPPED • 02: TERMINATE
THREADS_CONFIGURED	Number (5,0)	The number of threads configured for the server.
ACTIVE_THREADS	Number (5,0)	The number of active threads in the server.
HOST_NAME	Varchar2 (100)	The host name on which the server is running.
SERVER_START_TIME	DateTime	The timestamp at which the server started.
PERCENT_CACHE_USED	Number (15,2)	The percentage of cache used.
RMI_OBJECT	BLOB	RMI object for the agent servers.

A few columns in this table are especially important:

Server start time : when exactly the server started
Last heartbeat : the time when the server last reported its status to the health monitor
Host name : the host on which the server was started
Server type : AGENT, INTEGRATION or APPSERVER
RMI_OBJECT : the RMI object for the server, used for server-to-server calls

RMI?

Remote Method Invocation. Yes, you read that correctly: Sterling uses RMI calls to communicate between servers.

RMI (Remote Method Invocation) is an API that provides a mechanism for building distributed applications in Java. It allows an object to invoke methods on an object running in another JVM.
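As a rough illustration of the idea (a Python sketch, not Sterling code and not real RMI): the clear-cache broadcast amounts to reading the heartbeat rows, keeping the ones with status 00, and calling each of those servers. The `notify` callback is a hypothetical stand-in for the remote call, and the server names are made up for the example.

```python
RUNNING = "00"  # status code of a running server in YFS_HEARTBEAT


def broadcast_clear_cache(heartbeat_rows, notify):
    """Call notify(row) for every running server and return their names.

    heartbeat_rows: iterable of dicts with SERVER_NAME and STATUS keys.
    notify: hypothetical stand-in for the remote (RMI) clear-cache call.
    """
    notified = []
    for row in heartbeat_rows:
        if row["STATUS"] != RUNNING:
            continue  # stopped or terminated servers are skipped
        notify(row)
        notified.append(row["SERVER_NAME"])
    return notified


# Illustrative rows only; real rows come from the YFS_HEARTBEAT table.
rows = [
    {"SERVER_NAME": "AppServer1", "STATUS": "00"},
    {"SERVER_NAME": "ScheduleAgent", "STATUS": "01"},
    {"SERVER_NAME": "IntegServer1", "STATUS": "00"},
]
calls = []
result = broadcast_clear_cache(rows, calls.append)
```

This is why stale rows matter: every row still marked 00 is a server the broadcast will try to reach.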

How is an entry made in the YFS_HEARTBEAT table?

yfs.properties configuration related to the health monitor

Property Name	Description
rmi.portrange	In a deployment with servers in two different network zones, the firewall between them must be configured to allow Remote Method Invocation (RMI) communication between them.
yantra.hm.purge.interval	Health monitor purge interval in days. System default value used for purging heartbeat, Snapshot, and page cache records. If this value is not specified, the default value is 30 days.
yantra.statistics.persist.interval	Property to determine statistics logging time interval. Valid values for minutes (M/m) = 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30 Valid values for hours (H/h) = 1, 2, 3, 4, 6, 8, or 12 Default = 10m
yfs.heartbeat.refresh.interval	Interval at which each server refreshes its heartbeat record. Valid values for minutes (M/m) = 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30 Valid values for hours (H/h) = 1, 2, 3, 4, 6, 8, or 12 Default = yantra.statistics.persist.interval / 2

So here is the important point:

By default, the heartbeat refresh interval is set to yantra.statistics.persist.interval / 2. With the default persist interval of 10m, that is 5 minutes.
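The arithmetic can be sketched in Python as follows. This is only an illustration of the rule stated above, using the valid values listed in the table; the function names are made up for the example, and how Sterling itself parses or rounds these values is not covered by this post.

```python
VALID_MINUTES = {1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30}
VALID_HOURS = {1, 2, 3, 4, 6, 8, 12}


def interval_to_minutes(value):
    """Convert an interval such as '10m' or '2H' to minutes."""
    value = value.strip()
    number, unit = int(value[:-1]), value[-1].lower()
    if unit == "m" and number in VALID_MINUTES:
        return number
    if unit == "h" and number in VALID_HOURS:
        return number * 60
    raise ValueError(f"not a valid interval: {value!r}")


def default_refresh_minutes(persist_interval="10m"):
    """Default heartbeat refresh: half of the statistics persist interval."""
    return interval_to_minutes(persist_interval) / 2


refresh = default_refresh_minutes()       # persist interval left at 10m
refresh_1h = default_refresh_minutes("1h")
```

With the persist interval left at its default of 10m, the heartbeat refresh works out to 5 minutes.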

YFS_HEARTBEAT table record entry and update

As soon as a server starts, a record is inserted into the YFS_HEARTBEAT table with status code 00 (Running).
Based on the refresh interval parameter (every 5 minutes by default), the LAST_HEARTBEAT column is updated.
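The record lifecycle described above can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not the Sterling implementation: the `Heartbeat` class, `start_server` and `tick` are hypothetical names, and only a handful of the table's columns are represented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Heartbeat:
    server_name: str
    server_start_time: datetime
    last_heartbeat: datetime
    status: str = "00"  # 00 = Running


def start_server(name, now):
    """On startup a row is created with status 00 and both timestamps set."""
    return Heartbeat(name, server_start_time=now, last_heartbeat=now)


def tick(hb, now, refresh=timedelta(minutes=5)):
    """Refresh LAST_HEARTBEAT once a full refresh interval has elapsed."""
    if now - hb.last_heartbeat >= refresh:
        hb.last_heartbeat = now
        return True
    return False


t0 = datetime(2017, 9, 30, 1, 0, 0)
hb = start_server("Agent1", t0)
too_early = tick(hb, t0 + timedelta(minutes=3))  # interval not yet elapsed
on_time = tick(hb, t0 + timedelta(minutes=5))    # heartbeat is refreshed
```

A server that is killed simply stops calling `tick`, so its row keeps status 00 with an ageing LAST_HEARTBEAT.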

Clear Cache Process

How does startHealthMonitor.cmd work?

#	Operation	Query	Remarks
1	Delete records from the YFS_HEARTBEAT table whose status is not Running (00) and whose MODIFYTS is more than 30 days old	DELETE FROM YFS_HEARTBEAT WHERE STATUS != '00' AND MODIFYTS < {ts '2017-08-31 01:24:51'}	The 30 days come from yantra.hm.purge.interval
2	Delete records from YFS_SNAPSHOT whose MODIFYTS is more than 30 days old	DELETE FROM YFS_SNAPSHOT WHERE MODIFYTS < {ts '2017-08-31 01:24:52'}	The 30 days come from yantra.hm.purge.interval
3	Delete records from the PLT_PAGED_DATA table whose LAST_ACCESSED is more than 30 days old	DELETE /*YANTRA*/ FROM PLT_PAGED_DATA WHERE LAST_ACCESSED < ?	The 30 days come from yantra.hm.purge.interval
4	Select heartbeat records with status 00 whose last heartbeat has not been updated for the past N minutes (10)	SELECT /*YANTRA*/ YFS_HEARTBEAT.* FROM YFS_HEARTBEAT YFS_HEARTBEAT WHERE STATUS = '00' AND LAST_HEARTBEAT < 2017-09-30T01:23:03	Uses yantra.statistics.persist.interval (10 min). With a current time of 2017-09-30 01:33:03, the condition is LAST_HEARTBEAT < current time minus 10 minutes
5	From the previous query result, take each heartbeat key and select the row for update	SELECT /*YANTRA*/ YFS_HEARTBEAT.* FROM YFS_HEARTBEAT YFS_HEARTBEAT WHERE (YFS_HEARTBEAT.HEARTBEAT_KEY = '2017083118085025272') FOR UPDATE NOWAIT
6	Update the status to Terminate (02) for the selected heartbeat key	UPDATE /*YANTRA*/ YFS_HEARTBEAT SET STATUS = '02', MODIFYUSERID = 'HM', MODIFYPROGID = 'HM', MODIFYTS = {ts '2017-09-30 01:33:22'}, LOCKID = 170 WHERE LOCKID = ? AND HEARTBEAT_KEY = ?

The steps above help keep the YFS_HEARTBEAT table limited to active records.
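To see steps 1, 4 and 6 working together, here is a small Python simulation against an in-memory SQLite table. It is a sketch under assumptions, not what startHealthMonitor.cmd actually runs: the table is cut down to four columns, the row keys are made up, steps 2 and 3 are left out, and the SELECT ... FOR UPDATE NOWAIT lock of step 5 is skipped because SQLite has no such clause.

```python
import sqlite3
from datetime import datetime, timedelta


def fmt(ts):
    """Render a timestamp as text that sorts chronologically."""
    return ts.isoformat(sep=" ", timespec="seconds")


def run_health_monitor_pass(conn, now, purge_days=30, stale_minutes=10):
    """Purge old non-running rows, then mark stale running rows as 02."""
    cur = conn.cursor()
    # Step 1: delete non-running rows older than the purge interval.
    cur.execute(
        "DELETE FROM YFS_HEARTBEAT WHERE STATUS != '00' AND MODIFYTS < ?",
        (fmt(now - timedelta(days=purge_days)),),
    )
    purged = cur.rowcount
    # Step 4: find running rows whose heartbeat is older than N minutes.
    cur.execute(
        "SELECT HEARTBEAT_KEY FROM YFS_HEARTBEAT "
        "WHERE STATUS = '00' AND LAST_HEARTBEAT < ?",
        (fmt(now - timedelta(minutes=stale_minutes)),),
    )
    stale = [row[0] for row in cur.fetchall()]
    # Step 6: mark each stale row as terminated (02).
    for key in stale:
        cur.execute(
            "UPDATE YFS_HEARTBEAT SET STATUS = '02', MODIFYTS = ? "
            "WHERE HEARTBEAT_KEY = ?",
            (fmt(now), key),
        )
    conn.commit()
    return purged, stale


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YFS_HEARTBEAT (HEARTBEAT_KEY TEXT PRIMARY KEY, "
    "STATUS TEXT, LAST_HEARTBEAT TEXT, MODIFYTS TEXT)"
)
conn.executemany(
    "INSERT INTO YFS_HEARTBEAT VALUES (?, ?, ?, ?)",
    [
        ("A", "00", "2017-09-30 01:30:00", "2017-09-30 01:30:00"),  # healthy
        ("B", "00", "2017-09-30 01:10:00", "2017-09-30 01:10:00"),  # killed
        ("C", "01", "2017-08-01 00:00:00", "2017-08-01 00:00:00"),  # old stop
        ("D", "01", "2017-09-29 00:00:00", "2017-09-29 00:00:00"),  # new stop
    ],
)
purged, stale = run_health_monitor_pass(conn, datetime(2017, 9, 30, 1, 33, 3))
statuses = dict(conn.execute("SELECT HEARTBEAT_KEY, STATUS FROM YFS_HEARTBEAT"))
```

In this run the old stopped row C is purged, the killed server B moves from 00 to 02, and the healthy row A and the recently stopped row D are left alone.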

Questions

1. What happens if we stop the agent/application server (Control + C in a Windows command prompt)?

Answer : The status of the record is updated to 01 (Stopped). The health monitor deletes the record later, once it is older than the purge interval.

2. What happens if we kill the agent/application server?

Answer : The record stays in status 00 (Running). The health monitor finds that the last heartbeat has not been updated for some time, so it changes the status to 02 (Terminate); the record is removed later by the purge.

3. Will we be able to trigger an email when a server terminates unexpectedly?

Answer : See below configuration

4. Can we use other monitoring tools and stop using the OMS health monitor?

Answer : No. If records are not cleaned up in the YFS_HEARTBEAT table, too many stale entries slow down processing. The OMS health monitor should be enabled and used for effective internal communication. We can use other monitoring tools for CPU usage, disk space and checking that servers are up and running.

5. How do we change the thresholds for the application server, API and agent server?

Application Server: yantra.hm.appserver.threshold (yfs.properties)
API: yantra.hm.api.threshold (yfs.properties)
Agent/Integration Server: yantra.hm.agent.threshold (yfs.properties)

You can also modify them from the System Management Console.

Please share your feedback on this post. If you have any questions, please comment below or email us directly at support@activekite.com.

Happy Learning !!!!

16 thoughts on “Sterling OMS Health Monitor”

nimesh_nagar 10/01/2017

Well written and extremely useful. Thanks!

Rajendra Shihare 10/09/2017

Good Explanation! Keep it up!

ravi 10/09/2017

Very Nice explanation

ankusharora1990 11/10/2017

Good One

1. admin (Post author) 11/11/2017
  
  Thanks Ankush! We look forward to publishing more posts.
  
mnshaikna 01/14/2018

Hello admin,
Nice and useful Post – Thanks.
But have a small doubt.

“How to change thresholds for Application server, api, agent server ?”
What thresholds do you mean here?? Please help..

1. admin (Post author) 01/23/2018
  
  Thresholds can be changed from System > Launch System Management: click the server image under the application hosts section or the agent/integration server group. A threshold is simply the average response time, or the number of tasks that can be processed in a given time.
  
  For example admin server can have threshold of 0.20 sec
  Agent Servers can have 10,000 tasks as threshold
  Adjust Inventory API can have threshold of 8.0 sec
  
  Hope this helps
  
ravi 01/18/2018

how do we know about which cache table which application are using ?

1. admin (Post author) 01/23/2018
  
  Not directly. We learn it from experience. The API documentation tells us which API uses which tables, but it does not list all the cached tables. Cached table information can be found in the logs.
  
  1. Satheeshkumar Thangaraj 03/28/2019
    
    There is an alternative way to this I suppose.
    dbClassCache.properties file has all the tables enabled for caching.
    Any custom tables (who results need to be cached) can be enabled by including the dbClassCache related properties for the custom tables in customer_overrides.properties.
    
    Something like this..
    
    dbclassCache..enabled=true
    CUSTOM_TABLE_NAME.class=com.yantra.shared.dbclasses.DBCacheHome
    
Praveen 03/24/2019

Here, We need to create HMAlert user for the health monitor agent, but my question is to know whether the user should be an Active on Inactive? with LDAP or without LDAP integration to make it work?

Thanks & Regards

Oliver 02/06/2020

Can i change number of agent threads dynamically?

1. admin (Post author) 04/29/2020
  
  We are not aware whether the thread count can be changed dynamically. Others, please help if you know.
  
2. Prveen Reddy 01/27/2024
  
  yes you can do,
  
  But you need to stop and start server , every time you change thereads
  
Kumar 08/18/2020

If we delete the yfs.hearbeat table from sql, while app/agent/intigration services are in running. Dose the yfs.heartbeat table update its self on next heart beat ? Or app/agent/ services has to be restarted ? Or any other way to update table without stopping/starting services?

1. admin (Post author) 08/28/2020
  
  This is a good question. As per our understanding, if we delete the record, a new one will be created at the next heartbeat. I am not sure there is an option to update it through an API (we assume not), because the heartbeat is internal to OMS. We don't see a business case for updating the heartbeat time manually.
  