Sterling OMS Health Monitor
In this post we are going to see how Sterling OMS uses the health monitor process.
Database Table name | YFS_HEARTBEAT |
Command To start health Monitor | startHealthMonitor.cmd |
Various tools are available in the market for application health monitoring and reporting. Here are a few of the commonly used ones:
- Anturis
- Dynatrace
- AppDynamics
- TraceView
- Boundary
Now the next question comes to mind: why do we need a health monitor in OMS when so many tools are available in the market?
The Sterling OMS health monitor process is mainly implemented for communication between the OMS servers (Application/Agent/Integration). When a user clears the cache with the Clear Cache button (System > Launch System Console > Clear Cache), the system needs to find all the running/active servers and inform them about the cache clear.
Before getting into the details, let's first understand the columns and use of the YFS_HEARTBEAT table.
When any server (Application / Agent / Integration) is started, an entry is made in YFS_HEARTBEAT with status code "00" (Running).
Column Name | Data Type | Description |
HEARTBEAT_KEY | Char (24) | The primary key for the YFS_HEARTBEAT table. |
LAST_HEARTBEAT | DateTime | The timestamp of the last heartbeat. |
SERVICE_NAME | Varchar2 (100) | The service, agent or component that collects and stores the statistics. |
SERVER_NAME | Varchar2 (100) | A unique name to identify a server |
SERVER_TYPE | Varchar2 (40) | The type of the server. For example, the server type can be AGENT, INTEGRATION or APPSERVER. |
SERVER_ID | Varchar2 (100) | The identifier associated with the server. |
STATUS | Varchar2 (40) | The status associated with this server. The valid values are: • 00: RUNNING • 01: STOPPED • 02: TERMINATE |
THREADS_CONFIGURED | Number (5,0) | The number of threads configured for the server. |
ACTIVE_THREADS | Number (5,0) | The number of active threads in the server. |
HOST_NAME | Varchar2 (100) | The host name on which the server is running. |
SERVER_START_TIME | DateTime | The time stamp of the agent server start time. |
PERCENT_CACHE_USED | Number (15,2) | The percentage of cache used. |
RMI_OBJECT | BLOB | The RMI object for the agent servers. |
In this table we have a few important columns:
- SERVER_START_TIME: when exactly the server started
- LAST_HEARTBEAT: the time when the server last communicated its status back to the health monitor
- HOST_NAME: where exactly the server started
- SERVER_TYPE: AGENT, INTEGRATION or APPSERVER
- RMI_OBJECT: the RMI object for the agent servers
RMI?
Remote Method Invocation. Yes, you read that correctly. Sterling uses RMI calls to communicate between servers.
RMI (Remote Method Invocation) is an API that provides a mechanism for creating distributed applications in Java. RMI allows an object to invoke methods on an object running in another JVM.
How is an entry made into the YFS_HEARTBEAT table?
yfs.properties configuration related to the Health Monitor
Property Name | Description |
rmi.portrange | In a deployment with servers in two different network zones, the firewall between them must be configured to allow Remote Method Invocation (RMI) communication between them. |
yantra.hm.purge.interval | Health monitor purge interval in days. System default value used for purging heartbeat, snapshot, and page cache records. If this value is not specified, the default value is 30 days. |
yantra.statistics.persist.interval | Property to determine the statistics logging time interval. Valid values for minutes (M/m) = 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30. Valid values for hours (H/h) = 1, 2, 3, 4, 6, 8, or 12. Default = 10m |
yfs.heartbeat.refresh.interval | Valid values for minutes (M/m) = 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30. Valid values for hours (H/h) = 1, 2, 3, 4, 6, 8, or 12. Default = 10m |
So here is the important point:
By default, the heartbeat refresh interval is set to yantra.statistics.persist.interval / 2.
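To make the arithmetic concrete, here is a minimal Python sketch of how an interval string such as 10m could be parsed and halved. This is not Sterling code: the function names `parse_interval_seconds` and `default_refresh_seconds` are hypothetical, and the only facts taken from the post are the valid values, the 10m default and the "divide by 2" rule.

```python
import re

# Valid values as listed in the property table in this post.
VALID_MINUTES = {1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30}
VALID_HOURS = {1, 2, 3, 4, 6, 8, 12}


def parse_interval_seconds(value: str) -> int:
    """Convert a string such as '10m' or '2H' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([mMhH])", value.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    amount = int(match.group(1))
    unit = match.group(2).lower()
    if unit == "m":
        if amount not in VALID_MINUTES:
            raise ValueError(f"invalid minute value: {amount}")
        return amount * 60
    if amount not in VALID_HOURS:
        raise ValueError(f"invalid hour value: {amount}")
    return amount * 3600


def default_refresh_seconds(persist_interval: str = "10m") -> int:
    """Heartbeat refresh default = statistics persist interval / 2."""
    return parse_interval_seconds(persist_interval) // 2


if __name__ == "__main__":
    # With the default persist interval of 10m this gives 300 seconds (5 minutes).
    print(default_refresh_seconds())
```

With the default persist interval of 10 minutes, the sketch yields 300 seconds, which matches the 5-minute heartbeat update mentioned in this post.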
YFS_HEARTBEAT table record entry and update
- As soon as a server starts, a record is inserted into the YFS_HEARTBEAT table with status code 00 (running).
- Based on the refresh interval parameter (every 5 minutes by default), the LAST_HEARTBEAT column is updated.
Clear Cache Process
How does startHealthMonitor.cmd work?
# | Operation | Query | Remarks |
1 | Delete records from the YFS_HEARTBEAT table where the status is not running (00) and MODIFYTS is more than 30 days old | Delete from YFS_HEARTBEAT Where STATUS != '00' AND MODIFYTS < {ts '2017-08-31 01:24:51'} | The 30 days come from yantra.hm.purge.interval |
2 | Delete records from YFS_SNAPSHOT where MODIFYTS is more than 30 days old | Delete from YFS_SNAPSHOT Where MODIFYTS < {ts '2017-08-31 01:24:52'} | The 30 days come from yantra.hm.purge.interval |
3 | Delete records from the PLT_PAGED_DATA table where LAST_ACCESSED is more than 30 days old | DELETE /*YANTRA*/ FROM PLT_PAGED_DATA WHERE LAST_ACCESSED < ? | The 30 days come from yantra.hm.purge.interval |
4 | Select records from the heartbeat table with status 00 whose last heartbeat has not been updated for the past N minutes (10) | SELECT /*YANTRA*/ YFS_HEARTBEAT.* FROM YFS_HEARTBEAT YFS_HEARTBEAT WHERE STATUS = '00' AND LAST_HEARTBEAT < 2017-09-30T01:23:03 | Uses yantra.statistics.persist.interval (10 min). Current time: 2017-09-30 01:33:03; LAST_HEARTBEAT < current time - 10 minutes |
5 | For each heartbeat key in the previous query result, do a select for update | SELECT /*YANTRA*/ YFS_HEARTBEAT.* FROM YFS_HEARTBEAT YFS_HEARTBEAT WHERE (YFS_HEARTBEAT.HEARTBEAT_KEY = '2017083118085025272') FOR UPDATE NOWAIT | |
6 | Update the status to terminated (02) for the selected heartbeat key | update /*YANTRA*/ YFS_HEARTBEAT set STATUS = '02',MODIFYUSERID = 'HM', MODIFYPROGID = 'HM', MODIFYTS = {ts '2017-09-30 01:33:22'}, LOCKID=170 WHERE LOCKID = ? AND HEARTBEAT_KEY= ? | |
The above steps help to maintain the active records in the YFS_HEARTBEAT table.
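The heartbeat part of these steps (1, 4, 5 and 6) can be sketched in Python as an in-memory simulation. This is only an illustration under stated assumptions, not how Sterling implements it: the `Heartbeat` class and the `sweep` function are hypothetical names, the rows live in a list instead of the database, and the row locking of step 5 is not modelled. The status codes, the 30-day purge interval and the 10-minute staleness window are taken from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Status codes as described for the STATUS column of YFS_HEARTBEAT.
RUNNING, STOPPED, TERMINATED = "00", "01", "02"


@dataclass
class Heartbeat:
    """Hypothetical in-memory stand-in for one YFS_HEARTBEAT row."""
    key: str
    status: str
    last_heartbeat: datetime
    modifyts: datetime


def sweep(rows, now, purge_days=30, stale_minutes=10):
    """Simulate one health monitor pass over the heartbeat rows."""
    purge_cutoff = now - timedelta(days=purge_days)
    stale_cutoff = now - timedelta(minutes=stale_minutes)

    # Step 1: purge rows that are not running and older than the purge interval.
    kept = [
        row for row in rows
        if not (row.status != RUNNING and row.modifyts < purge_cutoff)
    ]

    # Steps 4-6: running rows with an outdated heartbeat are marked terminated.
    for row in kept:
        if row.status == RUNNING and row.last_heartbeat < stale_cutoff:
            row.status = TERMINATED
            row.modifyts = now
    return kept


if __name__ == "__main__":
    now = datetime(2017, 9, 30, 1, 33, 3)
    rows = [
        Heartbeat("healthy", RUNNING, now - timedelta(minutes=2), now),
        Heartbeat("killed", RUNNING, now - timedelta(minutes=25), now),
        Heartbeat("old", STOPPED, now - timedelta(days=45), now - timedelta(days=45)),
    ]
    for row in sweep(rows, now):
        print(row.key, row.status)
```

In this sketch a server that was killed keeps status 00 until the sweep notices its outdated LAST_HEARTBEAT, while a cleanly stopped server is only removed once its MODIFYTS is older than the purge interval.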
Questions
1. What will happen if we stop (Control + C in the Windows command prompt) the agent/application server?
Answer: The status of the record will be updated to 01 (Stopped); the next time the health monitor picks up this record, it gets deleted.
2. What will happen if we kill the agent/application server?
Answer: The record will stay in status 00 (running). The health monitor agent finds that the last heartbeat has not been updated for some time, so the status is changed to 02 (terminated) and the record is later removed.
3. Will we be able to trigger an email when a server terminates unexpectedly?
Answer: See below configuration
4. Can we use other monitoring tools and stop using the OMS Health Monitor?
Answer: No. If records are not cleaned up in the YFS_HEARTBEAT table, too many stale entries cause slowness in processing. The OMS Health Monitor should be enabled and used for effective internal communication. We can use other monitoring tools for CPU usage, disk space and checking that servers are up and running.
5. How do we change thresholds for the application server, API and agent server?
- Application Server: yantra.hm.appserver.threshold (yfs.properties)
- API: yantra.hm.api.threshold (yfs.properties)
- Agent/Integration Server: yantra.hm.agent.threshold (yfs.properties)
Additionally, you can modify them from the System Management Console as well.
Please share your feedback on this post. If you have any questions, please comment below or email us directly at support@activekite.com.
Happy Learning !!!!
Well written and extremely useful. Thanks!
Good Explanation! Keep it up!
Very Nice explanation
Good One
Thanks Ankush !!! We look forward to sharing more posts.
Hello admin,
Nice and useful Post – Thanks.
But have a small doubt.
“How to change thresholds for Application server, api, agent server ?”
What thresholds do you mean here?? Please help..
The threshold can be changed from System > Launch System Management: click on the server image under the application hosts section or the agent/integration server group. The threshold is nothing but the average response time, or the number of tasks that can be processed in a given time.
For example admin server can have threshold of 0.20 sec
Agent Servers can have 10,000 tasks as threshold
Adjust Inventory API can have threshold of 8.0 sec
Hope this helps
How do we know which cached tables each application is using?
Not directly. We get to know it through experience. Which API uses which tables can be found in the API documentation, but that does not have all the cached table information. Cached table information can be found via the logs.
There is an alternative way to do this, I suppose.
The dbClassCache.properties file lists all the tables enabled for caching.
Any custom tables (whose results need to be cached) can be enabled by including the dbClassCache-related properties for the custom tables in customer_overrides.properties.
Something like this..
dbclassCache..enabled=true
CUSTOM_TABLE_NAME.class=com.yantra.shared.dbclasses.DBCacheHome
Here, we need to create the HMAlert user for the health monitor agent, but my question is whether the user should be active or inactive, and with or without LDAP integration, to make it work?
Thanks & Regards
Can I change the number of agent threads dynamically?
We are not aware whether the thread count can be changed dynamically. Others, please help if you know.
Yes, you can,
but you need to stop and start the server every time you change the threads.
If we delete the rows of the YFS_HEARTBEAT table via SQL while the app/agent/integration services are running, does the YFS_HEARTBEAT table update itself on the next heartbeat? Or do the app/agent services have to be restarted? Or is there any other way to update the table without stopping/starting services?
This is a good question. As per our understanding, if we delete a record, a new one will be created the next time. I am not sure we have an option to update it with an API (assuming not), because the heartbeat is internal to OMS. We don't see a business case for updating the heartbeat time manually.