{"id":3025,"date":"2011-07-12T14:22:13","date_gmt":"2011-07-12T12:22:13","guid":{"rendered":"https:\/\/ingmarverheij.com\/2011\/07\/failed-heartbeat-unnoticed-in-distributed-application\/"},"modified":"2011-07-12T14:22:13","modified_gmt":"2011-07-12T12:22:13","slug":"failed-heartbeat-unnoticed-in-distributed-application","status":"publish","type":"post","link":"https:\/\/ingmarverheij.com\/en\/failed-heartbeat-unnoticed-in-distributed-application\/","title":{"rendered":"Failed heartbeat unnoticed in Distributed Application"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-bottom: 0px; border-left: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; float: left; border-top: 0px; border-right: 0px; padding-top: 0px\" title=\"Server down\" border=\"0\" alt=\"Server down\" align=\"left\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Server-down.jpg\" width=\"95\" height=\"70\" \/><\/p>\n<p>System Center Operations Manager (SCOM) monitors the health of systems with an agent. One of the most basic checks that is executed is a <strong>health check<\/strong> of the agent itself. One of the checks is a <strong>heartbeat<\/strong> between the agent and the RMS (Root Management Server). 
If the heartbeat is missed three times (configurable), the agent is considered <strong>unavailable<\/strong>.<a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Health-Service-Heartbeat-Failure.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-right-width: 0px; margin: 0px 0px 0px 5px; padding-left: 0px; padding-right: 0px; display: inline; float: right; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px\" title=\"Health Service Heartbeat Failure\" border=\"0\" alt=\"Health Service Heartbeat Failure\" align=\"right\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Health-Service-Heartbeat-Failure_thumb.png\" width=\"119\" height=\"131\" \/><\/a><\/p>\n<p>An <strong>alert<\/strong> is generated and (if configured) a notification is sent to inform the administrator that there is a problem.<\/p>\n<p>But if a <a href=\"https:\/\/technet.microsoft.com\/en-us\/library\/dd440870.aspx\" target=\"_blank\">Distributed Application<\/a> is configured to monitor a <strong>chain<\/strong> of components, this failure remains unnoticed. 
<\/p>\n<p><a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/State.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-right-width: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; float: left; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px\" title=\"Node state &#39;Healthy&#39;\" border=\"0\" alt=\"Node state &#39;Healthy&#39;\" align=\"left\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/State_thumb.png\" width=\"129\" height=\"50\" \/><\/a><\/p>\n<p>Nodes that are unmonitored are <strong>grey<\/strong> and appear to be \u2018<strong>Healthy&#8217;<\/strong>, which is strange for a node whose heartbeat hasn\u2019t been reported for quite some time.<\/p>\n<p><!--more--><\/p>\n<h4>Unnoticed heartbeat failure<\/h4>\n<p>Operations Manager assumes that if a node is <strong>unavailable<\/strong> because the <strong>heartbeat<\/strong> is lost, no <strong>child<\/strong> objects should be monitored. 
This is useful to <strong>prevent<\/strong> alerts from child objects, which are probably as unavailable as the parent, but it puts the whole node in an \u2018<strong>unmonitored\u2019<\/strong> state.<\/p>\n<p><a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Bad.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-bottom: 0px; border-left: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; float: left; border-top: 0px; border-right: 0px; padding-top: 0px\" title=\"Distributed application state &#39;Okay&#39;\" border=\"0\" alt=\"Distributed application state &#39;Okay&#39;\" align=\"left\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Bad_thumb.png\" width=\"129\" height=\"122\" \/><\/a><\/p>\n<p>The effect of putting a node in an <strong>&#8216;unmonitored<\/strong> state\u2019 is that a parent node in a distributed application, containing one or more agents, <strong>doesn\u2019t<\/strong> check the health of the machine. In other words, if the heartbeat is lost, the parent node still reports its state as <strong>Okay<\/strong>.<a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Good.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-right-width: 0px; margin: 0px 0px 0px 5px; padding-left: 0px; padding-right: 0px; display: inline; float: right; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px\" title=\"Distributed application state &#39;Error&#39;\" border=\"0\" alt=\"Distributed application state &#39;Error&#39;\" align=\"right\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Good_thumb.png\" width=\"129\" height=\"112\" \/><\/a><\/p>\n<p>To <strong>prevent<\/strong> a parent node (containing one or more agents) from no longer monitoring the health of its child nodes when they are unmonitored, an <strong>override<\/strong> can be configured. 
With the override, the state of an unmonitored node can be <strong>configured<\/strong> to result in a <strong>warning<\/strong> or an <strong>error<\/strong>.<\/p>\n<h4>Configure override<\/h4>\n<p>The override should be configured in the node that contains the agents. As an example I\u2019ve created a Distributed Application with the name \u2018Test\u2019 that contains a node \u2018Application Servers\u2019. This node contains two agents: VCTX101 and VCTX110. <\/p>\n<p><a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Distributed-Application.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-bottom: 0px; border-left: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top: 0px; border-right: 0px; padding-top: 0px\" title=\"Distributed Application\" border=\"0\" alt=\"Distributed Application\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Distributed-Application_thumb.png\" width=\"554\" height=\"407\" \/><\/a><\/p>\n<p>Select the node and click \u2018<strong>Configure Health Rollup\u2019<\/strong>; here you can configure <strong>overrides<\/strong> for the node.<\/p>\n<p>At the bottom you will find an override for the monitor \u2018Monitoring unavailable\u2019. The <strong>default<\/strong> option is \u2018Monitoring Unavailable\u2019, which <strong>prevents<\/strong> an unmonitored node from affecting the state of the parent node. 
By <strong>enabling<\/strong> an override and setting the value to \u2018Rollup monitoring unavailable to <strong>error\u2019<\/strong>, an unmonitored node will place the parent node in an <strong>error state<\/strong>.<\/p>\n<p>&#160;<\/p>\n<p><a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Override-Monitoring-Unavailable-Monitoring-Unavailable.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-bottom: 0px; border-left: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; float: left; border-top: 0px; border-right: 0px; padding-top: 0px\" title=\"Override - Monitoring Unavailable - Monitoring Unavailable\" border=\"0\" alt=\"Override - Monitoring Unavailable - Monitoring Unavailable\" align=\"left\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Override-Monitoring-Unavailable-Monitoring-Unavailable_thumb.png\" width=\"279\" height=\"193\" \/><\/a><\/p>\n<p><a href=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Override-Monitoring-Unavailable-Rollup-monitoring-unavailable-as-error.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; border-bottom: 0px; border-left: 0px; margin: 0px 5px 0px 0px; padding-left: 0px; padding-right: 0px; display: inline; float: left; border-top: 0px; border-right: 0px; padding-top: 0px\" title=\"Override - Monitoring Unavailable - Rollup monitoring unavailable as error\" border=\"0\" alt=\"Override - Monitoring Unavailable - Rollup monitoring unavailable as error\" align=\"left\" src=\"https:\/\/ingmarverheij.com\/wp-content\/uploads\/2011\/07\/Override-Monitoring-Unavailable-Rollup-monitoring-unavailable-as-error_thumb.png\" width=\"279\" height=\"193\" \/><\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>System Center Operations Manager (SCOM) monitors the health of systems with an agent. 
One of the most basic checks that is executed is a health check of the agent itself. One of the checks is a heartbeat between the agent and the RMS (Root Management Server). If the heartbeat is lost for three times (configurable), [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-container-style":"default","site-container-layout":"default","site-sidebar-layout":"default","disable-article-header":"default","disable-site-header":"default","disable-site-footer":"default","disable-content-area-spacing":"default","footnotes":""},"categories":[309],"tags":[319,360,361,235],"class_list":["post-3025","post","type-post","status-publish","format-standard","hentry","category-monitoring-2","tag-distributed-application","tag-heartbeat","tag-override","tag-scom"],"_links":{"self":[{"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/posts\/3025","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/comments?post=3025"}],"version-history":[{"count":0,"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/posts\/3025\/revisions"}],"wp:attachment":[{"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/media?parent=3025"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/categories?post=3025"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ingmarverheij.com\/en\/wp-json\/wp\/v2\/tags?post=3025"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}