protection
Section: overload
Default Value: false
Valid Values: true, false
Changes Take Effect: Immediately
Introduced: 8.5.108
Controls whether overload protection is applied when Stat Server is in overload.
qos-default-overload-policy
Section: overload
Default Value: 0
Valid Values: 0, 1, 2
Changes Take Effect: After restart
Introduced: 8.5.108
Defines the global overload policy.
If this option is set to:
- 0 (zero) - sends and updates for requested statistics can be cut
- 1 - only sends of statistics to Stat Server clients can be cut
- 2 - nothing can be cut. Stat Server updates and sends all requested statistics.
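The three policy values can be read as a small lookup of which operations Stat Server is permitted to skip. The following Python sketch only illustrates the documented meaning of the values; the names `OverloadPolicy` and `allowed_cuts` are hypothetical and are not part of Stat Server.

```python
from enum import IntEnum


class OverloadPolicy(IntEnum):
    """Documented values of qos-default-overload-policy (names are illustrative)."""
    CUT_SENDS_AND_UPDATES = 0
    CUT_SENDS_ONLY = 1
    CUT_NOTHING = 2


def allowed_cuts(policy: int) -> dict:
    """Return which operations may be skipped under the given policy value."""
    p = OverloadPolicy(policy)  # raises ValueError for anything other than 0, 1, 2
    return {
        # Sends can be cut under policies 0 and 1.
        "sends": p in (OverloadPolicy.CUT_SENDS_AND_UPDATES,
                       OverloadPolicy.CUT_SENDS_ONLY),
        # Updates can be cut only under policy 0.
        "updates": p == OverloadPolicy.CUT_SENDS_AND_UPDATES,
    }


for value in (0, 1, 2):
    print(value, allowed_cuts(value))
```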
qos-recovery-enable-lms-messages
Section: overload
Default Value: false
Valid Values: true, false
Changes Take Effect: After restart
Introduced: 8.5.108
Enables the Standard-level, recovery-related log messages, which are intended for debugging purposes:
10072 “GCTI_SS_OVERLOAD_RECOVERY_STARTED - Overload recovery started on %s (%d current CPU usage)”
10073 “GCTI_SS_OVERLOAD_RECOVERY_FAILED - Overload recovery failed on %s (%d current CPU usage)”.
cut-debug-log
Section: overload
Default Value: true
Valid Values: true, false
Changes Take Effect: Immediately
Introduced: 8.5.108
Controls debug logging during overload. If set to true, debug logging is cut while Stat Server is in overload.
cpu-threshold-low
Section: overload
Default Value: 60
Valid Values: 0-100
Changes Take Effect: After restart
Introduced: 8.5.108
Defines the lower level of the main thread CPU utilization threshold, which signifies the start of the Stat Server recovery.
cpu-threshold-high
Section: overload
Default Value: 80
Valid Values: 0-100
Changes Take Effect: After restart
Introduced: 8.5.108
Defines the higher level of the main thread CPU utilization threshold, which signifies the start of the Stat Server overload.
cpu-poll-timeout
Section: overload
Default Value: 10
Valid Values: 1-60
Changes Take Effect: After restart
Introduced: 8.5.108
Defines, in seconds, how often the main thread CPU is polled.
cpu-cooldown-cycles
Section: overload
Default Value: 30
Valid Values: 1-100
Changes Take Effect: After restart
Introduced: 8.5.108
Defines the number of cpu-poll-timeout cycles in a cooldown period.
For example, if cpu-poll-timeout = 10 seconds and cpu-cooldown-cycles = 30, the cooldown period is 10 x 30 = 300 seconds. This means that the main thread CPU must stay below the value of the cpu-threshold-low option for 300 seconds; after this period, overload recovery is considered to be over.
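The same arithmetic can be expressed as a short Python sketch. The function name `cooldown_period_seconds` is hypothetical; the defaults and valid ranges are the ones listed for cpu-poll-timeout and cpu-cooldown-cycles.

```python
def cooldown_period_seconds(cpu_poll_timeout: int = 10,
                            cpu_cooldown_cycles: int = 30) -> int:
    """Cooldown period = polling interval (seconds) x number of cycles."""
    if not 1 <= cpu_poll_timeout <= 60:
        raise ValueError("cpu-poll-timeout must be in the range 1-60")
    if not 1 <= cpu_cooldown_cycles <= 100:
        raise ValueError("cpu-cooldown-cycles must be in the range 1-100")
    return cpu_poll_timeout * cpu_cooldown_cycles


print(cooldown_period_seconds())       # 300 (the default values)
print(cooldown_period_seconds(5, 12))  # 60
```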
allow-new-requests-during-overload
Section: overload
Default Value: true
Valid Values: true, false
Changes Take Effect: Immediately
Introduced: 8.5.108
Controls whether new requests can be made during the Stat Server overload.
allow-new-connections-during-overload
Section: overload
Default Value: true
Valid Values: true, false
Changes Take Effect: Immediately
Introduced: 8.5.108
Controls whether new clients can connect during the Stat Server overload.
Overload Protection
Starting with release 8.5.108, Stat Server supports overload protection.
Introduction
When and why to use overload protection?
The number of open statistics depends on client demand. The more statistics are open, or the more incoming events are received, the higher the Stat Server CPU consumption. The Stat Server application is not scalable and, in certain circumstances, it may start behaving unreliably (disconnecting clients, getting disconnected from servers, delaying computations).
Stat Server load is the percentage of CPU consumed by its main thread. It depends on the rate of incoming events and on the number (and parameters) of open statistics. Overload protection is a method of reducing CPU consumption in response to Stat Server overload. The load range is defined as [min, max]. The cooldown is a predefined period of time during which the load stays below min. Stat Server is in overload if the load has exceeded max and no cooldown has happened since then. Stat Server is in recovery if it is in overload and the current load is less than min.
Overload protection consists of the following load reducing measures:
- Measure 1. Cut debug logging, controlled by the settings of the cut-debug-log option.
- Measure 2. Stat Server cannot skip incoming events and always processes them. However, it can lower the quality of service for some statistics in order to reduce CPU consumption. It can also skip some operations in its processing pipeline: for some statistics, it may stop recalculating values and sending them.
- Measure 3. For some statistics, Stat Server may stop updating aggregates. Note that Measure 3 includes Measure 2.
As soon as Stat Server hits the predefined high CPU threshold, it enters the overload state. To leave that state, CPU usage must remain below the predefined low threshold for the predefined cooldown period.
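The overload, recovery, and cooldown definitions describe a hysteresis loop, which the following Python sketch models under stated assumptions. It is not Stat Server's implementation: the class `OverloadMonitor` is hypothetical, the comparisons (strictly above the high threshold, strictly below the low threshold) are assumptions, and the cooldown counter is assumed to reset whenever a sample is not below the low threshold.

```python
class OverloadMonitor:
    """Toy model of the overload/recovery states, fed one CPU sample per poll."""

    def __init__(self, low: int = 60, high: int = 80, cooldown_cycles: int = 30):
        if not 0 <= low <= high <= 100:
            raise ValueError("expected 0 <= low <= high <= 100")
        self.low = low                      # cpu-threshold-low
        self.high = high                    # cpu-threshold-high
        self.cooldown_cycles = cooldown_cycles  # cpu-cooldown-cycles
        self.in_overload = False
        self.cycles_below_low = 0

    def poll(self, cpu: int) -> str:
        """Return 'normal', 'overload', or 'recovery' after this sample."""
        if not self.in_overload:
            if cpu > self.high:             # assumption: strict comparison
                self.in_overload = True
                self.cycles_below_low = 0
                return "overload"
            return "normal"
        if cpu < self.low:                  # in overload and below min: recovery
            self.cycles_below_low += 1
            if self.cycles_below_low >= self.cooldown_cycles:
                self.in_overload = False    # full cooldown elapsed: overload ends
                self.cycles_below_low = 0
                return "normal"
            return "recovery"
        self.cycles_below_low = 0           # assumption: cooldown starts over
        return "overload"


monitor = OverloadMonitor(cooldown_cycles=3)  # short cooldown for the demo
samples = [50, 85, 70, 55, 55, 65, 55, 55, 55]
print([monitor.poll(cpu) for cpu in samples])
```

With a cooldown of three cycles, the sample at 65 interrupts the first recovery attempt, so three further consecutive samples below 60 are needed before the state returns to normal.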
The goal of overload protection is to skip the minimal number of statistic send and update operations needed to reduce CPU consumption to an acceptable level.
- CurrentTargetState
- CurrentState
- CurrentStateReasons
Configuration Options
The following new configuration options are added to Stat Server starting with release 8.5.108:
Option | Summary |
---|---|
allow-new-connections-during-overload | Allows new clients to connect during overload. |
allow-new-requests-during-overload | Allows opening new statistics during overload. |
cpu-cooldown-cycles | The number of cpu-poll-timeout cycles in a cooldown period (Cooldown period / cpu-poll-timeout). |
cpu-poll-timeout | Timeout of polling main thread CPU, in seconds. |
cpu-threshold-high | The higher boundary of the load range. |
cpu-threshold-low | The lower boundary of the load range. |
cut-debug-log | Controls the debug log in overload. |
protection | Enables/disables protection. |
qos-default-overload-policy | Default overload policy. |
qos-recovery-enable-lms-messages | Enables recovery-related LMS messages. |
The above options are configured in the [overload] section of the Stat Server application.
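As an illustration, the numeric options can be checked against their documented valid ranges before they are applied. The sketch below assumes the [overload] section has been exported as INI-style text, which is only a convenient representation for the example; the function `validate_overload_section` and the sample text are hypothetical.

```python
import configparser

# Sample [overload] section using the documented default values.
SAMPLE = """
[overload]
protection = true
cpu-threshold-low = 60
cpu-threshold-high = 80
cpu-poll-timeout = 10
cpu-cooldown-cycles = 30
cut-debug-log = true
qos-default-overload-policy = 0
"""

# Documented valid ranges for the numeric options.
RANGES = {
    "cpu-threshold-low": (0, 100),
    "cpu-threshold-high": (0, 100),
    "cpu-poll-timeout": (1, 60),
    "cpu-cooldown-cycles": (1, 100),
    "qos-default-overload-policy": (0, 2),
}


def validate_overload_section(text: str) -> list:
    """Return a list of error strings; an empty list means all values are in range."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["overload"]
    errors = []
    for name, (lowest, highest) in RANGES.items():
        if name in section:
            value = section.getint(name)
            if not lowest <= value <= highest:
                errors.append(f"{name} = {value} is outside {lowest}-{highest}")
    return errors


print(validate_overload_section(SAMPLE))  # []
```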
The overload policy may vary from statistic to statistic, depending on the end-user preferences. The default overload policy, defined by the qos-default-overload-policy option settings, can be overridden on the stat type level by the DynamicOverloadPolicy option in the [<stat type>] section:
Option | Values | Description |
---|---|---|
DynamicOverloadPolicy | | Defines actions that Stat Server may apply to a given statistic to reduce the overload. |
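The override rule can be sketched as a simple lookup: use the stat type's DynamicOverloadPolicy when it is present, otherwise fall back to the global default. The valid values of DynamicOverloadPolicy are not listed here, so the sketch assumes, without confirmation, that it is an integer with the same meaning as qos-default-overload-policy. The function name and the dictionary representation of a stat type section are hypothetical.

```python
def effective_overload_policy(stat_type_options: dict, default_policy: int = 0) -> int:
    """Pick the per-stat-type policy if configured, else the global default.

    Assumption: DynamicOverloadPolicy holds an integer encoded like
    qos-default-overload-policy; the source does not list its values.
    """
    raw = stat_type_options.get("DynamicOverloadPolicy")
    if raw is None:
        return default_policy
    return int(raw)


# A stat type without an override uses the global default.
print(effective_overload_policy({"Category": "CurrentState"}, default_policy=1))  # 1
# A stat type with an override uses its own value.
print(effective_overload_policy({"DynamicOverloadPolicy": "2"}, default_policy=1))  # 2
```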
LMS Messages
New LMS messages, associated with overload protection, are listed below:
- 10070|STANDARD|GCTI_SS_OVERLOAD_DETECT|Overload detected on %s (%d current CPU usage)
- 10071|STANDARD|GCTI_SS_OVERLOAD_END|Overload ended on %s (%d current CPU usage)
- 10072|STANDARD|GCTI_SS_OVERLOAD_RECOVERY_STARTED|Overload recovery started on %s (%d current CPU usage)
- 10073|STANDARD|GCTI_SS_OVERLOAD_RECOVERY_FAILED|Overload recovery failed on %s (%d current CPU usage)
- 10074|STANDARD|GCTI_SS_OVERLOAD_PROTECTION_ACTIVATED|Overload protection on %s activated
- 10075|STANDARD|GCTI_SS_OVERLOAD_PROTECTION_DEACTIVATED|Overload protection on %s deactivated
- Messages 10070 and 10071 are recommended for operations monitoring.
- Messages 10072 and 10073 are for debugging purposes only; they are disabled by default.
- Messages 10074 and 10075 are generated when the protection configuration option changes its value (or at startup). This information is written to the standard log because debug logging is cut when Stat Server is in overload. These messages are for troubleshooting only.
See also Stat Server Deployment Guide for more information on LMS messages.
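For monitoring scripts, the pipe-delimited entries listed above can be split into their four fields and the printf-style template filled in. This is a minimal Python sketch: the function `parse_lms_line` is hypothetical, and the values substituted for %s and %d (an application name and a CPU percentage) are assumptions about what the placeholders carry.

```python
def parse_lms_line(line: str) -> dict:
    """Split 'id|level|name|template' into named fields."""
    msg_id, level, name, template = line.split("|", 3)
    return {"id": int(msg_id), "level": level, "name": name, "template": template}


entry = parse_lms_line(
    "10070|STANDARD|GCTI_SS_OVERLOAD_DETECT|"
    "Overload detected on %s (%d current CPU usage)"
)
print(entry["id"], entry["name"])
# Assumed placeholder values: an application name and a CPU percentage.
print(entry["template"] % ("StatServer_1", 85))
# Overload detected on StatServer_1 (85 current CPU usage)
```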
Performance Counters
The following table includes new performance counters:
Counter | Description |
---|---|
cpu | Main-thread CPU percentage (% of single processor) |
pcpu | Process CPU percentage (% of total) |
shc | Stats hit count |
shcs | Stats hit count suppressed |
clens | Client events not sent |
opc | Overload periods count |
opd | Overload periods duration, in seconds |
osn | Overload stats normal |
osns | Overload stats not sent |
osnu | Overload stats not updated |
osnu | overload stats not updated |