Mistral-api process getting auto-restarted

virenderdubey · August 23, 2018, 1:44pm

We have one Mistral server with 8 mistral-api processes running on a host. When I restart the Mistral process, everything runs fine for some hours.
After 8-10 hours, the mistral-api processes start getting restarted automatically, with the following error in the logs:

[2018-08-23 13:37:20 +0000] [3239] [INFO] Worker exiting (pid: 3239)
[2018-08-23 13:37:21 +0000] [3390] [INFO] Booting worker with pid: 3390
[2018-08-23 13:38:49 +0000] [23257] [CRITICAL] WORKER TIMEOUT (pid:3348)

Either at the time of the restart or just after mistral-api starts, all workflow executions fail with error code 1. Below is the stack trace:

2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier [req-052991f8-d388-484e-bd88-73c8f30f6788 - - - - -] Unable to process event for publisher "st2".: Exception: [a34f2cd0-75e2-4af0-9dc3-3d878e846345] Unable to publish event because st2 returned status code 401. {
    "faultstring": "Unauthorized"
}
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier Traceback (most recent call last):
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier   File "/opt/stackstorm/mistral/lib/python2.7/site-packages/mistral/notifiers/default_notifier.py", line 39, in notify
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier     publisher.publish(ex_id, data, event, timestamp, **params)
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier   File "/opt/stackstorm/mistral/lib/python2.7/site-packages/st2mistral/notifiers/stackstorm_notifier.py", line 297, in publish
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier     func(ex_id, data, event, timestamp, **kwargs)
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier   File "/opt/stackstorm/mistral/lib/python2.7/site-packages/st2mistral/notifiers/stackstorm_notifier.py", line 145, in on_workflow_status_update
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier     'status code %s. %s' % (root_id, resp.status_code, resp.text)
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier Exception: [a34f2cd0-75e2-4af0-9dc3-3d878e846345] Unable to publish event because st2 returned status code 401. {
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier     "faultstring": "Unauthorized"
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier }
2018-08-23 18:39:15.289 23325 ERROR mistral.notifiers.default_notifier

Kindly help us find the cause of this.
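
To see whether the restarts cluster at particular times, it can help to count the WORKER TIMEOUT events per hour. Below is a minimal sketch, assuming gunicorn-style log lines in the format shown above. The function name `count_worker_timeouts` and the `sample` list are illustrative, not part of Mistral or StackStorm.

```python
import re
from collections import Counter

# Matches gunicorn-style lines such as:
# [2018-08-23 13:38:49 +0000] [23257] [CRITICAL] WORKER TIMEOUT (pid:3348)
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"\[(?P<master>\d+)\] \[(?P<level>\w+)\] (?P<msg>.*)$"
)


def count_worker_timeouts(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of WORKER TIMEOUT lines."""
    per_hour = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a gunicorn-style line (e.g. a Mistral traceback line)
        if match.group("level") == "CRITICAL" and match.group("msg").startswith("WORKER TIMEOUT"):
            hour_bucket = match.group("ts")[:13]  # keep date plus hour only
            per_hour[hour_bucket] += 1
    return per_hour


# Hypothetical usage with the three lines quoted in the post.
sample = [
    "[2018-08-23 13:37:20 +0000] [3239] [INFO] Worker exiting (pid: 3239)",
    "[2018-08-23 13:37:21 +0000] [3390] [INFO] Booting worker with pid: 3390",
    "[2018-08-23 13:38:49 +0000] [23257] [CRITICAL] WORKER TIMEOUT (pid:3348)",
]
timeouts = count_worker_timeouts(sample)
```

In practice you would pass the lines of the mistral-api log file instead of `sample`. A steadily rising count per hour would suggest the API is getting slower over time rather than failing at random.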

lhill · August 24, 2018, 4:34pm

The 401 error makes me think it might be related to authentication token TTLs.

Have you made any changes to any of the defaults around token TTLs?
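
A quick way to check is to read the `[auth]` section of st2.conf. The sketch below assumes the options are named `token_ttl` and `service_token_ttl` and that both default to 86400 seconds (24 hours) when unset. Treat the option names and the default as assumptions to verify against your StackStorm version. `read_token_ttls` is a hypothetical helper name.

```python
import configparser

# Assumed default in seconds (24 hours); verify for your StackStorm version.
ASSUMED_DEFAULT_TTL = 86400


def read_token_ttls(conf_text):
    """Return the effective token TTLs (seconds) from st2.conf-style text."""
    # interpolation=None so that '%' characters in values are not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    ttls = {}
    for option in ("token_ttl", "service_token_ttl"):
        # fallback also covers a missing [auth] section
        ttls[option] = parser.getint("auth", option, fallback=ASSUMED_DEFAULT_TTL)
    return ttls


# Hypothetical usage: an [auth] section with no TTL overrides.
unchanged = read_token_ttls("[auth]\nmode = standalone\napi_url =\n")
overridden = read_token_ttls("[auth]\ntoken_ttl = 3600\n")
```

If both values come back as the default, the TTL settings have not been overridden in that file.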

virenderdubey · August 27, 2018, 5:51am

No, there is no change in the authentication TTL configuration. The only TTL change is for purging old logs and data.


[api]
# Host and port to bind the API server.
host = <ip>
port = 9101
logging = /etc/st2/logging.api.conf
mask_secrets = True
# allow_origin is required for handling CORS in st2 web UI.
# allow_origin = http://myhost1.example.com:3000,http://myhost2.example.com:3000

[stream]
logging = /etc/st2/logging.stream.conf

[sensorcontainer]
logging = /etc/st2/logging.sensorcontainer.conf

[rulesengine]
logging = /etc/st2/logging.rulesengine.conf

[actionrunner]
logging = /etc/st2/logging.actionrunner.conf
virtualenv_opts = --always-copy


[resultstracker]
query_interval = 1
thread_pool_size = 100
logging = /etc/st2/logging.resultstracker.conf

[notifier]
logging = /etc/st2/logging.notifier.conf

[exporter]
logging = /etc/st2/logging.exporter.conf

[auth]
host = 0.0.0.0
port = 9100
use_ssl = False
debug = False
enable = True
logging = /etc/st2/logging.auth.conf

mode = standalone

# Note: Settings below are only used in "standalone" mode
backend = flat_file
backend_kwargs = {"file_path": "/etc/st2/htpasswd"}

# Base URL to the API endpoint excluding the version (e.g. http://myhost.net:9101/)
api_url =

[system]
base_path = /opt/stackstorm

[webui]
# webui_base_url = https://mywebhost.domain

[syslog]
host = 127.0.0.1
port = 514
facility = local7
protocol = udp

[log]
excludes = requests,paramiko
redirect_stderr = False
mask_secrets = True

[system_user]
user = stanley
ssh_key_file = /home/stanley/.ssh/id_rsa

[messaging]
url = amqp://stackstorm:stackstorm@<rabbitmq_ip>:5672//stackstorm

[ssh_runner]
remote_dir = /tmp

[mistral]
api_url = http://<ip>:9101
v2_base_url = http://<ip>:8989/v2

[coordination]
url = kazoo://<ip>:2181

[garbagecollector]
logging = /etc/st2/logging.garbagecollector.conf
action_executions_ttl = 10
action_executions_output_ttl = 10
trigger_instances_ttl = 1

lhill · August 27, 2018, 7:41pm

Do you have very long-running workflows, or items paused for > 24 hours?

virenderdubey · August 29, 2018, 6:07am

Nope, our workflows complete in 4-5 minutes at most.
Just an FYI: we had ~3 months of data in the Mistral DB, and as soon as we cleaned it up to 3 days, the error stopped appearing. However, it is a little early to say it is permanently fixed. We will monitor for 1-2 more days and then confirm.
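
If the cleanup turns out to be the fix, the retention can be kept at 3 days automatically instead of by hand. The sketch below assumes Mistral's `[execution_expiration_policy]` section in mistral.conf, with `evaluation_interval` and `older_than` both expressed in minutes. Check the section and option names against your Mistral version. The helper name and the 120-minute interval are illustrative choices.

```python
def expiration_policy_fragment(keep_days, evaluation_interval_minutes=120):
    """Render a mistral.conf fragment that expires executions older than keep_days.

    Assumes both options are expressed in minutes.
    """
    older_than_minutes = keep_days * 24 * 60
    return (
        "[execution_expiration_policy]\n"
        "evaluation_interval = {}\n"
        "older_than = {}\n"
    ).format(evaluation_interval_minutes, older_than_minutes)


# Hypothetical usage: keep 3 days of executions, as in the manual cleanup above.
fragment = expiration_policy_fragment(3)
```

For 3 days this yields `older_than = 4320`. The Mistral services would need a restart for a change in mistral.conf to take effect.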