How to troubleshoot a mistral workflow that is stuck in "running" state?

workflows
mistral
troubleshooting

#1

Is there an easy way to determine why a mistral workflow still sits on Running even though the last action that ran within the workflow returned Failed? In the past I found it was because the action that would get called next (there’s a catch-all on-error in each action) had some bad jinja - but I removed all the jinja and the workflow is still Running.

# st2 --version
st2 2.6.0

e.g.

$ st2 execution get 5ae9db797709b603ee05d827
id: 5ae9db797709b603ee05d827
action.ref: [redacted]_pdb_refresh.checkin
parameters:
  callback_authorization_path: secret/cloud/prod/api_token
  callback_url: none_specified
  cdb_host: clouddb01.test.[redacted].com
  cdb_version: 12.1.0.2
  desktop_db: false
  pdb_class: prt
  pdb_name: prt_c572297f4f2a49529cf3802fbd
status: running (412s elapsed)
start_timestamp: Wed, 02 May 2018 15:38:32 UTC
end_timestamp:
+--------------------------+------------------------+---------------------+---------------------+-----------------+
| id                       | status                 | task                | action              | start_timestamp |
+--------------------------+------------------------+---------------------+---------------------+-----------------+
| 5ae9db7c7709b60bcb3e8b09 | succeeded (3s elapsed) | queue_state         | [redacted]_common.loop_un | Wed, 02 May     |
|                          |                        |                     | til                 | 2018 15:38:35   |
|                          |                        |                     |                     | UTC             |
| 5ae9db7e7709b60bcb3e8b0b | succeeded (3s elapsed) | set_environment_var | consul.kv_get       | Wed, 02 May     |
|                          |                        | iable               |                     | 2018 15:38:38   |
|                          |                        |                     |                     | UTC             |
| 5ae9db827709b60bcb3e8b0d | failed (3s elapsed)    | shutdown_pdb        | [redacted]_dba_oracle.run | Wed, 02 May     |
|                          |                        |                     | _sql                | 2018 15:38:42   |
|                          |                        |                     |                     | UTC             |
+--------------------------+------------------------+---------------------+---------------------+-----------------+

Here’s my workflow:

version: "2.0"

name: [redacted]_pdb_refresh.checkin
description: >
    Perform a checkin on a PDB
workflows:
  main:
    type: direct
    input:
      - cdb_user
      - vault_path
      - pdb_class
      - pdb_name
      - callback_url
      - desktop_db
      - cdb_host
      - cdb_version
      - ancillary_email
      - callback_authorization_path
    output:
      messages: <% $.status_message %>
    tasks:
      # This task will check the value of [redacted]/stackstorm/[redacted]_pdb_refresh/loop_until/{{ _.pdb_class }}/enabled in Consul
      # if it is 'true' then we'll continue on with the workflow.  If it's false
      # then we loop.  This is a way of 'pausing' the queue...
      queue_state:
        action: [redacted]_common.loop_until
        input:
          loop_key: [redacted]/stackstorm/[redacted]_pdb_refresh/loop_until/{{ _.pdb_class }}/enabled
          loop_method: consul
          loop_until_value: "true"
          sleep_seconds: 30
          loop_iterations: 5
          loop_forever: true
        on-success:
          - set_environment_variable
        on-error:
          - post_notification
        publish:
          status_message: "Queue appears to be unpaused.  Proceeding with PDB checkin"
        publish-on-error:
          status_message: "Failed to complete task `queue_state`"

      # This task just sets an environment variable for the user who submitted this workflow.
      set_environment_variable:
        action: consul.kv_get
        input:
          key: [redacted]/stackstorm/[redacted]_pdb_refresh/cdb_hosts
          from_json: true
        publish:
          environment_name: <% switch(bool($.pdb_name) => $.pdb_name, bool(env().get('__actions').get('st2.action').st2_context.parent.get('api_user')) => "dtdb_"+$.pdb_class+"_"+env().get('__actions').get('st2.action').st2_context.parent.api_user, 1 = 1 => env().get('__actions').get('st2.action').st2_context.parent.user) %>
          new_pdb_name: <% switch(bool($.desktop_db) => concat('dtdb_', $.pdb_class, '_', $.pdb_name.toLower()), 1=1 => $.pdb_name.toLower()) %>
          cdb_servicename: <% task('set_environment_variable').result.result[1].Value[$.pdb_class].cdb_map[$.cdb_host][$.cdb_version] %>
        on-success:
          - shutdown_pdb

      shutdown_pdb:
        action: [redacted]_dba_oracle.run_sql
        input:
          host: "{{ _.cdb_host }}"
          user: "{{ _.cdb_user }}"
          vault_path: "{{ _.vault_path }}"
          service_name: "{{ _.cdb_servicename }}"
          bad_sql_ok: false
          sysoper: false
          sql: >
           DECLARE
              PDB_NAME VARCHAR2(64) := '{{ _.new_pdb_name }}';
              pdb_already_closed EXCEPTION ;
              PRAGMA EXCEPTION_INIT(pdb_already_closed, -65020);
              pdb_doesnot_exist EXCEPTION ;
              PRAGMA EXCEPTION_INIT(pdb_doesnot_exist, -1031);
            BEGIN
              EXECUTE IMMEDIATE 'ALTER PLUGGABLE DATABASE '||PDB_NAME||'
                CLOSE ABORT';
              EXCEPTION
              WHEN pdb_already_closed THEN
                dbms_output.put_line('PDB '||PDB_NAME||' is already closed.');
                NULL;
              WHEN pdb_doesnot_exist THEN
                dbms_output.put_line('PDB '||PDB_NAME||' does not exist.');
                NULL;
              WHEN OTHERS THEN
                dbms_output.put_line('Some other error occured. '||sqlerrm);
                raise;
            END;
        publish-on-error:
          status_message: "Failed to complete task shutdown_pdb: {{ task('shutdown_pdb').result.result.ERRORS }}"
        publish:
          status_message: "Successfully shutdown PDB {{ _.pdb_name }}"
        on-success:
          - drop_pdb: <% $.desktop_db = true %>
          - umount_pdb_vol: <% $.desktop_db = false %>
        on-error:
          - post_notification

      umount_pdb_vol:
        action: [redacted]_dba_oracle.unmount_nfs_volume
        input:
          hostnames:
            - "{{ _.cdb_host }}"
          mountpoint: "/oradata/cloud/{{ _.pdb_class }}/{{ _.new_pdb_name }}"
          destroy: true
          cls: "{{ _.pdb_class }}"
        publish-on-error:
          status_message: "Failed to complete task unmount_pdb_vol: {{ task('umount_pdb_vol').result.result.errors[0] }}"
        publish:
          status_message: "Successfully completed task unmount_pdb_vol: {{ task('umount_pdb_vol').result.result.results[0] }}"
        on-success:
          - drop_pdb
        on-error:
          - post_notification

      drop_pdb:
        action: [redacted]_dba_oracle.run_sql
        input:
          host: "{{ _.cdb_host }}"
          user: "{{ _.cdb_user }}"
          vault_path: "{{ _.vault_path }}"
          service_name: "{{ _.cdb_servicename }}"
          bad_sql_ok: false
          sysoper: false
          sql: >
           DECLARE
              PDB_NAME VARCHAR2(30) := '{{ _.new_pdb_name }}';
              pdb_doesnot_exist EXCEPTION ;
              PRAGMA EXCEPTION_INIT(pdb_doesnot_exist, -65011);
            BEGIN
              EXECUTE IMMEDIATE 'DROP PLUGGABLE DATABASE '||PDB_NAME||'
                INCLUDING DATAFILES';
              EXCEPTION WHEN pdb_doesnot_exist THEN
                dbms_output.put_line('This PDB doesnt seem to exist.');
                NULL;
              WHEN OTHERS THEN
                dbms_output.put_line('Some other error occured. '||sqlerrm);
                raise;
            END;
        publish-on-error:
          status_message: "Failed to complete task drop_pdb: {{ task('drop_pdb').result.result.ERRORS }}"
        publish:
          status_message: "Successfully dropped pdb {{ _.new_pdb_name }}"
        on-success:
          - make_callback_succeeded: "{{ _.callback_url != 'none_specified' }}"
        on-error:
          - post_notification

      make_callback_succeeded:
        action: core.http
        input:
          url: "{{ _.callback_url }}"
          body: '{"status":"succeeded","pdb_class":"{{ _.pdb_class }}","pdb_name":"{{ _.pdb_name }}"}'
          method: PUT
          verify_ssl_cert: false
        on-error:
          - post_notification

      post_notification:
        action: slack.chat.postMessage
        input:
          channel: "dba_alerts"
          text: |
            There was a failure checking in the pdb *{{ _.pdb_name }}* from host *{{ _.cdb_host }}*
            ```{{ _.status_message }}```
            For more information check the stackstorm execution_id @
            https://dbaautomation.dba.overstock.com.com/#/history/{{ env().st2_execution_id }}/general

          username: "StackStorm Bot"
          icon_emoji: ":robot-arm:"
          as_user: "false"
        on-complete:
          - fail

Workflows get stuck in pausing state
(W Chan) #3

Can you identify the mistral workflow execution ID and then query mistral to see what state the workflow execution is in and what tasks has been recorded at mistral? Please post the results here.

# Identify the Mistral workflow execution ID which is different than the st2 action execution ID
st2 execution get <st2-action-execution-id> -dj | grep workflow_execution_id

# Query the state of the workflow execution in Mistral
mistral execution-get <workflow-execution-id>

# Query the list of task executions for the workflow at Mistral
mistral task-list <workflow-execution-id>

#4
[root@dbaautomation01 mistral]# st2 execution get 5ae9f6dfe390ea125e1d541f -dj | grep workflow_execution_id
                "workflow_execution_id": "75bc2fb0-d29f-48d1-b83f-74550736b20b",
                "workflow_execution_id": "75bc2fb0-d29f-48d1-b83f-74550736b20b",
                "workflow_execution_id": "75bc2fb0-d29f-48d1-b83f-74550736b20b",
[root@dbaautomation01 mistral]# mistral execution-get 75bc2fb0-d29f-48d1-b83f-74550736b20b
WARNING (keystone) You must either provide a valid token or a password (api_key) and a user.
+--------------------+--------------------------------------+
| Field              | Value                                |
+--------------------+--------------------------------------+
| ID                 | 75bc2fb0-d29f-48d1-b83f-74550736b20b |
| Workflow ID        | a327a6d3-5c20-4e9e-b790-ce9bd2a786ce |
| Workflow name      | [redacted]_pdb_refresh.checkin.main  |
| Workflow namespace |                                      |
| Description        |                                      |
| Task Execution ID  | <none>                               |
| State              | RUNNING                              |
| State info         | None                                 |
| Created at         | 2018-05-02 17:34:24                  |
| Updated at         | 2018-05-02 17:34:24                  |
+--------------------+--------------------------------------+
[root@dbaautomation01 mistral]# mistral task-list a327a6d3-5c20-4e9e-b790-ce9bd2a786ce
WARNING (keystone) You must either provide a valid token or a password (api_key) and a user.
+----+------+---------------+--------------------+--------------+-------+------------+------------+------------+
| ID | Name | Workflow name | Workflow namespace | Execution ID | State | State info | Created at | Updated at |
+----+------+---------------+--------------------+--------------+-------+------------+------------+------------+
|    |      |               |                    |              |       |            |            |            |
+----+------+---------------+--------------------+--------------+-------+------------+------------+------------+
[root@dbaautomation01 mistral]# mistral task-list 75bc2fb0-d29f-48d1-b83f-74550736b20b
WARNING (keystone) You must either provide a valid token or a password (api_key) and a user.
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID                                   | Name                     | Workflow name                 | Workflow namespace | Execution ID                         | State   | State info | Created at          | Updated at          |
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| a12e07b7-579b-41f6-8b96-16cd2929e79b | queue_state              | [redacted]_pdb_refresh.checkin.main |                    | 75bc2fb0-d29f-48d1-b83f-74550736b20b | SUCCESS | None       | 2018-05-02 17:34:24 | 2018-05-02 17:35:50 |
| 455c66eb-6af9-4f57-af0c-1651e40ca474 | shutdown_pdb             | [redacted]_pdb_refresh.checkin.main |                    | 75bc2fb0-d29f-48d1-b83f-74550736b20b | RUNNING | None       | 2018-05-02 17:34:26 | <none>              |
| 8904f2f5-c63b-47bf-a468-6280395a4c6b | set_environment_variable | [redacted]_pdb_refresh.checkin.main |                    | 75bc2fb0-d29f-48d1-b83f-74550736b20b | SUCCESS | None       | 2018-05-02 17:35:50 | 2018-05-02 17:34:26 |
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+

So it looks like even though shutdown_pdb is red in the interface (and shows a status of ‘Failed’) it’s actually still running?


(W Chan) #5

Can you post the output of st2 execution get 5ae9db827709b60bcb3e8b0d on the shutdown_pdb execution in st2? I want to see how/why it failed in st2. Please remove any sensitive info before posting. Thanks.


#6

Hmm, those id’s aren’t found anymore. Do they age out?

# st2 execution get 5ae9db827709b60bcb3e8b0d
Action Execution "5ae9db827709b60bcb3e8b0d" is not found.

(W Chan) #7

Yes, there’s garbage collection. Configuration at Purging Old Operational Data — StackStorm 2.7.1 documentation.


(W Chan) #8

So can you reproduce this issue? If you see this again, please pick up where we left off and provide refreshed data. Thanks.


#9

Here’s another run.

Mistral shows the action as RUNNING

 $ mistral task-list 4af91669-b7ae-441c-87a5-1ebf8a82f394
WARNING (keystone) You must either provide a valid token or a password (api_key) and a user.
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID                                   | Name                     | Workflow name                 | Workflow namespace | Execution ID                         | State   | State info | Created at          | Updated at          |
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| fd5d9757-a079-4ef1-b50f-342f1c8fe479 | queue_state              | [redacted]_pdb_refresh.checkin.main |                    | 4af91669-b7ae-441c-87a5-1ebf8a82f394 | SUCCESS | None       | 2018-05-04 15:27:52 | 2018-05-04 15:28:55 |
| 945a9446-64ed-4f29-a9ec-73e8223e755a | set_environment_variable | [redacted]_pdb_refresh.checkin.main |                    | 4af91669-b7ae-441c-87a5-1ebf8a82f394 | SUCCESS | None       | 2018-05-04 15:28:55 | 2018-05-04 15:28:57 |
| d771f9c2-4c81-4f83-b7a5-a12d7ee714ab | shutdown_pdb             | [redacted]_pdb_refresh.checkin.main |                    | 4af91669-b7ae-441c-87a5-1ebf8a82f394 | RUNNING | None       | 2018-05-04 15:28:57 | <none>              |
+--------------------------------------+--------------------------+-------------------------------+--------------------+--------------------------------------+---------+------------+---------------------+---------------------+

Stackstorm shows it as failed:

$ st2 execution get 5aec7bf47709b603d140658e
id: 5aec7bf47709b603d140658e
action.ref: [redacted]_pdb_refresh.checkin
parameters:
  callback_authorization_path: secret/cloud/prod/api_token
  callback_url: none_specified
  cdb_host: clouddb01.test.[redacted].com
  cdb_version: 12.2.0.1
  desktop_db: false
  pdb_class: hub
  pdb_name: hub_5d2ae12297604e57b76c578d546bf03f
status: running (318s elapsed)
start_timestamp: Fri, 04 May 2018 15:27:48 UTC
end_timestamp:
+--------------------------+-------------------------+----------------------+----------------------+------------------+
| id                       | status                  | task                 | action               | start_timestamp  |
+--------------------------+-------------------------+----------------------+----------------------+------------------+
| 5aec7bf97709b609cec9f817 | succeeded (62s elapsed) | queue_state          | [redacted]_common.loop_unt | Fri, 04 May 2018 |
|                          |                         |                      | il                   | 15:27:53 UTC     |
| 5aec7c387709b609cec9f819 | succeeded (1s elapsed)  | set_environment_vari | consul.kv_get        | Fri, 04 May 2018 |
|                          |                         | able                 |                      | 15:28:56 UTC     |
| 5aec7c397709b609cec9f81b | failed (3s elapsed)     | shutdown_pdb         | [redacted]_dba_oracle.run_ | Fri, 04 May 2018 |
|                          |                         |                      | sql                  | 15:28:57 UTC     |
+--------------------------+-------------------------+----------------------+----------------------+------------------+

(Denis) #10

Have the same issue.

$ st2 execution get 5af5507fc221997ad85fcc6b
id: 5af5507fc221997ad85fcc6b
action.ref: bcd.deploy-release
parameters:
  level: dev
  release: release.6.18.0
status: running (195827s elapsed)
start_timestamp: Fri, 11 May 2018 08:12:47 UTC
end_timestamp:
+--------------------------+--------------------------+--------------+----------------------+-------------------------------+
| id                       | status                   | task         | action               | start_timestamp               |
+--------------------------+--------------------------+--------------+----------------------+-------------------------------+
[skipped...]
| 5af552bdc221997ad85fcc80 | failed (122414s elapsed) | eagle        | core.local           | Fri, 11 May 2018 08:22:21 UTC |
| 5af553b6c221997ad85fcc82 | succeeded (0s elapsed)   | notify_slack | chatops.post_message | Fri, 11 

May 2018 08:26:30 UTC |
±-------------------------±-------------------------±-------------±---------------------±------------------------------+

BUT “st2 execution get 5af552bdc221997ad85fcc80 -dj|grep workflow_execution_id” show me empty output.
stackstorm version is 2.7.1-2

Workflow looks like this:

autocomplete:
  publish:
    task_name: "<% $.level %>: Build autocomplete finished"
  publish-on-error:
    task_name: "<% $.level %>: Build autocomplete failed"
  workflow: eagle
  wait-before: 60
  input:
    celerybeat_hostname: <% $.celerybeat_hostname %>
    cmd: "python manage.py build_autocomplete 2>&1 >> autocomplete.out"
  retry:
    count: 3
    delay: 60
  on-success:
    - next_tasks
    - notify_slack
  on-error:
    - notify_slack
    - fail
eagle:
type: direct
input:
  - celerybeat_hostname
  - cmd
output:
  stdout: <% $.stdout %>
  stderr: <% $.stderr %>
tasks:
  eagle:
    action: core.local
    publish:
      stderr: "<% task(eagle).result.stderr %>"
      stdout: "<% task(eagle).result.stdout %>"
    input:
      timeout: 259200 # 3 days
      cmd: "ssh <% $.celerybeat_hostname %> --no-tty -c \"<% $.cmd %>\" 2>&1|tail -n 1000"

(Nick Maludy) #11

Guessing this is related to: Undefined variables cause mistral workflows to hang (st2 2.2.1) · Issue #3723 · StackStorm/st2 · GitHub


#12

Maybe? Although in his example the status in st2 is ‘RUNNING’ - but with me st2 is ‘FAILED’ while mistral is ‘RUNNING’.


(W Chan) #13
  • Are these failures consistent? Can they be reproduced?
  • Do you still have the details for shutdown_pdb? I would like to see the st2 execution get for this action execution.
  • As for the latest autocomplete workflow, is $.level defined somewhere?