Workflows get stuck in pausing state


(Eric Moody) #1

Hi,

We have some Mistral workflows that use pause-before when calling an external system. The external system then calls back to ST2 to resume the workflow when it completes.

However, we have been running into a problem where workflows get stuck in the pausing state. Since they are not in the paused state, they cannot be resumed.

We can work around the issue by manually editing the execution in mongodb and marking it as paused. We do that by:

mongo -u stackstorm -p ************* st2
db.action_execution_d_b.findOneAndUpdate({"_id" : ObjectId("5b44dac837a8ce5cebcdbb61")}, { $set: {"status":"paused"} }, { upsert: true })

Where ObjectID is the execution ID.

We would like to figure out why this is happening and prevent it. The workflows are rather large, so I suspect some values need to be tuned, but I am unsure which ones.


(W Chan) #2

An st2 action goes into the pausing or canceling state because one of its subtasks (or an action executed under it as part of a workflow) is still in an active state. In order to troubleshoot, you’ll have to look into the details of the st2 action and identify the children and their states.
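
As a rough illustration of that check, here is a minimal Python sketch that filters a list of child executions down to the ones still active. The function name, the dict layout, and the set of "active" status names are assumptions made for this example, not taken from the st2 code base, so verify them against your StackStorm version.

```python
# Assumed set of non-terminal statuses; adjust for your StackStorm version.
ACTIVE_STATUSES = {
    "requested", "scheduled", "running",
    "pausing", "resuming", "canceling",
}


def find_active_children(children):
    """Return the child executions whose status is still active.

    `children` is a hypothetical list of dicts with "id" and "status" keys,
    for example built from the task table that `st2 execution get` prints.
    """
    return [c for c in children if c.get("status") in ACTIVE_STATUSES]


children = [
    {"id": "task1", "status": "succeeded"},
    {"id": "task2", "status": "paused"},
    {"id": "task3", "status": "running"},
]

# Any child listed here would keep the parent in the pausing state.
for child in find_active_children(children):
    print(child["id"], child["status"])
```

If the list comes back empty while the parent is still pausing, the children are not what is blocking the transition.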


(Eric Moody) #3

When I looked yesterday, all of the subtasks were either completed or paused. Just the main mistral workflow was in the pausing state. Next time it happens I will double check that.

I am still hoping to find the reason why the task didn’t completely transition.


(W Chan) #4

So here’s a very simple example below which works for me.

ubuntu@cadmus:~/st2$ cat /opt/stackstorm/packs/sandbox/actions/workflows/pause_before.yaml 
version: '2.0'

sandbox.pause_before:
    tasks:
        task1:
            action: core.noop
            on-success:
                - task2
        task2:
            pause-before: true
            action: core.noop
            on-success:
                - task3
        task3:
            action: core.noop

Running the above workflow results in the following. Please note that the execution paused automatically. I then manually resumed the execution and it succeeded.

ubuntu@cadmus:~/st2$ st2 run sandbox.pause_before -a
To get the results, execute:
 st2 execution get 5b46656b8006e60b0b18e128

To view output in real-time, execute:
 st2 execution tail 5b46656b8006e60b0b18e128

ubuntu@cadmus:~/st2$ st2 execution get 5b46656b8006e60b0b18e128
id: 5b46656b8006e60b0b18e128
action.ref: sandbox.pause_before
parameters: None
status: paused
start_timestamp: Wed, 11 Jul 2018 20:15:39 UTC
end_timestamp: 
result: 
  tasks: []
+--------------------------+------------------------+-------+-----------+-----------------+
| id                       | status                 | task  | action    | start_timestamp |
+--------------------------+------------------------+-------+-----------+-----------------+
| 5b46656c8006e60b0b18e12b | succeeded (0s elapsed) | task1 | core.noop | Wed, 11 Jul     |
|                          |                        |       |           | 2018 20:15:40   |
|                          |                        |       |           | UTC             |
+--------------------------+------------------------+-------+-----------+-----------------+

ubuntu@cadmus:~/st2$ st2 execution resume 5b46656b8006e60b0b18e128
id: 5b46656b8006e60b0b18e128
action.ref: sandbox.pause_before
parameters: None
status: resuming
start_timestamp: Wed, 11 Jul 2018 20:15:39 UTC
end_timestamp: 
result: 
  tasks: []
+--------------------------+------------------------+-------+-----------+-----------------+
| id                       | status                 | task  | action    | start_timestamp |
+--------------------------+------------------------+-------+-----------+-----------------+
| 5b46656c8006e60b0b18e12b | succeeded (0s elapsed) | task1 | core.noop | Wed, 11 Jul     |
|                          |                        |       |           | 2018 20:15:40   |
|                          |                        |       |           | UTC             |
+--------------------------+------------------------+-------+-----------+-----------------+

ubuntu@cadmus:~/st2$ st2 execution get 5b46656b8006e60b0b18e128
id: 5b46656b8006e60b0b18e128
action.ref: sandbox.pause_before
parameters: None
status: succeeded (21s elapsed)
result_task: task3
result: 
  failed: false
  return_code: 0
  succeeded: true
start_timestamp: Wed, 11 Jul 2018 20:15:39 UTC
end_timestamp: Wed, 11 Jul 2018 20:16:00 UTC
+--------------------------+------------------------+-------+-----------+-----------------+
| id                       | status                 | task  | action    | start_timestamp |
+--------------------------+------------------------+-------+-----------+-----------------+
| 5b46656c8006e60b0b18e12b | succeeded (0s elapsed) | task1 | core.noop | Wed, 11 Jul     |
|                          |                        |       |           | 2018 20:15:40   |
|                          |                        |       |           | UTC             |
| 5b46657e8006e60b0b18e12d | succeeded (1s elapsed) | task2 | core.noop | Wed, 11 Jul     |
|                          |                        |       |           | 2018 20:15:58   |
|                          |                        |       |           | UTC             |
| 5b46657f8006e60b0b18e12f | succeeded (0s elapsed) | task3 | core.noop | Wed, 11 Jul     |
|                          |                        |       |           | 2018 20:15:59   |
|                          |                        |       |           | UTC             |
+--------------------------+------------------------+-------+-----------+-----------------+

(Eric Moody) #5

Yep, about 99% of the time the workflow pauses normally. Just about 1% of the time it gets stuck in the pausing state.

I did find this post, How to troubleshoot a mistral workflow that is stuck in "running" state?, which describes how to check the state in Mistral.

Next time this happens I’ll check with Mistral to see if the states are in sync. From there we can get a better idea of where the breakdown is.
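
A minimal sketch of that comparison in Python, assuming the two values have already been read by hand from st2 and from Mistral. The mapping table and the function name are assumptions for illustration; the real translation between Mistral states and st2 statuses may differ by version.

```python
# Assumed mapping from Mistral workflow states to st2 execution statuses.
MISTRAL_TO_ST2 = {
    "RUNNING": "running",
    "PAUSED": "paused",
    "SUCCESS": "succeeded",
    "ERROR": "failed",
    "CANCELLED": "canceled",
}


def states_in_sync(st2_status, mistral_state):
    """Return True when the st2 status matches what the Mistral state implies."""
    expected = MISTRAL_TO_ST2.get(mistral_state)
    return expected is not None and expected == st2_status


# A workflow that Mistral reports as PAUSED while st2 still shows pausing
# is out of sync, which would point at the st2 side of the transition.
print(states_in_sync("pausing", "PAUSED"))  # False
print(states_in_sync("paused", "PAUSED"))   # True
```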