Orquesta workflow is not changing to successful. It is running forever

wingi · February 25, 2021, 6:11pm

Hi,

I have a simple workflow collecting an array and afterwards starting core.http for every item in the array.
Therefore I use the with-items solution:

Do_Loop:
    with: <% ctx().list_of_items %>
    action: core.http
    input:
      url: "https://some.service.net/api/add/<% item().value %>"
      method: "PUT"
      headers: {"X-APP":"<% st2kv('system.appid') %>","X-APIKEY":"<% st2kv('system.apikey', decrypt=>true) %>","content-type":"application/json","Accept":"application/json"}
      body: '{
               "host": "<% item().value %>",
               "description": "Detail: <% item().event_id %>"
             }'

In my current example, the list contains 286 values which leads to 286 http action executions.
First I had the issue that the actions took longer and longer till everything stuck. After searching issues, etc. I found a PR where it is stated to use a coordination backend. (Acquire lock in workflow engine before handling action execution completion by m4dcoder · Pull Request #4595 · StackStorm/st2 · GitHub)

Therefore I installed and configure redis. Since then the actions run smooth and in short time, but the overall workflow is still in progress even if all 286 actions are executed.

By checking the wfl logs I get the following:
tail -F /var/log/st2/st2workflowengine.log
2021-02-25 18:04:23,417 140710782507280 DEBUG keyvalue [-] get_key key_id: appid, scope: st2kv.system, user: st2admin, decrypt: False
2021-02-25 18:04:23,706 140710782507280 DEBUG keyvalue [-] get_key key_id: apikey, scope: st2kv.system, user: st2admin, decrypt: True
2021-02-25 18:04:23,986 140710782507280 DEBUG keyvalue [-] get_key key_id: appid, scope: st2kv.system, user: st2admin, decrypt: False
2021-02-25 18:04:24,068 140710782507280 DEBUG keyvalue [-] get_key key_id: apikey, scope: st2kv.system, user: st2admin, decrypt: True
2021-02-25 18:04:24,553 140710782507280 DEBUG keyvalue [-] get_key key_id: appid, scope: st2kv.system, user: st2admin, decrypt: False
2021-02-25 18:04:24,638 140710782507280 DEBUG keyvalue [-] get_key key_id: apikey, scope: st2kv.system, user: st2admin, decrypt: True
2021-02-25 18:04:24,921 140710782507280 DEBUG keyvalue [-] get_key key_id: appid, scope: st2kv.system, user: st2admin, decrypt: False

It looks like it is still collecting the values from the keystore, but they are no longer needed. Every http action is already satisfied.
It is visible that all actions of the workflow already succeeded by looking on the WebUI and the CLI.

Has someone any idea what is going on here?

Looks like the workflow is running forever.

wingi · February 25, 2021, 6:24pm

Update:

After more than one hour it changed now to failed with following output:

{
  "output": null,
  "errors": [
    {
      "message": "Execution failed. See result for details.",
      "type": "error",
      "task_id": "Do_Loop"
    },
    {
      "message": "ToozConnectionError: Timeout reading from socket",
      "type": "error",
      "route": 1,
      "task_id": "Do_Loop"
    }
  ]
}

But the workflowengine log still goes on and by viewing the result in the WebUI, the finish time is still updated every time I click on the WFL:

arms11 · February 25, 2021, 9:54pm

Not exactly the answer you might be looking for, but have you tried running this in shell script or python runner instead of with-items in a declarative way? Just trying to understand if there is a growing contention with a tight loop. I could be wrong. Also, what version of ST2 is this?

wingi · February 26, 2021, 2:26pm

Hi,

yes I although thought about redesigning it to just use one action from Stackstorm perspective but this only solves this case. I think in general it should be possible to use although many with-items to benefit from the modular idea of it.

I used Stackstorm v3.2.0.

After another hour, the time is now no longer changing and the logs stopped.

wingi · March 3, 2021, 11:01am

UPDATE - Complete summary:
Installation: Stackstorm OVA Image
Stackstorm v 3.2.0
Redis installed as coordination backend.

I figured now out it is related to specific usage of with. Therefore I created two examples to demonstrate the behaviour. The examples contain less wfl context/data and is therefore faster than my real world use case, but in general it demonstrates the behaviour:

Example 1:

version: 1.0
description: A debugging workflow.

input:
  - array: <% list(range(60)) %>

output:
  - Array: <% ctx().array %>

tasks:
  Start:
    action: core.echo
    input:
      message: "Start"
    next:
      - when: <% completed() %>
        publish:
        - result: <% result() %>
        do: Do_Loop

###Loop over array
    with: <% ctx().array %>
    action: core.echo
    input:
      message: "This is just a test Item: <% item() %>"

Example 2:

version: 1.0
description: A debugging workflow.

input:
  - array: <% list(range(60)) %>

output:
  - Array: <% ctx().array %>

tasks:
  Start:
    action: core.echo
    input:
      message: "Start"
    next:
      - when: <% completed() %>
        publish:
        - result: <% result() %>
        do: Do_Loop

###Loop over array
    with:
        concurrency: 1
        items: <% ctx().array %>
    action: core.echo
    input:
      message: "This is just a test Item: <% item() %>"

Behaviour:
Example 1:
Behaviour is seen as before described in my posts. The looped steps are executed without any real delay but after the last step the overall wfl still runs for additional ~25 seconds till it succeeds.

Example 2:
The looped steps are executed one after the other. After the last step was executed the overall wfl succeeds within 3 seconds.

In these examples it doesn’t seem to be an issue because it only takes 25 seconds to finish, but as you can see in my previous posts for the real world use case it takes hours which leads to issues.

Can someone reproduce this issue?
Anything I can do to solve this issue while running actions in parallel?

If any further info is needed, please let me know here or in Slack.

Thanks.