Orquesta policy delay

We are using StackStorm policies to limit the number of concurrent executions of a particular workflow, as documented at the top of the Policies page (StackStorm 3.1.0 documentation), like this:

name: my_action.concurrency
description: Limits the concurrent executions for my action.
enabled: true
resource_ref: demo.my_action
policy_type: action.concurrency
parameters:
    action: delay
    threshold: 10

This works great: we get to 10 executions running, and the 11th, 12th, etc. go to “delayed” until some of the first 10 finish.

The app consumes messages from Kafka. I can see two options for consuming them:
A. Read everything from Kafka and invoke a StackStorm execution per message. Once the limit of 10 executions is hit, the rest go to ‘delayed’ until ST2 can work on them. This uses ST2 as a queue, and we have hit a limit of several hundred delayed executions, after which ST2 almost crashed.
OR
B. Detect when ST2 is able to accept more work (backpressure). For each Kafka message, we would invoke ST2 and then check whether the execution went to ‘delayed’. If so, sleep and poll ST2 until the execution reaches ‘running’ or a terminal state, then get the next Kafka message and repeat.
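
To make Option B concrete, here is a minimal sketch of the polling loop in Python. It is not tied to any particular ST2 client: `start_execution` and `get_status` are hypothetical callables you would implement against the ST2 API, and the 5-second poll interval is an arbitrary assumption. Treating ‘requested’ and ‘scheduled’ as "still waiting" alongside ‘delayed’ is also an assumption.

```python
import time
from typing import Callable, Iterable

# Statuses treated as "not yet picked up". 'delayed' is the state discussed
# above; including 'requested' and 'scheduled' is an assumption.
WAITING_STATUSES = {"requested", "scheduled", "delayed"}


def wait_until_accepted(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the execution leaves the waiting statuses; return that status."""
    while True:
        status = get_status()
        if status not in WAITING_STATUSES:
            # Either 'running' or a terminal state: ST2 has accepted the work.
            return status
        sleep(poll_interval)


def consume(
    messages: Iterable[object],
    start_execution: Callable[[object], str],  # hypothetical: returns an execution id
    get_status: Callable[[str], str],          # hypothetical: returns the status string
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Take one message at a time, blocking while its execution is waiting."""
    for message in messages:
        execution_id = start_execution(message)
        wait_until_accepted(lambda: get_status(execution_id), poll_interval, sleep)
```

With this shape, at most one execution sits in ‘delayed’ at any time, so the backlog stays in Kafka rather than in ST2.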

Questions:

  1. Which is best, Option A or Option B? Is there a better way to do this?
  2. For Option B, how long will the ‘delay’ hold? Is there a timeout for the delay? Can we set it? What state do the workflows go to after the timeout is reached?

Ideally, I would say Option A, because you don’t have to keep polling st2, which is more resource-heavy. Can you provide any more context on the limit you’re hitting with Option A? As I recall, delayed action executions do not consume resources. As each action execution completes, it will query for the next delayed execution to resume.

What seems to have happened is that every 2500 milliseconds the scheduler re-polls to see if it can drive the delayed workflows --> scheduled --> running. With enough (400 - 600) workflows in ‘delayed’, the scheduler and/or Redis became swamped and CPU-bound. We had MongoDB problems at the same time, so it is a little difficult to tell.

For Option B, any idea how long things can stay in ‘delayed’?

How long an action execution stays in delayed as a result of the concurrency policy depends on what the action is doing. So you probably have a better idea of how long it takes each action execution under the policy to complete.

They can take 5 - 7 minutes to complete. What I am asking is: will the workflows that cannot run right away stay in delayed? Is there a “delayed too long, give up” timeout, or will they just stay enqueued until they are consumed?

There is no timeout specifically for workflow executions. An execution can sit in the queue until resources are freed up to process it.

You are right, I forgot about the scheduler. There’s a scheduler sleep_interval option (in seconds) that you can set in the st2.conf file. It defaults to 0.1 seconds. You can try changing it to a value that is easier on the system. Note that this will also delay normal scheduling by up to the sleep interval.
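
As a rough sketch of that setting, the fragment below assumes the option lives in the `[scheduler]` section of st2.conf. The value 1.0 is purely illustrative, not a tuned recommendation, and the scheduler service would presumably need a restart to pick it up.

```ini
[scheduler]
# Seconds the scheduler sleeps between loop iterations (default 0.1 per the reply above).
# 1.0 is an illustrative assumption; a larger value reduces load but delays
# all scheduling by up to that interval.
sleep_interval = 1.0
```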

OK, thanks for explaining how that works, and for confirming that there is no timeout for delayed executions.

As for the scheduler, yeah. That seems to imply trading some scheduling performance for the ability to queue up requests, which would drive us towards Option B with a longer polling interval…

Thank you for helping us think through this.