Performance Optimization for workflows

Yes, I know we should be moving to Orquesta, but until it reaches GA I am stuck with Mistral workflows.

We have a large, complex Mistral workflow that calls other workflows, about 15 tasks in total. It connects to network devices, queries large data sets over GraphQL, connects to routers with Netmiko and NAPALM to run queries, and writes the results to Elasticsearch.

Inside the main workflow we gather the list of devices, loop through the set of routers (anywhere from several dozen to 1,000), and execute a subworkflow for each router, which performs the query described above and writes the results to Elasticsearch.
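
For reference, the fan-out is essentially Mistral's `with-items` pattern with a task-level `concurrency` cap. A minimal sketch along these lines (the pack, workflow, and action names `network.device_audit` / `network.query_device` are placeholders, not our real ones):

```yaml
---
version: '2.0'

network.device_audit:
  description: Fan out a per-device query subworkflow over a list of routers
  type: direct
  input:
    - devices
  tasks:
    query_devices:
      # One subworkflow execution per router in the list
      with-items: device in <% $.devices %>
      action: network.query_device
      input:
        device: <% $.device %>
      # Limit how many of those executions run at the same time
      concurrency: 10
```

The `concurrency` attribute only limits how many items of this one task run at once; it does not throttle executions of the action started from elsewhere, which is why we also added the policy mentioned below.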

Things get buried after about 50 devices and lock up pretty hard. We implemented a policy to throttle the behavior and that is working better, but I am wondering whether our expectations are in line with reality. We have a very large national network with 100K devices; right now it takes about 10-15 minutes to process 100 devices, and I would like to run many more in parallel.
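
The throttle is a StackStorm `action.concurrency` policy along these lines (the action name and threshold are placeholders for our real values):

```yaml
---
name: network.query_device.concurrency
description: Cap concurrent executions of the per-device subworkflow
enabled: true
policy_type: action.concurrency
resource_ref: network.query_device
parameters:
  action: delay      # queue executions over the threshold instead of cancelling them
  threshold: 25
```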

We are running in K8s at the moment, but are load testing on a single node to reproduce the various errors we are seeing.

I am wondering how to scale our Mistral deployment a bit more in production. We are seeing the Mistral API server process eat up the CPU. When I look at the process I see:

/opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 2 mistral.api.wsgi --graceful-timeout 10

It appears I can start more workers/threads for gunicorn… anyway, any suggestions for performance optimization would be appreciated. I have already turned logging down to ERROR and done various and sundry other things. Besides the policy, not much is making any difference.
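
For the record, the only knob I have found on that front so far is the gunicorn worker/thread count itself, wherever the mistral-api service defines that command line. Something like the following; the specific worker/thread values here are just guesses, not recommendations:

```
# Same invocation as above, with more gunicorn worker processes and threads
/opt/stackstorm/mistral/bin/gunicorn \
    --log-file /var/log/mistral/mistral-api.log \
    -b 127.0.0.1:8989 \
    -w 4 \
    --threads 2 \
    mistral.api.wsgi --graceful-timeout 10
```

Note that this only scales the Mistral API front end; the engine and executor run as separate Mistral services, so CPU burned there would need to be addressed separately.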

Let’s continue this discussion in Slack. We will need some more context from you about your environment and workflow.