Yes, I know we should be moving to Orquesta, but until it reaches GA I am stuck with Mistral workflows.
We have a large, complex Mistral workflow that calls other workflows, about 15 tasks in total. It connects to network devices, queries large data sets with GraphQL, connects to routers with Netmiko and NAPALM to issue queries, and writes the results to Elasticsearch.
Inside the main workflow we gather the list of devices, loop through the set of routers (anywhere from several dozen to 1,000), and execute a subworkflow for each router, which runs the queries described above and writes to Elasticsearch. A rough sketch of the pattern is below.
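For reference, the main workflow is structured roughly like this. The pack, action, and workflow names are simplified placeholders rather than our actual definitions, and the `concurrency` value is just an example of the knob we are experimenting with:

```yaml
version: '2.0'

query_all_devices:
  description: Hypothetical sketch of the fan-out pattern described above.
  tasks:
    get_devices:
      # placeholder action that returns the list of routers to process
      action: my_pack.get_device_list
      publish:
        devices: <% task(get_devices).result %>
      on-success:
        - query_each_device

    query_each_device:
      # run the per-router subworkflow once per device;
      # "concurrency" caps how many run at the same time
      with-items: device in <% $.devices %>
      workflow: my_pack.query_device_subworkflow device=<% $.device %>
      concurrency: 20
```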
We get buried after about 50 devices and things lock up pretty hard. We implemented a concurrency policy to throttle the behavior and that is working better (example below). I am wondering if our expectations are in line with reality: we have a very large national network with 100K devices, it currently takes about 10-15 minutes to process 100 devices, and I would like to run many more in parallel.
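The throttle we added is an action.concurrency policy along these lines. The names and threshold here are illustrative, not our exact values:

```yaml
# policies/query_device_concurrency.yaml (hypothetical names)
name: my_pack.query_device_subworkflow.concurrency
description: Limit how many per-router subworkflow executions run at once.
enabled: true
resource_ref: my_pack.query_device_subworkflow
policy_type: action.concurrency
parameters:
  action: delay        # queue new executions instead of cancelling them
  threshold: 20
```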
We are running in K8s at the moment, but we are load testing a single node to reproduce the various errors we are seeing.
I am wondering how to scale Mistral a bit more in production. We are seeing the Mistral server process eat up the CPU. When I look at the process I see:
```
/opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 2 mistral.api.wsgi --graceful-timeout 10
```
It appears I can start more workers/threads for gunicorn. Any suggestions for performance optimization would be appreciated. I have already turned logging down to ERROR and tried various and sundry other things; besides the policy, not much has made any difference.
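For concreteness, here is what I am planning to try next on the config side: raising the gunicorn `-w` worker count in the mistral-api service definition, and loosening the database pool and RPC timeout in /etc/mistral/mistral.conf. The values below are guesses for our load testing, not recommendations:

```ini
# /etc/mistral/mistral.conf -- illustrative values, not recommendations
[database]
# raise the SQLAlchemy pool if many concurrent task executions
# are waiting on DB connections
max_pool_size = 50
max_overflow = 100

[DEFAULT]
# give RPC calls more headroom when the engine/executor are busy
rpc_response_timeout = 120
```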