Help tuning StackStorm performance

I'm currently working on a POC for using StackStorm. I went through the sample st2.conf, used the Docker container option for deployment, and ran performance tests. Regardless of what configuration changes I make for the action runner, scheduler, or workflow runner, I don't see any improvement in concurrency. I'm getting pretty bad performance for the st2_get and st2_set APIs: about 2 seconds for one user and 16 to 17 seconds for 5 concurrent requests. It's difficult to tell whether the services actually pick up the config updates. With the default configuration, I was expecting StackStorm to perform well for at least 40 concurrent requests.

Any guidance on how to go about config changes besides st2.conf, and on how to see whether the services are actually using the configuration? It's possible I am missing something here.
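One rough way to at least see what is on disk is to parse the config file inside each container and print the sections of interest. This is only a sketch: it shows what the file contains, not what a running process has loaded, and the section names passed in below are examples rather than a verified list of st2 options.

```python
import configparser


def dump_sections(path, sections):
    """Return {section: {option: value}} for the requested sections of an INI file."""
    # strict=False tolerates duplicate options; a sentinel default_section
    # keeps [DEFAULT] values from being merged into every section.
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, default_section="__unused__"
    )
    if not parser.read(path):
        raise FileNotFoundError(path)
    return {
        name: dict(parser[name]) for name in sections if parser.has_section(name)
    }


if __name__ == "__main__":
    # Section names here are illustrative assumptions; adjust to your file.
    wanted = ["actionrunner", "scheduler", "workflow_engine"]
    try:
        for section, options in dump_sections("/etc/st2/st2.conf", wanted).items():
            print(f"[{section}]")
            for key, value in options.items():
                print(f"  {key} = {value}")
    except FileNotFoundError as missing:
        print(f"config file not found: {missing}")
```

Running this in each container (action runner, scheduler, and so on) would at least confirm whether the edited file is the one mounted there.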

To start with, I use the st2_set and st2_get APIs, then increase the concurrency from 1 to 5 and observe the response time in the web UI console. Even with 1 request, it takes 2 seconds, which is very bad. I am sure these API calls use MongoDB, but even then 2 seconds is not acceptable.
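For anyone wanting to reproduce the ramp from 1 to 5 concurrent requests outside the web UI, a minimal timing harness could look like the sketch below. The `fake_request` function is a hypothetical placeholder that just sleeps; it would need to be swapped for whatever call actually triggers st2_get or st2_set in your setup.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(call):
    """Run one request and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def measure(call, concurrency, total_requests):
    """Issue total_requests calls across `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(call), range(total_requests))
        )
    return {
        "concurrency": concurrency,
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


def fake_request():
    # Hypothetical stand-in: replace with the real request under test.
    time.sleep(0.01)


if __name__ == "__main__":
    for level in (1, 2, 5):
        print(measure(fake_request, concurrency=level, total_requests=20))
```

Comparing the median and max at each concurrency level makes it easier to tell a uniform slowdown from a few requests stalling behind something.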

Any help is appreciated.

This doesn’t look normal. On the CI/CD server we use, request/response times are sub-second even in the worst case. For reference, that server runs on a c5.2xlarge AWS instance (8 vCPUs, 16GB RAM).

What hardware resources are you relying on? What platform are you using: OS, environment, cloud, instance size, etc.? Are you using an HA configuration to scale out and distribute the load? (High Availability Deployment — StackStorm 3.1.0 documentation)

See recommended production requirements in terms of machine resources: System Requirements — StackStorm 3.1.0 documentation

Diagnose disk I/O, CPU usage, memory pressure, and so on to understand where the bottleneck is and what could be scaled out. Besides that, how are the MongoDB and RabbitMQ clusters configured, and how do they perform? Did you monitor or instrument them? They’re the most important moving parts st2 relies on; if they’re slow, everything is slow.

Thanks. To rule out the prod environment as the cause of the perf issue, I created a VirtualBox VM with the following configuration:

12GB RAM, Intel i5 processor, configured to use all CPUs, Windows 7, 64-bit
docker-compose.yaml for deploying ST2 containers
Ubuntu 16.04.6 LTS

Testing

Used the st2_get and st2_set actions for testing
With one request, I still get a 1-second response
As the load ramps up to 5 concurrent users, there are sporadic 1-, 2-, and 3-second responses

The containers are running with the default configuration, without any changes.

I did some more testing. It turns out that when concurrency increases, all the requests seem to get stuck on something (a lock?), and then everything clears up at the same time. I also noticed that executions take a long time to complete: there are delays between the scheduler and the action runner picking up the action, and the action runner takes most of the time. Any insight is appreciated. I can upload logs and other configs if needed.
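To put numbers on where the time goes, one option is to compute how long an execution spends in each status. The sketch below assumes each execution record exposes a list of status transitions with timestamps; the field names (`status`, `timestamp`), the timestamp format, and the sample values are all illustrative assumptions, not output from this setup.

```python
from datetime import datetime

# Assumed timestamp format; adjust to match what your st2 version returns.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def stage_durations(log_entries):
    """Return (status, seconds spent in that status) for consecutive entries."""
    parsed = [
        (entry["status"], datetime.strptime(entry["timestamp"], TIMESTAMP_FORMAT))
        for entry in log_entries
    ]
    durations = []
    # Pair each transition with the next one; the final status has no duration.
    for (status, start), (_, end) in zip(parsed, parsed[1:]):
        durations.append((status, (end - start).total_seconds()))
    return durations


# Made-up sample transitions, for illustration only.
sample_log = [
    {"status": "requested", "timestamp": "2019-10-01T12:00:00.000000Z"},
    {"status": "scheduled", "timestamp": "2019-10-01T12:00:00.400000Z"},
    {"status": "running", "timestamp": "2019-10-01T12:00:00.500000Z"},
    {"status": "succeeded", "timestamp": "2019-10-01T12:00:02.000000Z"},
]

if __name__ == "__main__":
    for status, seconds in stage_durations(sample_log):
        print(f"{status}: {seconds:.3f}s")
```

A long gap before the running status would point at scheduling or queueing, while a long running stage would point at the action runner itself.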

@smurugu Welcome! Glad to have you here!

Are you able to performance test on a system closer to prod? My guess is that with Docker in VirtualBox on Windows you have a few layers of unintentional bottlenecks that are going to be hard to diagnose or actually resolve. I’d recommend trying this directly on a Linux box, maybe in EC2, so you can remove the hypervisor-on-Windows layer and see if you get significantly better results (you should).

Whenever I test any software, I do functional testing with a setup similar to yours (but on a Mac, where I can run Linux Docker natively), then performance testing on a setup that is closer to what production will be. In addition, VirtualBox is definitely not known for the kind of close-to-bare-metal performance that VMware Workstation offers.

My 2 cents. YMMV.

It sounds like there is a deadlock situation with st2_set and st2_get when these calls are made concurrently. I think the issue persists regardless of the environment. I am doing more testing to confirm the behavior with other scenarios, and I will update the thread by next week.


@smurugu roger that. Let us know what you find.

We use an HA configuration, initially with 4 vCPUs and later increased to 16 vCPUs. The configuration is as follows:

4 nodes, 16cpu, 16gb with the following configuration
6 mongodb-ha
2 st2 web
4 rules engines
12 st2-st2api
1 st2-api
9 rabbitmq
30 schedulers
12 workflow engines
30 action runners

Simple st2_get and st2_set calls take 2 seconds, and when concurrency increases the response times are all over the place. Besides st2_set and st2_get, I tried echo actions as well, and they still take 2 seconds. When I increase the concurrency, everything goes out of bounds.