HA failover with st2sensorcontainer and sensor partitioning

(Daniel Jay Haskin) #1

In a similar vein to this question: in an HA setup, how would you tell the st2sensorcontainer to fail over from the active node to a passive one when the active node goes down? Or, say I partition my sensors using hash, and then one of my nodes goes down. How can I make sure the sensors all still run?

(Eugen C.) #2

That’s an interesting topic.

At the moment, sensor containers can’t work in active/passive mode.
On top of that, the multiple sensors forked inside the st2sensorcontainer service don’t recover after multiple consecutive failures, which can be especially painful.
So while there is Sensors Partitioning, there is no real Sensors HA guarantee: you can’t run 2 replicas. See: High Availability Deployment — StackStorm 2.9dev documentation

“Or, say I partition my sensors using hash, then one of my nodes goes down.”

Yeah, this is a good, convenient, and user-friendly option that I’d like too: sensors partitioned via a hash ring with HA capabilities (Partitioning Sensors — StackStorm 2.8.1 documentation). It’s something we’ve discussed internally recently as a potential area for improvement.
Sadly, according to @kami and @lakstorm, it’s pretty complicated to do in the StackStorm platform and would require leader election, gossip between sensor nodes, and some event de-duplication.
We don’t have a solution for that yet, but we have ideas for improving it in future releases.

As an alternative, and a pretty simple, practical improvement, we recently added a “single sensor per process” mode in v2.8: https://github.com/StackStorm/st2/pull/4179. It works like this:

exec /opt/stackstorm/st2/bin/st2sensorcontainer \
  --config-file /etc/st2/st2.conf \
  --single-sensor-mode \
  --sensor-ref <pack>.<SensorClassName>

It is somewhat related to file-based sensor partitioning.
This at least solves the problem of controlling what you want to run and where, for load distribution and partitioning. It also lets a monitoring layer on the user’s side recover or restart specific sensors on failure, which was harder before because st2sensorcontainer forks the individual sensors itself (you had to restart the entire service).
For example, with one sensor per container, Kubernetes can be that healing/failover/rescheduling layer: when the sensor goes down, K8s recovers the failed Pod automatically.
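To make the "restart one specific sensor" idea concrete, here is a minimal sketch of a supervisor loop in Python. It only illustrates the role that Kubernetes (or any process manager) would normally play. The helper names `sensor_command` and `run_with_restarts` are made up for this example, and the sensor reference you pass in is whatever `pack.SensorClass` you actually run.

```python
import subprocess
import sys
import time


def sensor_command(sensor_ref, config="/etc/st2/st2.conf"):
    """Build the argv for running one sensor in single-sensor mode."""
    return [
        "/opt/stackstorm/st2/bin/st2sensorcontainer",
        "--config-file", config,
        "--single-sensor-mode",
        "--sensor-ref", sensor_ref,
    ]


def run_with_restarts(argv, max_restarts=3, delay=1.0):
    """Run argv and restart it each time it exits with a non-zero code.

    Stops on a clean exit or once max_restarts is reached.
    Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(argv)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)  # simple fixed pause before the next attempt
```

Something like `run_with_restarts(sensor_command("my_pack.MySensor"))` would then keep that one sensor alive without touching any other sensor process. A real deployment would more likely leave this to Kubernetes, systemd, or a similar layer.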

That’s the option I would suggest for better-than-default availability of st2 sensors.

BTW, StackStorm v2.9 and v3.0+ will be a lot about HA improvements, and we’re looking to enhance the Sensors HA story in the future once we gather more feedback like this, so your further input and ideas would be very helpful :+1:

(Daniel Jay Haskin) #3

Well, thinking about it further, there are really two types of sensors I’m interested in: webhooks and polling sensors.

Say I have two VMs load balanced in an active/active setup with no partitioning; that is, both nodes run all sensors.

If we are talking about webhooks, I am fine: a request to the webhooks will come in, get load balanced to one of the nodes, and that node’s webhook sensor will fire a single trigger.

It’s with polling sensors that the issue arises. However, if all the polling sensors are written in house, they can be made HA-aware pretty easily by using StackStorm’s key-value store as a place to put mutexes/locks. Both sensors would try to obtain a lock. One wins the lock and conducts the poll. The other doesn’t and goes back to sleep.
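A rough sketch of that "one wins, the other sleeps" logic, in Python. The `InMemoryLeaseStore` class is a stand-in invented for this example, not a StackStorm API. It assumes the shared store can take or renew a lease with a TTL in one atomic step. Whether the real key-value store can do that is exactly the part that would need checking.

```python
import time


class InMemoryLeaseStore:
    """Hypothetical stand-in for a shared store with set-if-absent plus TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl):
        """Take or renew the lease; return True if `owner` holds it afterwards.

        A real shared store must perform this check-and-set atomically.
        """
        now = self._clock()
        holder = self._leases.get(key)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False  # another node holds an unexpired lease
        self._leases[key] = (owner, now + ttl)
        return True


def poll_once(store, node_id, poll, ttl=30, key="my_pack.my_sensor.lock"):
    """Run `poll` only if this node wins (or already holds) the lease."""
    if not store.try_acquire(key, node_id, ttl):
        return None  # the other node is polling; go back to sleep
    return poll()
```

The TTL is what gives failover here: if the active node dies, it stops renewing, the lease expires, and the other node wins on its next attempt. The TTL should be longer than the poll interval so the active node does not lose the lease between polls.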

As all of the polling sensors I will use will likely be written in house, maybe it would be best to simply run the st2sensorcontainer on both nodes with no partitioning in an active/active setup. Would this work? Does my argument make sense?

(Eugen C.) #4

Yes, absolutely.
If you have the resources to build an HA/failover implementation into every sensor itself, relying on some distributed locking primitives, that would be a perfect approach.

And what you’re saying is exactly what our recommendations state in High Availability Deployment — StackStorm 2.9dev documentation:

Currently st2sensorcontainer processes do not form a cluster and distribute work or take over new work if some nodes in the cluster disappear. It is possible for a sensor itself to be implemented with HA in mind so that the same sensor can be deployed on multiple nodes with the sensor managing active-active or active-passive.

Of course, not all of our users are able to implement something like that, so we’ll need some “in-house” improvements to make sensors HA easier in the future.

(Daniel Jay Haskin) #5

Well I guess I’ll mark this issue solved after this then, but I wanted to leave some parting thoughts.

For our setup at least, sensor partitioning isn’t the best option because we have an active/active HA cluster, and I want webhooks to be available from either node.

However, for polling sensors, I think the wish I’d want granted is that StackStorm obtains a lock from ZooKeeper or Redis for me in the PollingSensor class (using the tooz library, as it already does for other problems). In fact, the core idea, that of giving me a sensor class to work with that already obtains a lock, is one which is totally useful and backwards compatible.

Say, for example, that there were LockingSensor and PollingLockingSensor classes. Both sensors would start up, obtain a lock, and continue to renew the lock (perhaps in the case of the PollingLockingSensor). That would help me a lot because the lock needs to be obtained and held for the life of the sensor, and having a sensor class that I could just override, and that does this for me using current StackStorm infrastructure, would really help. Perhaps I’ll log a feature request :slight_smile:
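To sketch what I mean, here is a hypothetical `PollingLockingSensor` in plain Python. It is not an existing StackStorm class and deliberately does not inherit from the real `PollingSensor`. It only assumes a lock object with `acquire(blocking=False)` and `release()`. A `threading.Lock` fits that shape for a local demo, and I would expect a tooz lock to be usable in a similar way, though the coordinator setup and the heartbeat that keeps the lock alive are left out here.

```python
import threading


class PollingLockingSensor:
    """Hypothetical base class: poll only while holding a shared lock."""

    def __init__(self, lock):
        # `lock` must offer acquire(blocking=False) -> bool and release().
        self._lock = lock
        self._held = False

    def poll(self):
        """Override in subclasses with the real polling logic."""
        raise NotImplementedError

    def run_once(self):
        """Acquire the lock if not yet held, then poll; otherwise skip."""
        if not self._held:
            self._held = self._lock.acquire(blocking=False)
        if not self._held:
            return None  # another instance is the active one
        return self.poll()

    def cleanup(self):
        """Release the lock so a standby instance can take over."""
        if self._held:
            self._lock.release()
            self._held = False
```

A subclass would only override `poll()`. The instance that gets the lock first keeps it for its whole life, and the standby instance keeps calling `run_once()` and getting nothing back until the active one releases the lock or, with a distributed lock, dies and lets it expire.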

(Eugen C.) #6

@djhaskin987 Please do log the feature request; that idea with LockingSensor and PollingLockingSensor is really good.

(Daniel Jay Haskin) #7

Done: Feature Request: `LockingPollingSensor` to make HA aware polling sensors · Issue #4301 · StackStorm/st2 · GitHub