Anyone installed StackStorm-HA successfully?

(Hnanchahal) #1

I have been trying to install stackstorm-ha on our Kubernetes cluster, but the installation never succeeds. Most of the pods are in the CrashLoopBackOff state and time out after multiple retries once the backoff limit is exhausted.

Any pointers?

(Lindsay Hill) #2

Try going through this thread - there are some debugging suggestions in there: StackStorm with HA on Kubernetes

(Lindsay Hill) #3

This recent issue - and the soon-to-be-merged PR - might also be relevant: st2 pods stuck in loop · Issue #57 · StackStorm/stackstorm-ha · GitHub

(Tomaz Muraus) #4

Yeah, it’s likely the same issue as in st2 pods stuck in loop · Issue #57 · StackStorm/stackstorm-ha · GitHub.

That PR has been merged, so if you redeploy the cluster with the latest version of StackStorm, it should work.
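
As a rough sketch only - the exact steps depend on how you installed the chart, and the release name below is a placeholder - updating a local checkout of stackstorm-ha and upgrading the existing release would look something like:

git pull                        # update the local stackstorm-ha checkout
helm dependency update          # refresh chart dependencies (mongodb-replicaset, rabbitmq-ha, etcd)
helm upgrade <release-name> .   # roll the running release onto the latest chart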

(Hnanchahal) #5

The initial installation was failing because RabbitMQ and MongoDB were not able to get any persistent storage. Looking at the documentation, I am not able to figure out how to provide the persistent storage claims to the Helm installation. Can someone provide me with the overrides?

(Eugen C.) #6

@hnanchahal While we’re happy to assist with any stackstorm-ha-focused questions related to our charts, we can’t help with general Kubernetes cluster setup and configuration, as that’s part of the initial prerequisites.

Your issue is similar to StackStorm with HA on Kubernetes - persistentVolumeClaims configuration issue
There are some links in there to Stack Overflow and the Kubernetes GitHub that may help you investigate and configure persistent storage in K8s. But it’s usually more involved, depending on your cloud/platform provider.
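
Just as a very rough, provider-agnostic illustration (the resource names and path below are made up, and a hostPath volume is only good for single-node testing), checking what storage exists and creating a simple PersistentVolume could look like:

kubectl get storageclass
kubectl get persistentvolumes

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv               # made-up name, for testing only
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual       # claims must request this class to bind here
  hostPath:
    path: /mnt/example-data      # node-local path, not suitable for real HA
EOF

On a real cluster you’d normally rely on your provider’s dynamic provisioner / default StorageClass instead of hand-made PVs.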

(Hnanchahal) #7

Thanks for the response. I was able to get K8s configured and the persistent claims set up. My question was about how to run Helm and use my persistent claims. For example, I was able to run the MongoDB Helm chart charts/stable/mongodb-replicaset at master · helm/charts · GitHub by using the override

helm install --name test-release stable/mongodb --set persistence.existingClaim=persistent-kube-storage-claim

My question is about how to pass the claim name to the stackstorm-ha installation so that it succeeds.
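
For reference, the same kind of override can also be passed as a values file instead of --set (the file name here is just one I picked):

# hypothetical values file carrying the claim override
cat > mongodb-values.yaml <<EOF
persistence:
  existingClaim: persistent-kube-storage-claim
EOF

helm install --name test-release stable/mongodb -f mongodb-values.yaml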

(Eugen C.) #8

@hnanchahal Got it now.

Looking deeper at the list of supported values for the mongodb-replicaset Helm chart (charts/values.yaml at master · helm/charts · GitHub), you can’t point it at specific, already-created, manually managed volumeClaim(s): they are all created automatically by the K8s StatefulSet and (from what I understand) are different/unique for every replica in HA mode, depending on which node each pod was deployed to.

You’re probably confusing it with the mongodb Helm chart, which indeed has such an option, persistence.existingClaim, but it only works for a single-node deployment: charts/values.yaml at master · helm/charts · GitHub

In stackstorm-ha we’re using the more HA-friendly mongodb-replicaset version, not the mongodb Helm chart.

So to recap: in a normal multi-node HA situation you don’t want to pass a specific claim to the Helm chart. Instead, configure K8s storage/volumes, and the Helm chart will create the volumeClaims itself as part of the StatefulSet volumeClaimTemplates:

$ kubectl get persistentVolumeClaims -l release=worn-fox
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-worn-fox-rabbitmq-ha-0     Bound    pvc-dc278777-3b91-11e9-bd33-080027e5ee5c   8Gi        RWO            standard       26d
data-worn-fox-rabbitmq-ha-1     Bound    pvc-6a205a73-3b92-11e9-bd33-080027e5ee5c   8Gi        RWO            standard       26d
data-worn-fox-rabbitmq-ha-2     Bound    pvc-df51368b-3b92-11e9-bd33-080027e5ee5c   8Gi        RWO            standard       26d
datadir-worn-fox-mongodb-ha-0   Bound    pvc-dc226509-3b91-11e9-bd33-080027e5ee5c   10Gi       RWO            standard       26d
datadir-worn-fox-mongodb-ha-1   Bound    pvc-3580e6fa-3b92-11e9-bd33-080027e5ee5c   10Gi       RWO            standard       26d
datadir-worn-fox-mongodb-ha-2   Bound    pvc-aed5dc24-3b92-11e9-bd33-080027e5ee5c   10Gi       RWO            standard       26d
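
For example, you can see the template that produced those claims in the StatefulSet itself; an abbreviated excerpt looks roughly like this (exact fields vary by chart version):

$ kubectl get statefulset worn-fox-mongodb-ha -o yaml
...
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
...

Each replica then gets its own claim named <template name>-<pod name>, e.g. datadir-worn-fox-mongodb-ha-0 above.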

On a related note, given what you’re trying to do, I think you’ll want to dig into these: https://stackoverflow.com/questions/46442238/can-i-rely-on-volumeclaimtemplates-naming-convention and https://docs.okd.io/latest/install_config/persistent_storage/selector_label_binding.html and charts/mongodb-statefulset.yaml at ececcab3d7dc8d49c3760223587921abf1e67061 · helm/charts · GitHub

Hope that gives some more pointers and ideas.

(Hnanchahal) #9

Thanks armab, that helped. The problem in my case was that I did not have any default StorageClass set. I am past that issue now and can see the file shares created and PVCs generated for RabbitMQ and MongoDB.

NAME                                  STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
azurefile                             Bound     pvc-8a93eecb-50cf-11e9-a447-000d3a35cb71   10Gi       RWX            azurefile      38m
data-happy-scorpion-rabbitmq-ha-0     Bound     pvc-a4d62f14-50d3-11e9-a447-000d3a35cb71   8Gi        RWO            azurefile      9m30s
datadir-happy-scorpion-mongodb-ha-0   Bound     pvc-a4b3289e-50d3-11e9-a447-000d3a35cb71   10Gi       RWO            azurefile      9m30s
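
In case it helps anyone else hitting the same thing, marking a StorageClass as the cluster default can be done with something like this (azurefile is the class in my cluster; yours will differ):

kubectl patch storageclass azurefile -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'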

I am facing other roadblocks now. Here is the event dump from the RabbitMQ pod: the container started, but then failed.

 Normal   Pulling           8m43s                   kubelet, kube-worker-1  pulling image "busybox:latest"
 Normal   Pulled            8m39s                   kubelet, kube-worker-1  Successfully pulled image "busybox:latest"
 Normal   Created           8m38s                   kubelet, kube-worker-1  Created container
 Normal   Started           8m38s                   kubelet, kube-worker-1  Started container
 Normal   Pulling           8m38s                   kubelet, kube-worker-1  pulling image "rabbitmq:3.7-alpine"
 Normal   Pulled            8m27s                   kubelet, kube-worker-1  Successfully pulled image "rabbitmq:3.7-alpine"
 Normal   Pulled            7m27s (x3 over 8m19s)   kubelet, kube-worker-1  Container image "rabbitmq:3.7-alpine" already present on machine
 Normal   Created           7m26s (x4 over 8m21s)   kubelet, kube-worker-1  Created container
 Normal   Started           7m26s (x4 over 8m21s)   kubelet, kube-worker-1  Started container
 Warning  BackOff           3m39s (x26 over 8m17s)  kubelet, kube-worker-1  Back-off restarting failed container

Here is the listing of all my pods:

NAME                                                  READY   STATUS             RESTARTS   AGE
happy-scorpion-etcd-0                                 1/1     Running            0          11m
happy-scorpion-etcd-1                                 1/1     Running            0          11m
happy-scorpion-etcd-2                                 1/1     Running            0          11m
happy-scorpion-job-st2-apikey-load-flxrz              0/1     Completed          0          11m
happy-scorpion-job-st2-key-load-v6p28                 0/1     Completed          0          11m
happy-scorpion-mongodb-ha-0                           0/1     Init:2/3           0          11m
happy-scorpion-rabbitmq-ha-0                          0/1     Error              7          11m
happy-scorpion-st2actionrunner-d75665c65-7dzl7        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2actionrunner-d75665c65-9tgmc        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2actionrunner-d75665c65-jcgzx        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2actionrunner-d75665c65-llvqt        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2actionrunner-d75665c65-pc9zp        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2api-b7ccf88df-cmp8b                 0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2api-b7ccf88df-qxnhs                 0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2auth-68d498d575-ctkkx               0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2auth-68d498d575-sntn9               0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2client-7b86568969-9kg96             1/1     Running            0          11m
happy-scorpion-st2garbagecollector-67b658c867-jx8kc   0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2notifier-68fdcc698c-9m897           0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2notifier-68fdcc698c-h294l           0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2rulesengine-6976c5cb56-nqmpf        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2rulesengine-6976c5cb56-rt2l9        0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2scheduler-7f455d6dc6-hnl9q          0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2scheduler-7f455d6dc6-zkgqv          0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2sensorcontainer-bd6487d84-npz7s     0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2stream-64746cddb4-kfxv5             0/1     CrashLoopBackOff   6          11m
happy-scorpion-st2stream-64746cddb4-psmtw             0/1     CrashLoopBackOff   6          11m

(Eugen C.) #10

You definitely need to check the logs of the failing RabbitMQ pod to see what’s going on there and what error message is behind the MQ failure.

kubectl logs happy-scorpion-rabbitmq-ha-0
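
If the container keeps restarting, the logs of the previous (crashed) instance and the pod events are also worth grabbing:

kubectl logs happy-scorpion-rabbitmq-ha-0 --previous
kubectl describe pod happy-scorpion-rabbitmq-ha-0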

BTW, did the MongoDB cluster start up eventually, or is it still failing too?

(Warren) #11

Hello. Without knowing the error from the RabbitMQ pod, I’ll hazard a guess. I’ve seen this kind of behavior when there’s a pre-existing PVC from a prior instance of that pod. The PVC persists even after running helm delete <name> --purge.

Run kubectl delete pvc <pvc-name> and then run helm install again. If RabbitMQ starts successfully, then I suspect an issue with the stale PVC. This is how I personally work around this issue until I have a chance to look into the real solution.
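
For example, with the release name from your listing (double-check which PVCs you’re about to remove with kubectl get pvc first), the workaround sketch would be:

helm delete happy-scorpion --purge
kubectl get pvc -l release=happy-scorpion       # review what was left behind
kubectl delete pvc -l release=happy-scorpion    # remove the stale claims
# ...then re-run your original helm install command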

Please let me know if this helps.