CentOS 8: problems with RabbitMQ connections when sensor dispatches trigger

I have the following setup on CentOS 7 which works fine:

  • Single node install of StackStorm 3.2.0
  • Ghost2Logger pack installed
  • Rule setup for syslog entry
    When a syslog entry that matches comes in from my remote system, the Ghost2Watcher sensor matches it and the action from my rule is run

On CentOS 8, with the same setup, the action for the rule is never run. The sensor code still calls sensor_service.dispatch, but no action is taken.
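For context, the dispatch itself is just the standard StackStorm sensor API call. A minimal sketch of the kind of thing the pack's sensor does is below; the class is illustrative, only the trigger name and payload fields are taken from my logs:

    from st2reactor.sensor.base import Sensor

    # Illustrative sensor - shows the dispatch call, not the real ghost2logger code.
    class SyslogMatchSensor(Sensor):

        def setup(self):
            pass

        def run(self):
            # The real sensor blocks waiting for syslog events; when a matching
            # line arrives it hands the event to StackStorm. dispatch() publishes
            # the trigger instance onto RabbitMQ for the rules engine to evaluate.
            payload = {
                'host': '10.0.5.52',
                'hostpattern': '10.0.5.52',
                'message': '<13>Jul 6 13:11:03 vyos commit: Successful change to '
                           'active configuration by user vyos on /dev/pts/0',
                'pattern': 'Successful change to active configuration',
            }
            self.sensor_service.dispatch(
                trigger='ghost2logger.pattern_match', payload=payload)

        def cleanup(self):
            pass

        def add_trigger(self, trigger):
            pass

        def update_trigger(self, trigger):
            pass

        def remove_trigger(self, trigger):
            pass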

In the st2sensorcontainer.log then I can see that the sensor has matched:

2020-07-06 13:10:52,921 140605464730664 INFO (unknown file) [-] [Ghost2logger]: Received Matched Syslog Event
2020-07-06 13:10:52,922 140605464730664 INFO (unknown file) [-] {'host': '10.0.5.52', 'hostpattern': '10.0.5.52', 'message': '<13>Jul 6 13:11:03 vyos commit: Successful change to active configuration by user vyos on /dev/pts/0', 'pattern': 'Successful change to active configuration'}

However, at the time of the problem I see errors in the RabbitMQ logs:

2020-07-06 13:10:52.920 [debug] <0.12744.0> Supervisor {<0.12744.0>,rabbit_connection_sup} started rabbit_connection_helper_sup:start_link() at pid <0.12745.0>
2020-07-06 13:10:52.920 [debug] <0.12744.0> Supervisor {<0.12744.0>,rabbit_connection_sup} started rabbit_reader:start_link(<0.12745.0>, {acceptor,{127,0,0,1},5672}) at pid <0.12746.0>
2020-07-06 13:10:52.925 [info] <0.12746.0> accepting AMQP connection <0.12746.0> (127.0.0.1:36080 -> 127.0.0.1:5672)
2020-07-06 13:11:00.805 [debug] <0.434.0> Asked to [re-]register this node (rabbit@centos-8) with epmd...
2020-07-06 13:11:01.051 [debug] <0.434.0> [Re-]registered this node (rabbit@centos-8) with epmd at port 25672
2020-07-06 13:11:12.922 [error] <0.12746.0> closing AMQP connection <0.12746.0> (127.0.0.1:36080 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}

I have RabbitMQ logging at debug level and have increased the handshake timeout.
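Roughly, the rabbitmq.conf changes look like this (the exact timeout value here is just illustrative):

    # /etc/rabbitmq/rabbitmq.conf
    log.file.level = debug
    # default is 10000 ms; raised while chasing the handshake_timeout errors
    handshake_timeout = 20000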

From a TCP dump I can see that the AMQP exchange stops after:

  • port 5672 -> 36080: Connection.Start
  • port 36080 -> 5672: TCP ACK

The Connection.Start-Ok is never sent from port 36080 to 5672, so this seems to be the missing step.

I can see other RabbitMQ connections being established correctly, and I have an interval timer running correctly. I have monitored other RabbitMQ connections after that time and can see that the Connection.Start-Ok is sent correctly, but I’m not sure what service they are being established for.
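As a sanity check on the broker itself, the handshake can be exercised from the same box with the kombu/amqp libraries from the st2 virtualenv, outside of any StackStorm process. The guest URL below is an assumption; substitute the [messaging] url from st2.conf:

    # Run with /opt/stackstorm/st2/bin/python so the same kombu/amqp libs are used.
    # The guest:guest URL is an assumption - use the url from st2.conf [messaging].
    from kombu import Connection

    with Connection('amqp://guest:guest@127.0.0.1:5672//') as conn:
        conn.connect()  # drives Connection.Start / Start-Ok / Tune / Open
        print('connected:', conn.connected)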

I have run the sensorcontainer manually in debug as well, and could then see:

2020-07-06 11:54:51,180 139913310174248 INFO (unknown file) [-] [Ghost2logger]: Received Matched Syslog Event
2020-07-06 11:54:51,181 139913310174248 INFO (unknown file) [-] {'host': '10.0.5.52', 'hostpattern': '10.0.5.52', 'message': '<13>Jul  6 11:55:01 vyos commit: Successful change to active configuration by user vyos on /dev/pts/0', 'pattern': 'Successful change to active configuration'}
2020-07-06 11:54:51,182 139913310174248 DEBUG (unknown file) [-] [ghost2logger_sensor]: processing request {'host': '10.0.5.52', 'hostpattern': '10.0.5.52', 'message': '<13>Jul  6 11:55:01 vyos commit: Successful change to active configuration by user vyos on /dev/pts/0', 'pattern': 'Successful change to active configuration'}
2020-07-06 11:54:51,182 139913310174248 DEBUG (unknown file) [-] Added trace_context None to trigger ghost2logger.pattern_match.
2020-07-06 11:54:51,186 139913310173944 DEBUG (unknown file) [-] Found existing trigger: TriggerDB(description=None, id=5f02fe9ab0af60d39492799f, metadata_file=None, name="pattern_match", pack="ghost2logger", parameters={}, ref="ghost2logger.pattern_match", ref_count=0, type="ghost2logger.pattern_match", uid="trigger:ghost2logger:pattern_match:99914b932bd37a50b983c5e7c90ae93b") in db.
2020-07-06 11:54:51,187 139913310173944 DEBUG (unknown file) [-] Calling sensor "add_trigger" method (trigger.type=ghost2logger.pattern_match)

In /var/log/messages, when the second and subsequent matching triggers are sent, I see this message:

2020-07-06 12:01:50,402 139913310173792 DEBUG connection [-] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@centos-8', 'copyright': 'Copyright © 2007-2020 VMware, Inc. or its affiliates.', 'information': 'Licensed under the MPL 1.1. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 22.0.7', 'product': 'RabbitMQ', 'version': '3.8.5'}, mechanisms: [b'PLAIN', b'AMQPLAIN'], locales: ['en_US']
2020-07-06 12:01:50,404 139913310173792 WARNING mixins [-] Broker connection error, trying again in 2.0 seconds: ConnectionResetError(104, 'Connection reset by peer').
Traceback (most recent call last):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/utils/functional.py", line 343, in retry_over_time
    return fun(*args, **kwargs)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 283, in connect
    return self.connection
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 837, in connection
    self._connection = self._establish_connection()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 792, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    conn.connect()
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 317, in connect
    self.drain_events(timeout=self.connect_timeout)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
    while not self.blocking_read(timeout):
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
    return self.on_inbound_frame(frame)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
    method_sig, payload, content,
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 406, in _on_start
    login_response, self.locale),
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 59, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
    write(view[:offset])
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 284, in write
    self._write(s)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 403, in sendall
    tail = self.send(data, flags)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 397, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 384, in _send_loop
    return send_method(data, *args)
ConnectionResetError: [Errno 104] Connection reset by peer

The ConnectionResetError seems to happen on my second syslog match, so it appears that the second dispatch causes the preceding connection to be closed.

I have tried restarting RabbitMQ, restarting st2sensorcontainer, etc. It doesn't appear to be related to this particular pack, unless it is doing something wrong in how it dispatches the trigger. BUT everything works fine on CentOS 7 - though that is with an earlier version of RabbitMQ.

I have tried a few reinstalls on fresh instances, and get the same problem consistently on CentOS 8.1 and CentOS 8.2.

Interesting!

What if you connect CentOS8 ST2 to an external RabbitMQ running on CentOS7?

Wondering if it's RMQ that's at fault, or ST2 in the EL8 environment.

@armab That’s a good suggestion. I’ll give that a try. Should help to narrow it down.

I've been adding in extra debug today, and that just makes it even more confusing.

What seems to happen (and I've just put the extra-debug code onto CentOS 7 to compare) is that there seems to be some problem with the query.

So I have the following additional debug in st2common/persistence/base.py in the ContentPackResource:

def get_by_ref(cls, ref):
    LOG.debug("AMANDA CPR get by ref")
    if not ref:
        return None

    ref_obj = ResourceReference.from_string_reference(ref=ref)
    LOG.debug("AMANDA CPR calling query " + ref_obj.name + " pack " + ref_obj.pack)
    try:
        result = cls.query(name=ref_obj.name,
                           pack=ref_obj.pack).first()
        LOG.debug("AMANDA CPR called query")
        return result
    finally:
        LOG.debug("AMANDA Finally of CRP query")

On my CentOS7 - as expected - I get:

    2020-07-07 15:43:18,975 140096997274000 DEBUG base [-] AMANDA CPR calling query pattern_match pack ghost2logger
    2020-07-07 15:43:18,978 140096997274000 DEBUG base [-] AMANDA CPR called query
    2020-07-07 15:43:18,978 140096997274000 DEBUG base [-] AMANDA Finally of CRP query

On my CentOS8 system I get:

2020-07-07 15:43:23,043 139904407141720 DEBUG (unknown file) [-] AMANDA CPR calling query pattern_match pack ghost2logger

and then nothing more for that thread…

So I'm wondering if the RabbitMQ errors are a side effect, and it's actually a problem when it does the DB query.

I put MongoDB up to debug level 1 and can't see any errors, but I do see:

2020-07-07T15:43:23.046+0000 I COMMAND [conn102] command st2.trigger_type_d_b command: find { find: "trigger_type_d_b", filter: { name: "pattern_match", pack: "ghost2logger" }, limit: 1, singleBatch: true, lsid: { id: UUID("5a4f293e-1e8a-4ce3-a879-628a149f8e21") }, $db: "st2", $readPreference: { mode: "primaryPreferred" } } planSummary: IXSCAN { pack: 1, name: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:618 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 0ms

which looks ok.

I have MongoDB 3.4 on my CentOS 7 and MongoDB 4.0 on CentOS 8, so I could try with MongoDB and RabbitMQ both running externally on a CentOS 7 system and see what happens.
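Concretely, that split just means pointing the CentOS 8 node's st2.conf at the external boxes, something like this (hostname and credentials are placeholders):

    # /etc/st2/st2.conf (excerpt) - external MongoDB and RabbitMQ
    [database]
    host = centos7-services.example.local
    port = 27017

    [messaging]
    url = amqp://stackstorm:CHANGEME@centos7-services.example.local:5672/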

No joy there.

  • CentOS 7 node with RabbitMQ and MongoDB installed
  • CentOS 8 node with all the other StackStorm components

Same problem in that when the sensor tries to send in the trigger, the dispatch service doesn’t seem to complete.

I get stuck with the same messages:

  • RabbitMQ reports handshake_timeout
  • Adding extra debug makes it look like it never finishes the cls.query call made by ContentPackResource in
    st2common/persistence/base.py

Based on that information, it could be Python 3 vs Python 2 environment specific, as EL7 runs with py2 while EL8 is on py3.
Ubuntu 18, which is part of our OS support matrix, bundles StackStorm with a python3 virtualenv too.

Will repeat test on Ubuntu 18.

Still not sure if the RabbitMQ or MongoDB interaction is the key. Adding in debug, I have a problem in the ContentPackResource where it gets to this bit:

    result = cls.query(name=ref_obj.name,
                       pack=ref_obj.pack).first()

Further debug shows that the query succeeded, but the call to .first() doesn't return. Nothing comes out of the try/finally, so I'm not sure if the thread is just exiting.

Confirmed with extra debug that the size of the result set is 1.

Not sure if the Mongo debug output or the RabbitMQ errors are red herrings, as both are happening in parallel.
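One way to separate the two would be to re-run the exact find from the mongod log directly with pymongo, outside of the sensor container and eventlet. The connection string below assumes a default, auth-less local MongoDB; adjust for the external setup:

    # Reproduce the query st2 issues, from a plain Python shell.
    from pymongo import MongoClient

    client = MongoClient('mongodb://127.0.0.1:27017/')
    doc = client['st2']['trigger_type_d_b'].find_one(
        {'name': 'pattern_match', 'pack': 'ghost2logger'})
    print(doc)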

Is there any description of what logic is supposed to happen with sensor_service.dispatch, in particular the MongoDB and RabbitMQ interactions? I'm not sure which of the two errors I'm seeing is likely to be the leading cause, and which is a side effect.

I went back to basics and just created a basic sensor that extends Sensor, using the example from the StackStorm docs, and it all went OK and triggered correctly.
I see that the ghost2logger sensor uses the eventlet.monkey_patch call, and there have been some problems reported with that on more recent Pythons.
So I will concentrate on debugging ghost2logger and what it is doing with threads etc.
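For context, the call in question is the usual eventlet global patch done by (or on behalf of) the sensor, along these lines:

    # Illustrative - the kind of call referred to above.
    # eventlet.monkey_patch() replaces the stdlib socket/thread/select/time
    # modules with cooperative "green" versions; code that then starts real OS
    # threads (e.g. a threaded Flask server) can block or hang, reportedly more
    # so on recent Python 3 versions.
    import eventlet
    eventlet.monkey_patch()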

Will also try putting the ghost2logger service itself on CentOS 7.


The problem still exists if I run the Ghost2Logger service on CentOS 7.
The problem still exists if I send in a curl request to the Ghost2Logger service.

Given the above, that would suggest it's nothing to do with the Ghost2Logger binary itself.

I installed the example Flask sensor to prove that the environment was working with other sensors that are Flask apps, and all was OK.

I will try to re-model the Ghost2Logger service based on the examples' echo_flask_app.

If the sensor is changed to run the Flask app with threaded=False, then there isn't a problem when it is invoked by curl.
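In sensor terms, that means starting the embedded Flask app single-threaded from run(), roughly like the echo_flask_app example. The route, host and port below are placeholders, not the real ghost2logger ones:

    # Sketch of a sensor's run() body: serve the HTTP endpoint without threads.
    from flask import Flask

    app = Flask(__name__)

    @app.route('/matches', methods=['POST'])
    def matches():
        # In the real sensor this is where sensor_service.dispatch() would be
        # called with the posted syslog match.
        return 'OK'

    # threaded=False: requests are handled in a single (green) thread, which
    # avoids the hang seen with threaded=True under Python 3 + eventlet.
    app.run(host='0.0.0.0', port=9090, threaded=False)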

Problem resolved:


@armab Thanks for your help, see the last update. It turns out that if you have a sensor that runs a Flask app with threaded=True, then the problem I was seeing with ghost2logger manifests itself on Python 3.

However, applying @punkrokk's in-progress fix, which moves the monkey_patch placement in the ST2 code, resolves the issue.
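For anyone else hitting this before the PR is merged, the essence of the change is to do the eventlet monkey patching before anything else is imported in the sensor process; illustrated here with plain eventlet rather than the actual st2 code:

    # Do the global patch first, before importing modules that create
    # sockets/threads - this is the "monkeypatch earlier" idea behind the PR.
    import eventlet
    eventlet.monkey_patch()

    # ...only after this, import Flask, the sensor class, kombu, etc.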

@amanda11 Thanks for sharing your findings and the solution!

I couldn't provide any help here; glad that "Fixes: #4875 monkeypatch st2 sensor earlier" (Pull Request #4976 · StackStorm/st2 · GitHub) by @punkrokk fixes it. Please also include a note about this forum thread in that GitHub issue.
