While I was trying to extend a working single-SGM ElasticXL cluster (R82 JHF Take 14), the new SGM was not detected by the SMO: it was not available for selection in the WebUI or gClish, and nothing appeared in Insights' Alerts & Events.
The SMO was, however, receiving the REQUEST_TO_JOIN messages from the SGM to be added:
[Expert@epm-91-s01-01:0]# tcpdump -vnni any -s 0 -A udp port 1135
13:06:58.210605 IP (tos 0x0, ttl 64, id 37360, offset 0, flags [DF], proto UDP (17), length 772)
192.0.2.254.1135 > 192.0.2.255.1135: UDP, length 744
E.....@.@............o.o....{"hostname": "epm-92", "serial-number": "VMware-564ddfb4a3eca500-789acd5812312d8b", "public-key": "trimmed_for_brevity", "request-id": "a20625ea16376ae5aa0784d28e972bdb", "model": "VMware", "version": "R82", "state": "REQUEST_TO_JOIN"}
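For readability, the JSON payload carried in that UDP packet can be pulled apart with a few lines of Python. This is purely an illustration of the message structure seen in the capture (with the public key trimmed, as above), not part of any Check Point tooling:

```python
import json

# The UDP payload captured on port 1135, reassembled as one string.
payload = (
    '{"hostname": "epm-92", '
    '"serial-number": "VMware-564ddfb4a3eca500-789acd5812312d8b", '
    '"public-key": "trimmed_for_brevity", '
    '"request-id": "a20625ea16376ae5aa0784d28e972bdb", '
    '"model": "VMware", "version": "R82", "state": "REQUEST_TO_JOIN"}'
)

msg = json.loads(payload)
# The state field is what marks this as a join request from epm-92.
print(msg["state"], msg["hostname"])   # REQUEST_TO_JOIN epm-92
```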
So what was the EXL detection daemon doing? There was not much in /var/log/exl_detectiond.log, so I had a look at /opt/ElasticXL/exl_detection/src/exl_detectiond.py and enabled debug logging (brutally, by changing default='info' to default='debug' in the --log-level argument definition, so that it reads: parser.add_argument('--log-level', choices=['info', 'debug', 'error', 'warning'], default='debug', help='Set logging level')).
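The change amounts to flipping an argparse default. Here is a minimal, self-contained sketch of that fragment (the surrounding daemon code is omitted, and the exact context in exl_detectiond.py may differ):

```python
import argparse

parser = argparse.ArgumentParser(description='exl_detectiond-style log-level option')
# The shipped default is 'info'; changing it to 'debug' turns on verbose
# logging without having to change how the process manager launches the daemon.
parser.add_argument('--log-level',
                    choices=['info', 'debug', 'error', 'warning'],
                    default='debug',
                    help='Set logging level')

args = parser.parse_args([])   # no CLI flags passed, so the default applies
print(args.log_level)          # debug
```

Editing the default rather than the launch command matters here because the daemon is started by the process manager, not by hand.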
Then I did a tellpm process:exl_detectiond ; tellpm process:exl_detectiond t to restart the daemon and pick up the change.
Now /var/log/exl_detectiond.log confirmed that the request to join was received and the database was even updated, but after querying the database for members, the daemon sent its reply message to itself (i.e. to "member-id": 1, "site-id": 1) instead of to the joining SGM.
So I ran /usr/libexec/exl_detectiond --purge-db (which simply deletes the ElasticXL keys in Redis) and restarted exl_detectiond (the tellpm way) while tailing the log: autodetection resumed, and I was happy to monitor each and every step of the cluster join in Insights (with the DEBUG flag enabled in Alerts & Events).
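Conceptually, the purge just drops the daemon's stale state keys so that discovery starts from a clean slate on restart. A toy simulation of that idea, with a plain dict standing in for Redis (the key names and layout are my invention for illustration, not the actual ElasticXL schema):

```python
# Toy model of the stale-state problem: a dict stands in for Redis.
# Key names below are invented for illustration only.
store = {
    "exl:members:1": '{"member-id": 1, "site-id": 1}',   # stale entry: the SMO itself
    "exl:request:abc123": "...",                          # leftover join bookkeeping
}

def purge_db(db, prefix="exl:"):
    """Delete every key under the given prefix, as a --purge-db conceptually does."""
    for key in [k for k in db if k.startswith(prefix)]:
        del db[key]

purge_db(store)
print(store)   # {} -- after a restart, autodetection rebuilds state from scratch
```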
Probably this is not the recommended way (if one is documented at all), but I'm using it as an example of how much we improved (tremendously, if I may say) the relevance of the debug messages and the usability of the troubleshooting tools.