Kubernetes Chronicles 1 — NiFi node refuses to rejoin cluster

Kubernetes is awesome. It is all that it is made out to be and sometimes even more. But it’s not always rainbows and butterflies. With this series, I hope to chronicle my experiences running workloads on Kubernetes (AWS EKS). While I’ll be writing about things that are very specific to the technology stack at my current workplace, I’m hoping that some of what I document here will be useful to folks working on other tech stacks too.

So among other things, we use Apache NiFi (fantastic tool) to build data flows. We have a three-node cluster that is deployed on Kubernetes. I had written about the setup previously here. What is important from that post is that as part of the deployment, we also deploy a nifi-ca server. This is essentially a NiFi container that runs the tls-toolkit in server mode. When any NiFi pod starts up, it uses its tls-toolkit in client mode to send a CSR to nifi-ca and gets a signed certificate back. Each NiFi node uses this certificate to authenticate itself when it communicates with other nodes or tries to join a cluster.

We configured the NiFi pods so that they run on dedicated machines and at most one NiFi pod runs on any given Kubernetes node (node affinity, pod anti-affinity). We didn’t put any restrictions on the nifi-ca pod, except that there is always a single instance of it running (deployed as a replica set with the replica count set to 1).

Every now and then, the kubelet on some node in the Kubernetes cluster starts to act up and becomes unreachable to the k8s master. This causes the node to be tainted and marked unschedulable, effectively taking the node down. Every now and then, the node that goes down happens to be one that is running a NiFi pod. No biggie: all we have to do is either restart the kubelet or terminate the underlying EC2 instance so that a new one can attach itself to the worker node group. We prefer the latter as it is much easier. (Restarting the kubelet would require privileged containers to chroot into the worker node’s file system and run privileged commands, which is not the most secure way of doing things.) A new worker node then joins the Kubernetes cluster, the NiFi pod that was waiting to be scheduled lands on the new node, it boots up, joins the NiFi cluster, and our data flows are back up and running.

So I thought I had this all figured out until one time, after a NiFi pod recovered from an incident like the one described above, it refused to join the existing cluster. The logs were filled with entries like this one:

WARN [Clustering Tasks Thread-1] o.apache.nifi.controller.FlowController Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed marshalling 'HEARTBEAT' protocol message due to: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors

Clearly, this is NOT a problem of discovery. The NiFi pod IS able to connect to Zookeeper and knows where it needs to send a heartbeat to. It is trying to send the heartbeat to the cluster leader but is unable to because of a failed SSL Handshake. WTAF.

Going through some of NiFi’s mail archives (this specifically), I figured that I had to look at the owner of the trust store to understand what might be going wrong. So I did. I ran the following command on all three NiFi pods:

keytool -list -v -keystore conf/keystore.jks

The above command lists the certificates present in the keystore. The first certificate is owned by the node. The second certificate (I assume this is the one that was used to sign the node’s certificate) is owned by nifi-ca. I compared the details of the nifi-ca certificate (serial number and MD5 fingerprint) across all the nodes. Here’s what I got:

Serial numbers:
nifi-0: 1719cf1419e00000000
nifi-1: 1719cf1419e00000000
nifi-2: 171a84b0e8c00000000 (different from the first two)
MD5 fingerprints:
nifi-0: 40:93:D5:EE:B8:F6:AA:FD:7B:37:2A:50:D6:EA:8C:05
nifi-1: 40:93:D5:EE:B8:F6:AA:FD:7B:37:2A:50:D6:EA:8C:05
nifi-2: 44:15:78:60:CE:5C:19:38:A6:84:04:6F:6A:75:5A:B8 (different from the first two)

nifi-2, the NiFi pod that got evicted and rescheduled, held a nifi-ca certificate with a different serial number and fingerprint from the other two pods. That is why it could not send the heartbeat to the cluster leader: the SSL handshake failed. What gives?
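
The comparison above is easy to script if it has to be repeated. Below is a minimal Python sketch, assuming the `keytool -list -v` output of each pod has already been captured as text. The function names are made up for illustration. It also assumes that the output uses the `Owner:`, `Serial number:` and `MD5:` labels (some JDK versions may print only SHA fingerprints, in which case the label would need changing) and that the CA certificate's owner string contains `nifi-ca`.

```python
from collections import defaultdict


def parse_certificates(keytool_output):
    """Return one dict per certificate found in `keytool -list -v` text."""
    certs = []
    current = None
    for raw_line in keytool_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Owner:"):
            # Assumption: every certificate section starts with its Owner line.
            current = {"owner": line.split(":", 1)[1].strip()}
            certs.append(current)
        elif current is not None and line.startswith("Serial number:"):
            current["serial"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("MD5:"):
            current["md5"] = line.split(":", 1)[1].strip()
    return certs


def ca_certificate(certs, ca_name="nifi-ca"):
    """Pick the first certificate whose owner mentions the CA name."""
    for cert in certs:
        if ca_name in cert["owner"]:
            return cert
    return None


def group_pods_by_ca(outputs_by_pod, ca_name="nifi-ca"):
    """Map (serial, md5) of the CA certificate to the pods that hold it."""
    groups = defaultdict(list)
    for pod, output in outputs_by_pod.items():
        ca = ca_certificate(parse_certificates(output), ca_name)
        key = (ca.get("serial"), ca.get("md5")) if ca else None
        groups[key].append(pod)
    return dict(groups)
```

If `group_pods_by_ca` returns more than one key, the pods do not all hold the same CA certificate. With the values listed above, nifi-0 and nifi-1 would fall into one group and nifi-2 into another.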

So I sifted through the Kubernetes events and noticed that the nifi-2 and nifi-ca pods had initially been scheduled on the same worker node. When that worker node went down, the nifi-ca pod was evicted and scheduled on a different node, so nifi-ca was effectively rebooted on a different machine. The nifi-2 pod then requested a signed certificate from the newly restarted nifi-ca. The nifi-ca did issue a signed certificate to nifi-2, but since its own certificate now had a different serial number and fingerprint, the other two nodes could not chain nifi-2’s certificate to the nifi-ca certificate they trusted. This is what was causing the SSL handshake failure. The only way I could think of to fix this was to restart the other two NiFi pods as well. In doing so, the fingerprint “drift” was eliminated and all three pods were able to form a cluster again.
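
To make the failure mode concrete, here is a toy model in Python. It is only an illustration of the reasoning above, not NiFi or TLS code: `ToyCA`, `ToyNode` and `can_handshake` are invented names, and a random hex string stands in for the CA certificate's fingerprint. The one assumption it encodes is that a restarted nifi-ca comes back with a new CA certificate, which is what the differing serial numbers suggest.

```python
import uuid


class ToyCA:
    """Stand-in for nifi-ca: every new instance has a new CA certificate."""

    def __init__(self):
        self.fingerprint = uuid.uuid4().hex


class ToyNode:
    """Stand-in for a NiFi pod holding a signed certificate and a trust store."""

    def __init__(self, name):
        self.name = name
        self.issuer = None    # fingerprint of the CA that signed this node's cert
        self.trusted = set()  # CA fingerprints this node accepts

    def enroll(self, ca):
        # On startup the node gets a certificate signed by the current CA
        # and trusts only that CA.
        self.issuer = ca.fingerprint
        self.trusted = {ca.fingerprint}


def can_handshake(a, b):
    """Mutual TLS succeeds only if each side trusts the other's issuer."""
    return a.issuer in b.trusted and b.issuer in a.trusted


ca = ToyCA()
nodes = [ToyNode(f"nifi-{i}") for i in range(3)]
for node in nodes:
    node.enroll(ca)
print(can_handshake(nodes[0], nodes[2]))  # True: same CA everywhere

ca = ToyCA()          # nifi-ca is rescheduled and restarts
nodes[2].enroll(ca)   # nifi-2 restarts and enrolls with the new CA
print(can_handshake(nodes[0], nodes[1]))  # True: both still on the old CA
print(can_handshake(nodes[0], nodes[2]))  # False: issuers no longer match

for node in nodes[:2]:  # restarting the other two pods re-enrolls them
    node.enroll(ca)
print(can_handshake(nodes[0], nodes[2]))  # True again
```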

The lesson learnt here is that the nifi-ca pod should not be scheduled on the same nodes as the NiFi pods. This way, a worker node going down with a NiFi pod on it does not also restart nifi-ca, and there is no drift in the signed certificate’s fingerprint. The nifi-ca pod can of course still be evicted from its own node, be scheduled on a different one, and cause a “fingerprint drift”. I’m still trying to figure out how this can be avoided, or the best way to recover if it happens. Any thoughts on this would be much appreciated.
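
One way to express that scheduling constraint is a pod anti-affinity rule on the nifi-ca pod template. The Python sketch below builds such a stanza as a plain dictionary and prints it as JSON. The helper name and the `app: nifi` label are assumptions for illustration (use whatever labels the NiFi pods actually carry). The field names themselves (`podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector`, `topologyKey`) are standard Kubernetes pod spec fields.

```python
import json


def anti_affinity_against(label_key, label_value,
                          topology_key="kubernetes.io/hostname"):
    """Build an affinity stanza that keeps a pod off any node that already
    runs a pod carrying the given label."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    },
                    # One topology domain per worker node.
                    "topologyKey": topology_key,
                }
            ]
        }
    }


if __name__ == "__main__":
    # Hypothetical label: assumes the NiFi pods are labelled app=nifi.
    stanza = {"affinity": anti_affinity_against("app", "nifi")}
    print(json.dumps(stanza, indent=2))
```

The printed `affinity` block would go under the pod template's `spec` in the nifi-ca manifest. This only keeps nifi-ca away from the NiFi pods; it does nothing about nifi-ca being evicted from its own node.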
