We forgot to tag a Kubernetes namespace. Zero Trust broke.
I was deep in a review of our network segmentation policies in AWS last week. We’ve got the big picture covered with VPCs and security groups, but the real granular control happens inside Kubernetes. We’ve got tools like GuardDuty flagging suspicious traffic, which is great. But the actual foundation of our Zero Trust network is built on microsegmentation.
Our core principle is simple: nothing talks to anything unless we explicitly allow it. That’s the Zero Trust dream, right? A key part of how we manage this is by tagging our Kubernetes namespaces with labels. Prod namespaces get a label like env:prod, dev gets env:dev, and so on. We then select on these labels in our network policies. If a namespace doesn’t have the right tag, it can’t talk to anything critical.
So, I was digging into some recent alerts. At first glance, we had traffic between services in the prod-a and prod-b namespaces getting blocked. This was a red flag. Both are critical production namespaces, and they absolutely need to communicate.
My first thought was to check the network policy for prod-a. It clearly stated that it allowed egress traffic to any namespace with the env:prod tag. That looked correct. Then I checked the policy for prod-b. It allowed ingress traffic from any namespace with the env:prod tag. That also looked correct.
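For readers who want to picture a rule like this, here is a minimal sketch of an ingress policy expressed as a standard Kubernetes NetworkPolicy manifest, built in Python. It is not our actual policy: the function name and the policy name are made up for illustration, and I’m assuming the rule selects on namespace labels through a namespaceSelector.

```python
import json


def ingress_from_env_policy(namespace: str, env: str) -> dict:
    """Build a NetworkPolicy that admits ingress only from namespaces labeled env=<env>.

    Hypothetical helper for illustration; the policy name is an assumption.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-env-{env}", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                # Only namespaces carrying the label env=<env> may connect.
                {"from": [{"namespaceSelector": {"matchLabels": {"env": env}}}]}
            ],
        },
    }


# JSON is also valid input for `kubectl apply -f`, so the output can be applied as-is.
print(json.dumps(ingress_from_env_policy("prod-b", "prod"), indent=2))
```

Once a policy like this selects the pods in a namespace, any ingress it does not list is dropped, which is why a source namespace without the label never gets through.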
Yet, the traffic was still getting blocked. This was weird. We use Calico for network policy enforcement within Kubernetes. It’s a robust tool, and it relies on labels applied to pods and namespaces. The policies should have matched.
I started drilling down, checking individual pods within prod-a and prod-b. Their labels seemed fine. Then I double-checked the labels on the namespaces themselves. prod-a had env:prod. prod-b also had env:prod. Everything seemed to be in order.
Then it hit me. I was only looking at the namespaces that were involved in the communication. What about the namespace that was initiating the connection to prod-a?
It wasn’t prod-b. The alerts showed traffic originating from a different source. It was a new internal tool we had just deployed. Let’s call it ops-utils. It’s a handy little utility for some backend tasks.
I pulled up the details for the ops-utils namespace. And there it was. No env tag. Nothing. A brand-new, unlabeled entity in our cluster.
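A check like this is easy to automate across the whole cluster. The sketch below assumes you have saved the output of kubectl get namespaces -o json and want the names of namespaces with no env label. The function name is hypothetical and the sample data is hand-written in that output’s shape, not copied from our cluster.

```python
def namespaces_missing_label(namespace_list: dict, key: str = "env") -> list[str]:
    """Return the names of namespaces whose metadata.labels lacks `key`."""
    missing = []
    for item in namespace_list.get("items", []):
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}  # labels may be absent or null
        if key not in labels:
            missing.append(meta.get("name", "<unnamed>"))
    return missing


# Hand-written sample in the shape of `kubectl get namespaces -o json` (illustrative only).
sample = {
    "items": [
        {"metadata": {"name": "prod-a", "labels": {"env": "prod"}}},
        {"metadata": {"name": "prod-b", "labels": {"env": "prod"}}},
        # Recent Kubernetes versions add this name label automatically; it is not an env tag.
        {"metadata": {"name": "ops-utils",
                      "labels": {"kubernetes.io/metadata.name": "ops-utils"}}},
    ]
}

print(namespaces_missing_label(sample))  # ['ops-utils']
```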
Calico, dutifully enforcing our policy, saw traffic attempting to flow from ops-utils to prod-a. The network policy for prod-a explicitly said, “only allow traffic originating from namespaces with the env:prod tag.” Since ops-utils had no env tag at all, it didn’t match the criteria. Blocked.
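The matching logic is worth spelling out, because it explains why “no tag” and “wrong tag” end in the same result. This is a toy model of matchLabels semantics under default deny, not Calico’s implementation; the function names are made up for illustration.

```python
def match_labels(selector: dict, labels: dict) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(source_ns_labels: dict, allowed_selectors: list) -> bool:
    """Default deny: allow only if at least one selector matches the source namespace."""
    return any(match_labels(sel, source_ns_labels) for sel in allowed_selectors)


# prod-a admits traffic only from namespaces labeled env=prod.
prod_a_ingress = [{"env": "prod"}]

namespaces = {
    "prod-b": {"env": "prod"},
    "ops-utils": {},  # no env label at all
}

for name, labels in namespaces.items():
    verdict = "allowed" if ingress_allowed(labels, prod_a_ingress) else "blocked"
    print(f"{name} -> prod-a: {verdict}")
# prod-b -> prod-a: allowed
# ops-utils -> prod-a: blocked
```

A namespace labeled env:dev would be blocked by the same rule. The selector only ever asks “does this match?”, so an absent label fails closed.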
It was a simple oversight. We had followed the standard deployment checklist for ops-utils, but somewhere along the line, the network policy requirement (applying the correct env tag) got missed. It’s a stark reminder that in Zero Trust, it’s not just about having the policies in place; it’s about ensuring every component adheres to them. A missing label can completely break the traffic your segmentation is supposed to allow.
This experience reinforced a critical point: Zero Trust isn’t a static product you install. It’s an ongoing process, a discipline. And sometimes, that discipline is tested by the smallest of things, like a forgotten tag. You feel like you’ve locked down the entire fortress, only to realize you left the side gate unlatched because the tag that secured it fell off.
We immediately added the env:prod tag to the ops-utils namespace. The alerts stopped, and normal communication resumed. All systems were go. But it was one of those moments that makes your heart skip a beat.
To prevent this from happening again, we’ve implemented a stricter pre-deployment validation. Any attempt to deploy a new namespace without a valid env tag will now be automatically rejected. It adds a tiny bit of friction to the deployment process, sure, but it’s a small price to pay for peace of mind. It’s infinitely less stressful than having to explain to management why a new internal utility inadvertently caused a production outage.
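One common way to build a gate like this is a validating admission webhook, and the sketch below shows the decision logic such a webhook might run on a Namespace create request. This is an assumption for illustration, not a description of our actual mechanism. The set of allowed env values is also assumed (only the two named in this post), and the function name is hypothetical. The request and response shapes follow the Kubernetes admission.k8s.io/v1 AdmissionReview format.

```python
ALLOWED_ENVS = {"prod", "dev"}  # assumption: only the values named in this post


def review_namespace(admission_review: dict) -> dict:
    """Allow a Namespace only if it carries an env label with an accepted value."""
    request = admission_review["request"]
    labels = request["object"].get("metadata", {}).get("labels") or {}
    env = labels.get("env")
    allowed = env in ALLOWED_ENVS

    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        # The message is returned to whoever tried to create the namespace.
        response["status"] = {
            "message": f"namespace needs an env label in {sorted(ALLOWED_ENVS)}, got {env!r}"
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Hand-written request for an unlabeled namespace (illustrative only).
untagged = {
    "request": {
        "uid": "example-uid",
        "object": {"metadata": {"name": "ops-utils"}},
    }
}
print(review_namespace(untagged)["response"]["allowed"])  # False
```

Rejecting at creation time puts the failure in front of the person deploying, rather than leaving it to surface later as blocked traffic.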