What’s new with Microsoft in open-source and Kubernetes at KubeCon + CloudNativeCon Europe 2026


There’s a pattern in how complex technology matures. Early on, teams make their own choices: different tools, different abstractions, different ways of reasoning about failure. It looks like flexibility, but at scale it reveals itself as fragmentation.

The fix is never just more capability; it’s a shared operational philosophy. Kubernetes proved this. It didn’t just answer “how do we run containers?” It answered “how do we change running systems safely?” The community built those patterns, hardened them, and made them the baseline.

AI infrastructure is still in the chaotic phase. The shift from “working versus broken” to “good answers versus bad answers” is a fundamentally different operational problem, and it won’t get solved with more tooling. It gets solved the way cloud native did: open source creating the shared interfaces and community pressure that replace individual judgment with documented, reproducible practice.

That’s what we’re building toward. Since my last update at KubeCon + CloudNativeCon North America 2025, our teams have continued investing across open-source AI infrastructure, multi-cluster operations, networking, observability, storage, and cluster lifecycle. At KubeCon + CloudNativeCon Europe 2026 in Amsterdam, we’re sharing several announcements that reflect that same goal: bring the operational maturity of Kubernetes to the workloads and demands of today.

Building the open source foundation for AI on Kubernetes

The convergence of AI and Kubernetes infrastructure means that gaps in AI infrastructure and gaps in Kubernetes infrastructure are increasingly the same gaps. A significant part of our upstream work this cycle has been building the primitives that make GPU-backed workloads first-class citizens in the cloud-native ecosystem.

On the scheduling side, Microsoft has been collaborating with industry partners to advance open standards for hardware resource management. Key milestones include:

  • Dynamic Resource Allocation (DRA) has graduated to general availability, with the DRA example driver and DRA Admin Access also shipping as part of that work.
  • Workload Aware Scheduling for Kubernetes 1.36 adds DRA support in the Workload API and drives integration into KubeRay, making it more straightforward for developers to request and manage high-performance infrastructure for training and inference.
  • DRANet now includes upstream compatibility for Azure RDMA Network Interface Cards (NICs), extending DRA-based network resource management to high-performance hardware where GPU-to-NIC topology alignment directly impacts training performance.
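To make the DRA model above concrete, a workload requests a device through a claim rather than an opaque extended resource count. The sketch below uses placeholder names throughout (the device class, pod, and image are illustrative, and exact field names can vary between DRA API versions), so treat it as a shape, not a recipe:

```yaml
# Minimal DRA sketch; gpu.example.com and the image are placeholders,
# and the request schema may differ slightly by Kubernetes version.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com   # published by the vendor's DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu    # one claim instantiated per pod
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu                            # container consumes the claimed device
```

The scheduler allocates a device satisfying the claim and places the pod accordingly, which is what makes topology-aware features like GPU-to-NIC alignment expressible at all.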

Beyond scheduling, we’ve continued investing in the tooling needed to deploy, operate, and secure AI workloads on Kubernetes:

  • AI Runway is a new open-source project that introduces a common Kubernetes API for inference workloads, giving platform teams a centralized way to manage model deployments and adopt new serving technologies as the ecosystem evolves. It ships with a web interface for users who shouldn’t need to know Kubernetes to deploy a model, along with built-in Hugging Face model discovery, GPU memory fit indicators, real-time cost estimates, and support for runtimes including NVIDIA Dynamo, KubeRay, llm-d, and KAITO.
  • HolmesGPT has joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project, bringing agentic troubleshooting capabilities into the shared cloud-native tooling ecosystem.
  • Dalec, a newly onboarded CNCF project, defines declarative specs for building system packages and producing minimal container images, with support for SBOM generation and provenance attestations at build time. Reducing attack surface and common vulnerabilities and exposures at the build stage matters for any organization trying to run AI workloads responsibly at scale.
  • Cilium also received a broad set of Microsoft contributions this cycle, including native mTLS ztunnel support for sidecarless encrypted workload communication, Hubble metrics cardinality controls for managing observability costs at scale, flow log aggregation to reduce storage volume, and two merged Cluster Mesh Cilium Feature Proposals (CFPs) advancing cross-cluster networking.
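The declarative-spec idea behind Dalec can be sketched as a small YAML file describing what to build rather than how. The fields below are recalled from the project’s published examples and may not match the current schema exactly; the package name and repository are placeholders:

```yaml
# Hypothetical Dalec-style spec; field names approximate the project's
# examples and the source URL is a placeholder.
name: example-tool
version: 1.0.0
revision: "1"
license: MIT
description: Example system package built declaratively
sources:
  src:
    git:
      url: https://github.com/example/example-tool.git
```

Because the build is described declaratively, the toolchain can emit SBOMs and provenance attestations as byproducts of the same spec, rather than as a separate pipeline step.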

What’s new in Azure Kubernetes Service

Alongside our upstream contributions, I’m happy to share new capabilities in Azure Kubernetes Service (AKS) across networking and security, observability, multi-cluster operations, storage, and cluster lifecycle management.

From IP-based controls to identity-aware networking

As Kubernetes deployments grow more distributed, IP-based networking becomes harder to reason about: visibility degrades, security policies grow difficult to audit, and encrypting workload communication has historically required either a full service mesh or a significant amount of custom work. Our networking updates this cycle close that gap by shifting security and traffic intelligence to the application layer, where it’s both more meaningful and easier to operate.

Azure Kubernetes Application Network gives teams mutual TLS, application-aware authorization, and detailed traffic telemetry across ingress and in-cluster communication, with built-in multi-region connectivity. The result is identity-aware security and real traffic insight without the overhead of running a full service mesh. For teams managing the deprecation of ingress-nginx, Application Routing with Meshless Istio offers a standards-based path forward: Kubernetes Gateway API support without sidecars, continued support for existing ingress-nginx configurations, and contributions to ingress2gateway for teams moving incrementally.
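For teams planning that migration, the Gateway API equivalent of a simple ingress rule looks roughly like the following. The gateway, hostname, and service names are placeholders, not anything provisioned for you:

```yaml
# Standard Gateway API route; parentRefs and backendRefs are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: example-gateway        # a Gateway provisioned by the platform
  hostnames:
  - "app.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: app-service          # the backing Service, as in an Ingress rule
      port: 80
```

Tools like ingress2gateway generate routes of this shape from existing Ingress resources, which is what makes an incremental move practical.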

At the data plane level, WireGuard encryption with the Cilium data plane secures node-to-node traffic efficiently and without application changes. Cilium mTLS in Advanced Container Networking Services extends that to pod-to-pod communication using X.509 certificates and SPIRE for identity management: authenticated, encrypted workload traffic without sidecars. Rounding this out, Pod CIDR expansion removes a long-standing operational constraint by allowing clusters to grow their pod IP ranges in place rather than requiring a rebuild, and administrators can now disable HTTP proxy variables for nodes and pods without touching control plane configuration.

Visibility that matches the complexity of modern clusters

Operating Kubernetes at scale is only manageable with clear, consistent visibility into infrastructure, networking, and workloads. Two persistent gaps we’ve been closing are GPU telemetry and network traffic observability, both of which become more critical as AI workloads move into production.

Teams running GPU workloads have often had a significant monitoring blind spot: GPU utilization simply wasn’t visible alongside standard Kubernetes metrics without manual exporter configuration. AKS now surfaces GPU performance and utilization directly into managed Prometheus and Grafana, putting GPU telemetry into the same stack teams are already using for capacity planning and alerting. On the network side, per-flow L3/L4 and supported L7 visibility across HTTP, gRPC, and Kafka traffic is now available, including IPs, ports, workloads, flow direction, and policy decisions, with a new Azure Monitor experience that brings built-in dashboards and one-click onboarding. For teams dealing with the inverse problem (metric volume rather than metric gaps), operators can now dynamically control which container-level metrics are collected using Kubernetes custom resources, keeping dashboards focused on actionable signals. Agentic container networking adds a web-based interface that translates natural-language queries into read-only diagnostics using live telemetry, shortening the path from “something’s wrong” to “here’s what to do about it.”

Simpler operations across clusters and workloads

For organizations running workloads across multiple clusters, cross-cluster networking has historically meant custom plumbing, inconsistent service discovery, and limited visibility across cluster boundaries. Azure Kubernetes Fleet Manager now addresses this with cross-cluster networking through a managed Cilium cluster mesh, providing unified connectivity across AKS clusters, a global service registry for cross-cluster service discovery, and intelligent routing with configuration managed centrally rather than repeated per cluster.
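In upstream Cilium cluster mesh, cross-cluster service discovery is typically opted into per service with an annotation; a managed mesh may expose this differently, so treat the manifest below as an upstream-Cilium sketch (with a placeholder service name) rather than Fleet Manager’s exact surface:

```yaml
# Upstream Cilium sketch; "checkout" is a placeholder service.
apiVersion: v1
kind: Service
metadata:
  name: checkout
  annotations:
    service.cilium.io/global: "true"   # merge endpoints across meshed clusters
spec:
  selector:
    app: checkout
  ports:
  - port: 8080
```

Clients keep calling the same in-cluster service name; the mesh resolves it to healthy endpoints in any participating cluster, which is the property a global service registry generalizes.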

On the storage side, clusters can now consume storage from a shared Elastic SAN pool rather than provisioning and managing individual disks per workload. This simplifies capacity planning for stateful workloads with variable demands and reduces provisioning overhead at scale.
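From the workload’s point of view, a shared pool doesn’t change the Kubernetes surface: a stateful workload still claims a volume against a storage class, and only the class points at the pool. The class name below is a placeholder for whatever the cluster exposes:

```yaml
# Ordinary PVC; "elastic-san" is a placeholder StorageClass assumed to be
# backed by the shared pool.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: elastic-san
  resources:
    requests:
      storage: 100Gi
```

Capacity is then drawn from the pool on demand instead of being pre-sized per disk, which is where the planning overhead disappears.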

For teams that want a more accessible entry point to Kubernetes itself, AKS desktop is now generally available. It brings a full AKS experience to your desktop, making it simple for developers to run, test, and iterate on Kubernetes workloads locally with the same configuration they’ll use in production.

Safer upgrades and faster recovery

The cost of a bad upgrade compounds quickly in production, and recovery from one has historically been time-consuming and stressful. Several updates this cycle focus specifically on making cluster changes safer, more observable, and more reversible.

Blue-green agent pool upgrades create a parallel pool with the new configuration rather than applying changes in place, so teams can validate behavior before shifting traffic and keep a clear rollback path if something looks wrong. Agent pool rollback complements this by allowing teams to revert a node pool to its previous Kubernetes version and node image when problems surface after an upgrade (with no full rebuild). Together, these give operators meaningful control over the upgrade lifecycle rather than a choice between “upgrade and hope” or “stay behind.” For faster provisioning during scale-out events, ready image specification lets teams define custom node images with preloaded containers, operating system settings, and initialization scripts, reducing startup time and improving consistency for environments that need rapid, repeatable provisioning.

Connect with the Microsoft Azure team in Amsterdam

The Azure team is excited to be at KubeCon + CloudNativeCon Europe 2026. A few highlights of where to connect with the Azure team on the ground:

Happy KubeCon + CloudNativeCon!


