Kubernetes CapacityScheduling / ElasticQuota: Namespace-Level Capacity Reservation

Bottom line

Kubernetes CapacityScheduling is an out-of-tree scheduler plugin (KEP #9) that implements namespace-level elastic quotas via the ElasticQuota CRD, allowing workloads to borrow unused capacity from other namespaces while guaranteeing minimum allocations through pod preemption. It remains at "Sample" maturity - the lowest tier in the scheduler-plugins project - and has a known architectural limitation: no cross-node preemption, which can cause resource fragmentation. Independent benchmarks (Purdue PEARC '24) show it can improve cluster utilization by roughly 3× compared to static ResourceQuota and reduce workflow run times by 2–3.4×, though a competing scheduler (YuniKorn) achieves slightly better results in most metrics. For production use, large organizations (Alibaba Cloud, ByteDance via Koordinator) have extended the concept with hierarchical quota trees and eviction policies. The Kubernetes community is increasingly steering batch quota management toward Kueue, a job-queueing controller that works with the default scheduler rather than replacing it.

Key findings

  • Finding: CapacityScheduling implements a min/max elastic quota model per namespace, inspired by Hadoop YARN. When a namespace's usage is below its min, it can preempt pods from namespaces that have borrowed resources above their own min. This is fundamentally different from native ResourceQuota, which only enforces a hard upper bound at admission time.
  • Finding: The plugin's maturity is officially rated 💡 Sample ("for demonstrating and inspiring purpose"). This is explicitly not labeled Alpha, Beta, or Stable, signaling that SIG Scheduling doesn't currently recommend it for unmodified production workloads.
  • Finding: A controlled academic benchmark on a 384-core cluster found that CapacityScheduling increased utilization from 26% (with native ResourceQuotas) to 77% - comparable to YuniKorn's 80%. CapacityScheduling also delivered faster time-to-quota preemption than YuniKorn, but YuniKorn achieved better overall workflow run times and queue-time fairness.
  • Finding: Cross-node preemption isn't supported because the default Kubernetes scheduler lacks this capability. In practice, a pod requiring 2 CPUs may fail to reclaim borrowed resources if those resources are split across multiple nodes, leading to fragmentation. The KEP recommends keeping sum(min) < total cluster capacity to mitigate this.
  • Finding: Major cloud providers have forked or extended the concept. Alibaba Cloud ACK implements ElasticQuotaTree (tree-structured hierarchical quotas with v1beta1 API) and ack-kube-queue for job queueing. Koordinator (open-sourced by Alibaba/ByteDance) adds hierarchical quota groups, shared-weight fairness, and automatic pod revocation when used > runtime.

Background

CapacityScheduling originated in the kubernetes-sigs/scheduler-plugins repository under SIG Scheduling. The design, proposed in KEP #9, was motivated by the need to run batch workloads (especially ML/DL training jobs) more efficiently on Kubernetes. Native ResourceQuota enforces limits at pod creation time based on resource requests in the pod spec, which can lead to low utilization when pods claim resources but fail to schedule or fail at runtime. The "ElasticQuota" concept - with min (guaranteed) and max (upper bound) fields - was borrowed directly from the Hadoop YARN Capacity Scheduler.

The ElasticQuota CRD is namespace-scoped and tracked by the CapacityScheduling plugin through the scheduler framework's PreFilter (validate quota max), PostFilter (preemption logic), and Reserve (atomic usage update) extension points. The plugin maintains an in-memory cache of quota state and supports two preemption modes:

  • Cross-namespace: When the preemptor's usage is ≤ its min, it can reclaim resources from namespaces using more than their min.
  • In-namespace: When the preemptor's usage > its min, it can only preempt lower-priority pods within its own namespace.

Current state

As of mid-2026, CapacityScheduling remains in the scheduler-plugins repo with:

  • API version: scheduling.x-k8s.io/v1alpha1
  • Maturity: 💡 Sample
  • Compatibility: Tied to Kubernetes scheduler framework versions; the scheduler-plugins project releases patch versions aligned with k8s client packages (e.g., v0.18.X for k8s 1.18).

The plugin is actively used in production by downstream forks, most notably:

  • Alibaba Cloud ACK: Uses a patched scheduler with CapacityScheduling, ElasticQuotaTree, and ack-kube-queue. Their changelogs show ongoing fixes (e.g., Nov 2024: "Fixed elastic quota preemption being triggered even when ElasticQuotaTree was absent"; Jan 2026: "Optimized ElasticQuota Min/Max Guarantee logic").
  • ByteDance/Alibaba Koordinator: Implements a more advanced hierarchical elastic quota system with pod revocation, multi-quota-tree support, and webhook validation.

Meanwhile, Kueue (introduced in Kubernetes blog Oct 2022, now a mature SIG Scheduling subproject) has become the community's preferred path for batch quota management. Kueue operates at the job level rather than the pod scheduling level, using the suspend field and mutable scheduling directives to gate job admission while delegating actual pod placement to the default kube-scheduler.

Technical or implementation details

ElasticQuota CRD:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota1
  namespace: quota1
spec:
  max:
    cpu: 6
    memory: 12Gi
  min:
    cpu: 4
    memory: 8Gi
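
Before applying a manifest like the one above, it can help to sanity-check that no min exceeds its max. The following is a minimal Python sketch, not part of the plugin: it assumes that min should never be larger than max, the helper names parse_quantity and min_exceeds_max are hypothetical, and the parser covers only a small subset of Kubernetes quantity syntax (plain numbers, the "m" milli suffix, and common decimal/binary suffixes).

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; binary first, then decimal.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(value):
    """Convert a quantity such as 6, "500m", or "12Gi" to a Decimal."""
    text = str(value)
    if text.endswith("m"):  # milli-units, e.g. 500m CPU = 0.5 CPU
        return Decimal(text[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def min_exceeds_max(spec):
    """Return the resource names whose min is greater than max."""
    return [
        name
        for name, value in spec["min"].items()
        if name in spec["max"]
        and parse_quantity(value) > parse_quantity(spec["max"][name])
    ]


# The spec section of the quota1 manifest shown above.
spec = {
    "max": {"cpu": 6, "memory": "12Gi"},
    "min": {"cpu": 4, "memory": "8Gi"},
}
violations = min_exceeds_max(spec)  # empty list for this manifest
```

A check like this would typically run in CI or a pre-apply script; it does not replace whatever validation the cluster itself performs.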

Scheduler configuration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: CapacityScheduling
    postFilter:
      enabled:
      - name: CapacityScheduling
      disabled:
      - name: "*"

Resource accounting: The plugin calculates pod resource requests as max(sum(containers), max(init_container)) + overhead. It validates both per-namespace max and a global constraint that sum(all used) + newPod.request <= sum(all min) to prevent overcommitment of guaranteed resources.
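
The accounting rule above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the plugin's Go code: it tracks a single resource dimension (CPU in millicores), and the names Pod, pod_request, and admits are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    containers: list                      # per-container CPU requests (millicores)
    init_containers: list = field(default_factory=list)
    overhead: int = 0                     # pod overhead (millicores)


def pod_request(pod):
    """Effective request: max(sum(containers), max(init_container)) + overhead."""
    steady_state = sum(pod.containers)
    init_peak = max(pod.init_containers, default=0)
    return max(steady_state, init_peak) + pod.overhead


def admits(request, ns_used, ns_max, all_used, all_min):
    """Apply the two checks described above for one resource dimension."""
    within_namespace_max = ns_used + request <= ns_max
    within_guarantees = sum(all_used) + request <= sum(all_min)
    return within_namespace_max and within_guarantees


pod = Pod(containers=[500, 250], init_containers=[1000], overhead=100)
request = pod_request(pod)  # init container dominates: 1000 + 100
ok = admits(request, ns_used=2000, ns_max=6000,
            all_used=[2000, 3000], all_min=[4000, 4000])
```

The init-container term is a maximum rather than a sum because init containers run one at a time before the main containers start.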

Preemption victim selection:

  • If preemptor_quota.used + request <= preemptor_quota.min: victims chosen from namespaces where used > min (borrowed resources).
  • If preemptor_quota.used + request > preemptor_quota.min: victims chosen from the same namespace with lower priority.
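
The two victim-selection rules above can be expressed as a small Python sketch. It is a simplification with hypothetical names (Quota, RunningPod, candidate_victims), covers one resource dimension, and leaves out any further filtering or ordering the real plugin may apply.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    namespace: str
    min: int
    used: int


@dataclass
class RunningPod:
    name: str
    namespace: str
    priority: int


def candidate_victims(preemptor_ns, preemptor_priority, request, quotas, pods):
    """Return the pods eligible as preemption victims under the two rules."""
    own = quotas[preemptor_ns]
    if own.used + request <= own.min:
        # Still within the guarantee: reclaim from namespaces that borrowed.
        borrowers = {
            q.namespace
            for q in quotas.values()
            if q.namespace != preemptor_ns and q.used > q.min
        }
        return [p for p in pods if p.namespace in borrowers]
    # Over the guarantee: only lower-priority pods in the same namespace.
    return [
        p for p in pods
        if p.namespace == preemptor_ns and p.priority < preemptor_priority
    ]


quotas = {
    "a": Quota("a", min=4, used=1),
    "b": Quota("b", min=4, used=6),   # borrowing 2 above its min
    "c": Quota("c", min=4, used=3),
}
pods = [
    RunningPod("b-1", "b", 10),
    RunningPod("c-1", "c", 10),
    RunningPod("a-1", "a", 1),
    RunningPod("a-2", "a", 100),
]
cross = [p.name for p in candidate_victims("a", 50, 2, quotas, pods)]
local = [p.name for p in candidate_victims("a", 50, 4, quotas, pods)]
```

With a request of 2, namespace "a" stays within its min and may target the borrower "b"; with a request of 4 it would exceed its min and is limited to its own lower-priority pod.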

Known limitation - cross-node preemption: The scheduler can only preempt pods on a single node to make room for a new pod. If borrowed resources are fragmented across nodes, the preemptor may fail to schedule even though enough total resources exist.
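
A toy Python example makes the fragmentation case concrete. The node names and CPU figures are made up for illustration, and can_preempt_single_node is a hypothetical helper, not a scheduler API.

```python
def can_preempt_single_node(request, reclaimable_by_node):
    """True if evicting borrowers on any ONE node frees at least `request`."""
    return any(free >= request for free in reclaimable_by_node.values())


# CPUs that could be freed on each node by evicting borrowing pods.
fragmented = {"node-a": 1, "node-b": 1}
consolidated = {"node-a": 2, "node-b": 0}

total_fragmented = sum(fragmented.values())
fits_fragmented = can_preempt_single_node(2, fragmented)
fits_consolidated = can_preempt_single_node(2, consolidated)
```

Both layouts hold 2 reclaimable CPUs in total, but only the consolidated one lets a 2-CPU pod schedule, because preemption is evaluated one node at a time.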

Evidence, comparisons, and related context

| Approach | Architecture | Maturity | Namespace Quota | Hierarchical Quota | Replaces Default Scheduler? | Key Tradeoff |
|---|---|---|---|---|---|---|
| CapacityScheduling | Scheduler plugin | Sample | Min/Max elastic | Flat only | Optional | Simple, but no cross-node preemption; sample maturity |
| YuniKorn | Replacement scheduler | Apache TLP | Min/Max + hierarchical queues | Yes | Yes | Best performance in benchmarks; replaces kube-scheduler |
| Volcano | Standalone batch system | CNCF incubating | Queue-based | Queue hierarchy | Works alongside | Requires VolcanoJob CRD; not native K8s workloads |
| Kueue | Job admission controller | Beta/GA features | ClusterQueue / LocalQueue | Yes (added post-2022) | No | Native K8s integration; lower operational risk; no gang scheduling initially |
| Koordinator | Scheduler + controller | Production (Alibaba) | Hierarchical elastic | Yes (tree structure) | Yes (koord-scheduler) | Most feature-complete; Alibaba-scale proven |

Benchmarks (Purdue PEARC '24, 384-core cluster, 3 tenants):

  • Utilization: ResourceQuotas = 26%; CapacityScheduling = 77% (3.0×); YuniKorn = 80% (3.1×).
  • Workflow run time vs ResourceQuotas: CapacityScheduling = 2.03–3.44× faster; YuniKorn = 2.43–4.56× faster.
  • Time to guaranteed quota: CapacityScheduling outperformed YuniKorn (3.65× vs 1.11× speedup for tenant2), indicating faster preemption.
  • Fairness: Both schedulers allowed pods to run indefinitely on borrowed resources with no fair-share reclamation mechanism.
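
The utilization multipliers quoted above follow directly from the reported percentages. A quick Python check of the arithmetic, using only the figures listed in this section:

```python
baseline = 26  # utilization (%) with native ResourceQuotas
results = {"CapacityScheduling": 77, "YuniKorn": 80}

# Improvement factor relative to the ResourceQuota baseline, to one decimal.
ratios = {name: round(util / baseline, 1) for name, util in results.items()}
```

This gives 3.0 for CapacityScheduling (77 / 26 ≈ 2.96) and 3.1 for YuniKorn (80 / 26 ≈ 3.08), matching the values in the list.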

Alibaba Cloud extensions:

  • ElasticQuotaTree adds tree-structured quotas with root/children nodes, each binding to namespaces.
  • Supports resourceflavors for node label-based quota binding.
  • kube-queue/max-jobs resource type limits concurrent job execution per quota.

Limitations and critiques

  1. Sample maturity: SIG Scheduling explicitly labels CapacityScheduling as a demonstration/inspiration plugin, not production-grade. Test plans, graduation criteria, and production readiness reviews in KEP #9 are marked TBD.
  2. No cross-node preemption: This is an architectural limitation inherited from the default scheduler. It can leave clusters fragmented and unable to reclaim guaranteed resources.
  3. Fairness gap for over-quota usage: Once a namespace borrows idle resources, there is no mechanism to fairly redistribute those resources among multiple hungry tenants. The PEARC '24 study noted this as an open research question.
  4. Flat namespace model: The upstream plugin supports only one ElasticQuota per namespace with no hierarchy. Large organizations need hierarchical delegation (team → project → user), which requires downstream forks like Koordinator or ACK.
  5. API stability: The CRD is at v1alpha1 with no evidence of a v1beta1 or v1 upstream graduation path. Alibaba Cloud uses its own v1beta1 for ElasticQuotaTree, but that isn't the upstream API.
  6. Default value bug (fixed): An early bug (issue #424, 2022) caused max to default to 0 instead of infinite when unspecified. This was fixed in PR #520 (2023), but it illustrates the plugin's evolving implementation state.

Open questions

  • Will CapacityScheduling ever graduate from "Sample" to Alpha/Beta/Stable, or will Kueue subsume its use cases?
  • Will the ElasticQuota CRD be merged into the native ResourceQuota API as suggested in KEP #9, or will it remain a separate CRD?
  • What is the performance and correctness impact of the cross-node preemption limitation at scale (thousands of nodes, tens of thousands of pods)?
  • How do CapacityScheduling and Kueue interact? Can they be used together, or are they mutually exclusive architectural approaches?

Practical takeaways

  • For experimentation / proof-of-concept: CapacityScheduling is a viable way to explore elastic namespace quotas without replacing the entire scheduler. Enable it via scheduler-plugins and define ElasticQuota objects per namespace.
  • For production multi-tenant batch clusters: Consider Kueue if you want native Kubernetes integration without a custom scheduler, or YuniKorn if you need maximum scheduling performance and are willing to replace kube-scheduler. If you're on Alibaba Cloud, use ACK's built-in CapacityScheduling + ElasticQuotaTree.
  • For hierarchical quotas: The upstream plugin doesn't support tree-structured quotas. Use Koordinator or Alibaba Cloud's ElasticQuotaTree if you need parent/child quota delegation.
  • Mitigation for cross-node preemption: Keep sum(all ElasticQuota min) strictly less than total cluster capacity (as recommended in KEP #9) to reduce the chance of fragmented borrowed resources.
  • Monitoring: Track namespace-level used vs min vs max via the ElasticQuota status fields, and alert when namespaces are chronically blocked from reclaiming their guaranteed resources.
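
The mitigation and monitoring items above reduce to a few comparisons. The Python sketch below assumes the used, min, and max values have already been collected (for example from ElasticQuota objects) into plain numbers for one resource. The function names and the pending_request input are hypothetical and do not correspond to any plugin API.

```python
def guaranteed_headroom(mins, cluster_capacity):
    """Capacity left after all guarantees; should stay positive per KEP #9."""
    return cluster_capacity - sum(mins.values())


def quota_state(used, min_, max_):
    """Classify a namespace by where its usage sits relative to min and max."""
    if used >= max_:
        return "at-max"
    if used > min_:
        return "borrowing"
    return "within-min"


def is_blocked(used, min_, pending_request):
    """True if pending work would still fit inside the guarantee.

    A namespace in this state for a long time is a candidate for an alert.
    """
    return pending_request > 0 and used + pending_request <= min_


headroom = guaranteed_headroom({"team-a": 4, "team-b": 4}, cluster_capacity=12)
states = [quota_state(u, 4, 6) for u in (2, 5, 6)]
blocked = is_blocked(used=1, min_=4, pending_request=2)
```

How long a namespace must remain blocked before alerting is a policy choice; the sketch only identifies the condition at a single point in time.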

Sources used

  • KEP 9 - Capacity Scheduling (scheduler-plugins) - https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md
  • Capacity Scheduling Plugin Docs - https://scheduler-plugins.sigs.k8s.io/docs/plugins/capacity-scheduling/
  • CapacityScheduling pkg README - https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/capacityscheduling/README.md
  • DeepWiki Architecture Overview - https://deepwiki.com/kubernetes-sigs/scheduler-plugins/3.3-capacityscheduling-plugin
  • Kubernetes Scheduling: the scheduler-plugins project (Aleskandro) - https://www.aleskandro.com/posts/kubernetes-scheduler-p3-plugins/
  • Evaluation of Kubernetes Schedulers for a Community Cloud Computing Model (ACM PEARC '24) - https://dl.acm.org/doi/fullHtml/10.1145/3626203.3670520
  • GitHub Issue #424 - ElasticQuota default max=0 bug - https://github.com/kubernetes-sigs/scheduler-plugins/issues/424
  • Batch Scheduling on Kubernetes: Comparing Apache YuniKorn, Volcano.sh, and Kueue (InfraCloud) - https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/
  • Alibaba Cloud ACK - ElasticQuotaTree documentation - https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/elasticquotatree
  • Kueue - Kubernetes-native Job Queueing - https://kueue.sigs.k8s.io/
  • Introducing Kueue (Kubernetes Blog, Oct 2022) - https://kubernetes.io/blog/2022/10/04/introducing-kueue/
  • Koordinator - Hierarchical Elastic Quota - https://koordinator.sh/docs/user-manuals/capacity-scheduling