Kubernetes CapacityScheduling / ElasticQuota: Namespace-Level Capacity Reservation
Bottom line
Kubernetes CapacityScheduling is an out-of-tree scheduler plugin (KEP #9) that implements namespace-level elastic quotas via the ElasticQuota CRD. It lets workloads borrow unused capacity from other namespaces while guaranteeing minimum allocations through pod preemption. It remains at "Sample" maturity - the lowest tier in the scheduler-plugins project - and has a known architectural limitation: no cross-node preemption, which can cause resource fragmentation. Independent benchmarks (Purdue PEARC '24) show it can improve cluster utilization by roughly 3× compared to static ResourceQuota and reduce workflow run times by 2–3.4×, though a competing scheduler (YuniKorn) achieves slightly better results in most metrics. For production use, large organizations (Alibaba Cloud, ByteDance via Koordinator) have extended the concept with hierarchical quota trees and eviction policies. The Kubernetes community is increasingly steering batch quota management toward Kueue, a job-queueing controller that works with the default scheduler rather than replacing it.
Key findings
- Finding: CapacityScheduling implements a min/max elastic quota model per namespace, inspired by Hadoop YARN. When a namespace's usage is below its min, it can preempt pods from namespaces that have borrowed resources above their own min. This is fundamentally different from native ResourceQuota, which only enforces a hard upper bound at admission time.
- Finding: The plugin's maturity is officially rated 💡 Sample ("for demonstrating and inspiring purpose"). This is explicitly not labeled Alpha, Beta, or Stable, signaling that SIG Scheduling doesn't currently recommend it for unmodified production workloads.
- Finding: A controlled academic benchmark on a 384-core cluster found that CapacityScheduling increased utilization from 26% (with native ResourceQuotas) to 77% - comparable to YuniKorn's 80%. CapacityScheduling also delivered faster time-to-quota preemption than YuniKorn, but YuniKorn achieved better overall workflow run times and queue-time fairness.
- Finding: Cross-node preemption isn't supported because the default Kubernetes scheduler lacks this capability. In practice, a pod requiring 2 CPUs may fail to reclaim borrowed resources if those resources are split across multiple nodes, leading to fragmentation. The KEP recommends keeping sum(min) < total cluster capacity to mitigate this.
- Finding: Major cloud providers have forked or extended the concept. Alibaba Cloud ACK implements ElasticQuotaTree (tree-structured hierarchical quotas with v1beta1 API) and ack-kube-queue for job queueing. Koordinator (open-sourced by Alibaba/ByteDance) adds hierarchical quota groups, shared-weight fairness, and automatic pod revocation when used > runtime.
Background
CapacityScheduling originated in the kubernetes-sigs/scheduler-plugins repository under SIG Scheduling. The design, proposed in KEP #9, was motivated by the need to run batch workloads (especially ML/DL training jobs) more efficiently on Kubernetes. Native ResourceQuota enforces limits at pod creation time based on resource requests in the pod spec, which can lead to low utilization when pods claim resources but fail to schedule or fail at runtime. The "ElasticQuota" concept - with min (guaranteed) and max (upper bound) fields - was borrowed directly from the Hadoop YARN Capacity Scheduler.
The ElasticQuota CRD is namespace-scoped and tracked by the CapacityScheduling plugin through three of the scheduler framework's extension points: PreFilter (validate quota max), PostFilter (preemption logic), and Reserve (atomic usage update). The plugin maintains an in-memory cache of quota state and supports two preemption modes:
- Cross-namespace: When the preemptor's usage is ≤ its min, it can reclaim resources from namespaces using more than their min.
- In-namespace: When the preemptor's usage > its min, it can only preempt lower-priority pods within its own namespace.
Current state
As of mid-2026, CapacityScheduling remains in the scheduler-plugins repo with:
- API version: scheduling.x-k8s.io/v1alpha1
- Maturity: 💡 Sample
- Compatibility: Tied to Kubernetes scheduler framework versions; the scheduler-plugins project releases patch versions aligned with k8s client packages (e.g., v0.18.x for k8s 1.18).
The plugin is actively used in production by downstream forks, most notably:
- Alibaba Cloud ACK: Uses a patched scheduler with CapacityScheduling, ElasticQuotaTree, and ack-kube-queue. Their changelogs show ongoing fixes (e.g., Nov 2024: "Fixed elastic quota preemption being triggered even when ElasticQuotaTree was absent"; Jan 2026: "Optimized ElasticQuota Min/Max Guarantee logic").
- ByteDance/Alibaba Koordinator: Implements a more advanced hierarchical elastic quota system with pod revocation, multi-quota-tree support, and webhook validation.
Meanwhile, Kueue (introduced in Kubernetes blog Oct 2022, now a mature SIG Scheduling subproject) has become the community's preferred path for batch quota management. Kueue operates at the job level rather than the pod scheduling level, using the suspend field and mutable scheduling directives to gate job admission while delegating actual pod placement to the default kube-scheduler.
Technical or implementation details
ElasticQuota CRD:
    apiVersion: scheduling.x-k8s.io/v1alpha1
    kind: ElasticQuota
    metadata:
      name: quota1
      namespace: quota1
    spec:
      max:
        cpu: 6
        memory: 12Gi
      min:
        cpu: 4
        memory: 8Gi
Scheduler configuration:
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: default-scheduler
      plugins:
        multiPoint:
          enabled:
          - name: CapacityScheduling
        postFilter:
          enabled:
          - name: CapacityScheduling
          disabled:
          - name: "*"
Resource accounting: The plugin calculates pod resource requests as max(sum(containers), max(init_container)) + overhead. It validates both per-namespace max and a global constraint that sum(all used) + newPod.request <= sum(all min) to prevent overcommitment of guaranteed resources.
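The accounting rule and the two admission checks can be sketched in a few lines of Python. This is an illustrative model only, not the plugin's Go code: the PodSpec type, the function names, and the use of plain millicore integers are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PodSpec:
    """Hypothetical, simplified pod: CPU requests expressed in millicores."""
    containers: list[int]
    init_containers: list[int] = field(default_factory=list)
    overhead: int = 0


def effective_request(pod: PodSpec) -> int:
    """Apply max(sum(containers), max(init_container)) + overhead."""
    steady_state = sum(pod.containers)                 # app containers run together
    init_peak = max(pod.init_containers, default=0)    # init containers run one at a time
    return max(steady_state, init_peak) + pod.overhead


def prefilter_allows(request: int, ns_used: int, ns_max: int,
                     total_used: int, total_min: int) -> bool:
    """Both the per-namespace max check and the global min check must pass."""
    within_namespace_max = ns_used + request <= ns_max
    within_global_min = total_used + request <= total_min
    return within_namespace_max and within_global_min


pod = PodSpec(containers=[500, 250], init_containers=[1000], overhead=100)
request = effective_request(pod)   # max(750, 1000) + 100 = 1100 millicores
allowed = prefilter_allows(request, ns_used=2000, ns_max=6000,
                           total_used=7000, total_min=8000)
print(request, allowed)
```

In this example the pod fits under its namespace max (3100 <= 6000) but fails the global check (8100 > 8000), so the sketch reports it as not admissible without preemption.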
Preemption victim selection:
- If preemptor_quota.used + request <= preemptor_quota.min: victims are chosen from namespaces where used > min (borrowed resources).
- If preemptor_quota.used + request > preemptor_quota.min: victims are chosen from the same namespace with lower priority.
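The two rules above can be expressed as a small candidate filter. The sketch below is a simplified model with hypothetical Quota and Pod types and a single numeric resource; the real plugin also has to respect node fit, pod disruption constraints, and other checks that are deliberately left out here.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    namespace: str
    used: int
    min: int


@dataclass
class Pod:
    name: str
    namespace: str
    priority: int


def candidate_victims(preemptor: Pod, request: int,
                      quotas: dict[str, Quota], running: list[Pod]) -> list[Pod]:
    """Return the pods the preemptor may consider evicting under the two rules."""
    own = quotas[preemptor.namespace]
    if own.used + request <= own.min:
        # Reclaiming guaranteed capacity: look only at namespaces that borrowed.
        borrowers = {q.namespace for q in quotas.values()
                     if q.namespace != preemptor.namespace and q.used > q.min}
        return [p for p in running if p.namespace in borrowers]
    # Already beyond min: only lower-priority pods in the same namespace.
    return [p for p in running
            if p.namespace == preemptor.namespace and p.priority < preemptor.priority]


quotas = {
    "team-a": Quota("team-a", used=2, min=4),
    "team-b": Quota("team-b", used=6, min=4),   # borrowing 2 above its min
}
running = [Pod("a-low", "team-a", 1), Pod("b-1", "team-b", 5), Pod("b-2", "team-b", 5)]
new_pod = Pod("a-new", "team-a", 3)

reclaim = candidate_victims(new_pod, request=2, quotas=quotas, running=running)
in_place = candidate_victims(new_pod, request=3, quotas=quotas, running=running)
print([p.name for p in reclaim], [p.name for p in in_place])
```

With a request of 2, team-a stays within its min and may target team-b's pods; with a request of 3 it would exceed its min, so only its own lower-priority pod is a candidate.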
Known limitation - cross-node preemption: The scheduler can only preempt pods on a single node to make room for a new pod. If borrowed resources are fragmented across nodes, the preemptor may fail to schedule even though enough total resources exist.
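A toy model makes the fragmentation case concrete. The node names and the free/reclaimable split below are invented for illustration; the only behaviour taken from the text is that preemption has to free enough room on one node.

```python
def fits_after_single_node_preemption(request: int,
                                      nodes: dict[str, dict[str, int]]) -> bool:
    """True if some single node can host the pod once its borrowed pods are evicted."""
    return any(n["free"] + n["reclaimable"] >= request for n in nodes.values())


# Two nodes, each running one borrowed 1-CPU pod and with no spare capacity.
nodes = {
    "node-a": {"free": 0, "reclaimable": 1},
    "node-b": {"free": 0, "reclaimable": 1},
}
request = 2  # CPUs needed by the pending pod

cluster_wide = sum(n["free"] + n["reclaimable"] for n in nodes.values())
enough_in_total = cluster_wide >= request
schedulable = fits_after_single_node_preemption(request, nodes)
print(enough_in_total, schedulable)
```

Here the cluster holds 2 reclaimable CPUs in total, yet no single node can offer 2, so the pending pod stays unschedulable in this model.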
Evidence, comparisons, and related context
| Approach | Architecture | Maturity | Namespace Quota | Hierarchical Quota | Replaces Default Scheduler? | Key Tradeoff |
|---|---|---|---|---|---|---|
| CapacityScheduling | Scheduler plugin | Sample | Min/Max elastic | Flat only | Optional | Simple, but no cross-node preemption; sample maturity |
| YuniKorn | Replacement scheduler | Apache TLP | Min/Max + hierarchical queues | Yes | Yes | Best performance in benchmarks; replaces kube-scheduler |
| Volcano | Standalone batch system | CNCF incubating | Queue-based | Queue hierarchy | Works alongside | Requires VolcanoJob CRD; not native K8s workloads |
| Kueue | Job admission controller | Beta/GA features | ClusterQueue / LocalQueue | Yes (added post-2022) | No | Native K8s integration; lower operational risk; no gang scheduling initially |
| Koordinator | Scheduler + controller | Production (Alibaba) | Hierarchical elastic | Yes (tree structure) | Yes (koord-scheduler) | Most feature-complete; Alibaba-scale proven |
Benchmarks (Purdue PEARC '24, 384-core cluster, 3 tenants):
- Utilization: ResourceQuotas = 26%; CapacityScheduling = 77% (3.0×); YuniKorn = 80% (3.1×).
- Workflow run time vs ResourceQuotas: CapacityScheduling = 2.03–3.44× faster; YuniKorn = 2.43–4.56× faster.
- Time to guaranteed quota: CapacityScheduling outperformed YuniKorn (3.65× vs 1.11× speedup for tenant2), indicating faster preemption.
- Fairness: Both schedulers allowed pods to run indefinitely on borrowed resources with no fair-share reclamation mechanism.
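The utilization multipliers follow directly from the reported percentages, as this short check shows. It uses only the three utilization figures quoted above.

```python
# Utilization figures quoted above (percent of the 384-core cluster).
utilization = {"ResourceQuotas": 26, "CapacityScheduling": 77, "YuniKorn": 80}

baseline = utilization["ResourceQuotas"]
improvement = {
    name: round(value / baseline, 1)
    for name, value in utilization.items()
    if name != "ResourceQuotas"
}
print(improvement)  # {'CapacityScheduling': 3.0, 'YuniKorn': 3.1}
```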
Alibaba Cloud extensions:
- ElasticQuotaTree adds tree-structured quotas with root/children nodes, each binding to namespaces.
- Supports resourceflavors for node label-based quota binding.
- kube-queue/max-jobs resource type limits concurrent job execution per quota.
Limitations and critiques
- Sample maturity: SIG Scheduling explicitly labels CapacityScheduling as a demonstration/inspiration plugin, not production-grade. Test plans, graduation criteria, and production readiness reviews in KEP #9 are marked TBD.
- No cross-node preemption: This is an architectural limitation inherited from the default scheduler. It can leave clusters fragmented and unable to reclaim guaranteed resources.
- Fairness gap for over-quota usage: Once a namespace borrows idle resources, there is no mechanism to fairly redistribute those resources among multiple hungry tenants. The ACM study noted this as an open research question.
- Flat namespace model: The upstream plugin supports only one ElasticQuota per namespace with no hierarchy. Large organizations need hierarchical delegation (team → project → user), which requires downstream forks like Koordinator or ACK.
- API stability: The CRD is at v1alpha1 with no evidence of a v1beta1 or v1 upstream graduation path. Alibaba Cloud uses its own v1beta1 for ElasticQuotaTree, but that isn't the upstream API.
- Default value bug (fixed): An early bug (issue #424, 2022) caused max to default to 0 instead of infinite when unspecified. This was fixed in PR #520 (2023), but it illustrates the plugin's evolving implementation state.
Open questions
- Will CapacityScheduling ever graduate from "Sample" to Alpha/Beta/Stable, or will Kueue subsume its use cases?
- Will the ElasticQuota CRD be merged into the native ResourceQuota API as suggested in KEP #9, or will it remain a separate CRD?
- What is the performance and correctness impact of the cross-node preemption limitation at scale (thousands of nodes, tens of thousands of pods)?
- How do CapacityScheduling and Kueue interact? Can they be used together, or are they mutually exclusive architectural approaches?
Practical takeaways
- For experimentation / proof-of-concept: CapacityScheduling is a viable way to explore elastic namespace quotas without replacing the entire scheduler. Enable it via scheduler-plugins and define ElasticQuota objects per namespace.
- For production multi-tenant batch clusters: Consider Kueue if you want native Kubernetes integration without a custom scheduler, or YuniKorn if you need maximum scheduling performance and are willing to replace kube-scheduler. If you're on Alibaba Cloud, use ACK's built-in CapacityScheduling + ElasticQuotaTree.
- For hierarchical quotas: The upstream plugin doesn't support tree-structured quotas. Use Koordinator or Alibaba Cloud's ElasticQuotaTree if you need parent/child quota delegation.
- Mitigation for cross-node preemption: Keep sum(all ElasticQuota min) strictly less than total cluster capacity (as recommended in KEP #9) to reduce the chance of fragmented borrowed resources.
- Monitoring: Track namespace-level used vs min vs max via the ElasticQuota spec (min, max) and status (used) fields, and alert when namespaces are chronically blocked from reclaiming their guaranteed resources.
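As a starting point for the monitoring takeaway, the sketch below classifies one ElasticQuota object by comparing used against min and max for a single resource. It assumes the object has already been fetched and decoded into a Python dict, that usage is reported under status.used, and that CPU quantities are either whole/decimal cores or millicore strings; the function names and state labels are hypothetical.

```python
def parse_cpu(quantity) -> int:
    """Convert a CPU quantity such as 4, "4", "1.5" or "500m" to millicores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1])
    return round(float(text) * 1000)


def quota_state(eq: dict, resource: str = "cpu") -> str:
    """Classify a decoded ElasticQuota as within-min, borrowing, or at-max."""
    used = parse_cpu(eq.get("status", {}).get("used", {}).get(resource, 0))
    minimum = parse_cpu(eq["spec"].get("min", {}).get(resource, 0))
    maximum = eq["spec"].get("max", {}).get(resource)  # may be unset (no upper bound)
    if maximum is not None and used >= parse_cpu(maximum):
        return "at-max"
    if used > minimum:
        return "borrowing"
    return "within-min"


# Same min/max values as the quota1 example earlier; the status is invented.
quota1 = {
    "metadata": {"name": "quota1", "namespace": "quota1"},
    "spec": {"max": {"cpu": 6}, "min": {"cpu": 4}},
    "status": {"used": {"cpu": "5"}},
}
print(quota_state(quota1))  # borrowing
```

An alert for "chronically blocked" namespaces would additionally need pending-pod data over time, which this sketch does not model.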
Sources used
- KEP 9 - Capacity Scheduling (scheduler-plugins) - https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md
- Capacity Scheduling Plugin Docs - https://scheduler-plugins.sigs.k8s.io/docs/plugins/capacity-scheduling/
- CapacityScheduling pkg README - https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/capacityscheduling/README.md
- DeepWiki Architecture Overview - https://deepwiki.com/kubernetes-sigs/scheduler-plugins/3.3-capacityscheduling-plugin
- Kubernetes Scheduling: the scheduler-plugins project (Aleskandro) - https://www.aleskandro.com/posts/kubernetes-scheduler-p3-plugins/
- Evaluation of Kubernetes Schedulers for a Community Cloud Computing Model (ACM PEARC '24) - https://dl.acm.org/doi/fullHtml/10.1145/3626203.3670520
- GitHub Issue #424 - ElasticQuota default max=0 bug - https://github.com/kubernetes-sigs/scheduler-plugins/issues/424
- Batch Scheduling on Kubernetes: Comparing Apache YuniKorn, Volcano.sh, and Kueue (InfraCloud) - https://www.infracloud.io/blogs/batch-scheduling-on-kubernetes/
- Alibaba Cloud ACK - ElasticQuotaTree documentation - https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/elasticquotatree
- Kueue - Kubernetes-native Job Queueing - https://kueue.sigs.k8s.io/
- Introducing Kueue (Kubernetes Blog, Oct 2022) - https://kubernetes.io/blog/2022/10/04/introducing-kueue/
- Koordinator - Hierarchical Elastic Quota - https://koordinator.sh/docs/user-manuals/capacity-scheduling