In environments with many GPU users, the lack of unified GPU allocation and management often leads to imbalanced usage, idle resources, and contention, resulting in low overall utilization. Moreover, when multiple developers share a single machine, differences in environments and workflows can cause conflicts and inefficiencies.
Kube Manager, as a container-based cloud platform, provides unified scheduling and allocation of large-scale CPU and GPU resources. It enables dynamic, on-demand GPU allocation to prevent resource imbalance, supports GPU sharing (allowing multiple users to utilize a single GPU card concurrently), and eliminates contention. Leveraging containerization, Kube Manager ensures environment isolation so that multiple users can share the same machine without conflicts.
Kube Manager comprises six core modules: Resource Monitoring, Application Management, Image Management, File Management, User Management, and Resource Quotas.
Resource Quotas
- Quickly select the required CPU, memory, GPU, and VRAM when creating an application.
- Enforced quotas prevent overuse; administrators can adjust quotas as needed.
- Allocation granularity: CPU in steps of 1‰ of a core (one millicore); memory in steps of 1 MB.

Application Management
- Simplified operations: select an image and resources to create an application in minutes.
- Advanced options include environment variables and custom startup commands.
- Automatic NFS mounting provides persistent storage and fast file access.
- GPU options include full-GPU allocation, MIG (Multi-Instance GPU) partitioning, and percentage-based allocation.
- Percentage-based allocation is as fine-grained as 1% of GPU compute or 0.25 GB of VRAM.

Resource Monitoring
- Monitoring at the cluster, node, user, and container level.
- Detailed visualization of CPU, memory, network, and GPU usage.
- Customizable dashboards for advanced monitoring.

File Management
- Files are stored persistently on NFS and are unaffected by the container lifecycle.
- Files uploaded via the web interface are automatically mounted into containers for seamless access.

Image Management
- NVIDIA AI Enterprise (NVAIE): an end-to-end, cloud-native suite of pretrained AI models and data analytics tools.
- Certified by NVIDIA with global enterprise support, ensuring rapid AI deployment. Note: commercial usage requires a separate license purchase.
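Since Kube Manager is container-based, the allocation features above can be pictured as a standard Kubernetes pod spec. This is a minimal sketch, not Kube Manager's actual manifest format: Kubernetes natively expresses CPU in millicores (matching the 1‰-of-a-core granularity) and memory in mebibytes, and `nvidia.com/gpu` is the resource name exposed by NVIDIA's Kubernetes device plugin. The application name, image, NFS server, and mount path are all hypothetical.

```yaml
# Sketch: how a Kube Manager application might map onto a Kubernetes pod.
apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook                           # hypothetical application name
spec:
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.01-py3  # example image from a registry
      env:
        - name: MODE                           # "advanced options": env variables
          value: "train"
      command: ["jupyter", "lab"]              # custom startup command
      resources:
        requests:
          cpu: "1500m"      # 1.5 cores, requested in 1‰-of-a-core (millicore) steps
          memory: "2048Mi"  # memory requested at MB-level granularity
        limits:
          nvidia.com/gpu: 1 # full-GPU allocation mode
      volumeMounts:
        - name: userdata
          mountPath: /workspace
  volumes:
    - name: userdata                      # auto-mounted NFS share: files persist
      nfs:                                # across container restarts
        server: nfs.example.internal      # hypothetical NFS server
        path: /exports/user01
```

The NFS volume is what decouples user files from the container lifecycle: deleting or recreating the pod leaves `/exports/user01` untouched.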
Kube Manager provides diverse GPU scheduling options—including full GPU, MIG partitioning, and percentage-based allocation—allowing the department to unify resource management across servers, resolve conflicts, and improve utilization.
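The three GPU scheduling modes could be expressed as container resource limits along these lines. The full-GPU and MIG resource names (`nvidia.com/gpu`, `nvidia.com/mig-1g.5gb`) are real names advertised by NVIDIA's Kubernetes device plugin (the latter under its "mixed" MIG strategy); the percentage-based resource names are purely hypothetical placeholders, since Kube Manager's actual names are not documented here.

```yaml
# (a) Full GPU: one whole card.
resources:
  limits:
    nvidia.com/gpu: 1

# (b) MIG partitioning: a 1g.5gb hardware slice of an A100/H100-class GPU
#     (resource name from NVIDIA's device plugin, "mixed" MIG strategy).
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1

# (c) Percentage-based sharing: hypothetical resource names standing in for
#     Kube Manager's own; 10% of compute plus 4 GB of VRAM.
resources:
  limits:
    example.com/gpu-percent: "10"
    example.com/gpu-vram-gb: "4"
```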
With Kube Manager, faculty and students met diverse resource requirements: GPUs ran at near-full utilization across more than ten simultaneous instances, boosting compute efficiency and task throughput and shortening model development cycles.
Kube Manager enforces quotas and resource limits to prevent excessive usage and waste. With queue-based scheduling and use of otherwise idle periods (nights and weekends), the institute improved both resource efficiency and model-training throughput.
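Quota enforcement of this kind can be sketched with a standard Kubernetes `ResourceQuota` object; whether Kube Manager uses this mechanism internally is an assumption, and the namespace and limit values below are illustrative. Note that quotas on extended resources such as GPUs must use the `requests.` prefix.

```yaml
# Sketch: per-team caps on CPU, memory, GPUs, and concurrent pods.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a          # hypothetical per-team namespace
spec:
  hard:
    requests.cpu: "32"             # at most 32 cores requested in total
    requests.memory: "128Gi"       # at most 128 GiB of memory
    requests.nvidia.com/gpu: "4"   # at most 4 full GPUs across the team
    pods: "20"                     # cap on concurrent pods
```

Once the quota is bound to the namespace, any pod whose requests would push the team past these totals is rejected at admission time, which is what makes over-allocation impossible rather than merely discouraged.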