| 1. User Management | Supports unified user management through a web portal, including user creation, deletion, modification of user information and status, and batch import; supports multi-dimensional, hierarchical user management by user, organizational structure, unit, and project; pushes messages through multiple channels (in-site messages, email, and system notifications) covering job status, system alerts, account notifications, and more; allocates independent private data storage to each user, with fine-grained ACL-based data sharing within groups, between users, and across groups; adopts role-based permission management with fine-grained access control and hierarchical authorization; includes a dynamic authentication module so that only users with scheduled jobs can log in to compute nodes, and ordinary users cannot access them outside their job periods. |
| 2. Job Management | Supports unified web-based submission, monitoring, termination, data management, and file transfer for three task types: general simulation, GPU-accelerated simulation, and AI training; provides job script creation and template-based configuration, with scripts integrated into the simulation and AI training submission pages; includes built-in web pages for AI tasks covering deep learning model training, hyperparameter tuning, model evaluation, and model format conversion; integrates with visualization tools such as TensorBoard and MindInsight for real-time monitoring of AI training; includes a large-file transfer component supporting batch upload and download of files and folders, and resumable transfer of single files of 100 GB or more; supports job and task templates for standardized, rapid submission. |
| 3. Job Scheduling | Provides unified management and scheduling of general compute nodes and AI accelerator nodes, with resource query, resource control, and job query capabilities across the full process; supports multiple scheduling algorithms: FCFS, fair-share, preemptive scheduling, multi-factor backfill, resource exclusion, and resource reservation; supports CPU binding; is compatible with job scheduling for HPC-AI hybrid computing frameworks; supports containerized scheduling and management of simulation and ML application containers with run, stop, attach, exec, and log operations; supports mixed scheduling of regular and container jobs with unified management and dynamic scheduling. |
| 4. Container Management | Provides unified management of containerized runtime environments, integrated with scheduling; includes an image registry management interface that integrates with private and public registries and supports image upload, download, version management, and permission control; provides a container packaging script parsing interface that parses scripts in a standardized way and automatically packages the application environment, dependencies, and startup parameters; supports container image import, export, and build; enables one-click packaging of container jobs based on image integration scripts, with automatic runtime preparation, resource scheduling integration, and job startup. |
| 5. Resource Monitoring | Provides a resource overview page that visualizes cluster-wide operational status and core load metrics at a glance; supports multi-dimensional node monitoring: CPU utilization, GPU utilization, memory utilization, GPU memory (VRAM) utilization, I/O, page swapping, and network traffic; supports unified software license monitoring with license status visualization and usage statistics; displays server operational status based on rack topology, showing simulation and AI training jobs running on each node at the same time. |
| 6. Operations Management | Interfaces with out-of-band management for all cluster node types through IPMI, RESTful, and SNMP interfaces, and is compatible with general and AI accelerator compute nodes, management and login nodes, and I/O servers; provides centralized hardware status monitoring and comprehensive visualization, with fault alerting, recording, querying, and full lifecycle management; supports remote node control (power on/off, restart, and firmware update), including batch remote operations; provides virtual KVM and a virtual optical drive for AI accelerator nodes; supports web-based remote terminal connections to all nodes for remote debugging and operations. |
| 7. Resource Statistics | Generates multi-dimensional statistical reports and visualizations from cluster monitoring, job execution, and user operation data; supports job-level analysis: job count, job throughput, CPU time, and GPU time; provides job ranking statistics by user and by application, with multi-dimensional data filtering and export. |
| 8. Resource Billing | Supports billing for multiple resource types: CPU, GPU, memory, and storage; supports flexible rate settings (monthly rental, annual rental, and storage-quota billing), with tiered pricing by cluster, partition, and quality of service; supports multi-dimensional billing statistics by user, organization, unit, and project for fine-grained billing management; supports multi-role tiered pricing for accounts, tenants, and platform administrators, with automatic generation of billing statements. |
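
To make the Resource Billing row concrete, here is a minimal Python sketch of how tiered pricing by partition and quality of service could be combined with per-user or per-project statistics. It is an illustration under stated assumptions, not the platform's implementation: the rates, multipliers, partition and QoS names, and the `Usage`, `charge`, and `statement` names are all hypothetical, since the table lists the billable resource types and billing dimensions but gives no prices or formulas.

```python
from dataclasses import dataclass

# Hypothetical unit prices. The table names the resource types
# (CPU, GPU, memory, storage) but does not give any rates.
BASE_RATES = {
    "cpu_core_hours": 0.05,
    "gpu_hours": 1.50,
    "mem_gb_hours": 0.005,
    "storage_gb_months": 0.02,
}

# Hypothetical tier multipliers by partition and by quality of service.
PARTITION_FACTOR = {"general": 1.0, "ai": 1.2}
QOS_FACTOR = {"normal": 1.0, "high": 1.5}


@dataclass
class Usage:
    """One usage record, tagged with the dimensions used for statistics."""
    user: str
    project: str
    partition: str
    qos: str
    cpu_core_hours: float = 0.0
    gpu_hours: float = 0.0
    mem_gb_hours: float = 0.0
    storage_gb_months: float = 0.0


def charge(u: Usage) -> float:
    """Price a record: sum of rate * quantity, scaled by the tier factors."""
    base = sum(rate * getattr(u, name) for name, rate in BASE_RATES.items())
    return base * PARTITION_FACTOR[u.partition] * QOS_FACTOR[u.qos]


def statement(records: list[Usage], dimension: str) -> dict[str, float]:
    """Total the charges per value of one dimension, e.g. 'user' or 'project'."""
    totals: dict[str, float] = {}
    for r in records:
        key = getattr(r, dimension)
        totals[key] = totals.get(key, 0.0) + charge(r)
    return totals


records = [
    Usage("alice", "p1", "ai", "high", cpu_core_hours=100, gpu_hours=10),
    Usage("bob", "p1", "general", "normal", cpu_core_hours=200),
]
by_user = statement(records, "user")
by_project = statement(records, "project")
```

Keeping the tier factors separate from the base rates means a partition or QoS price change touches one table entry, and the same `statement` function serves every billing dimension the row lists. A production system would likely use exact decimal arithmetic rather than floats.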