DevOps Engineer - AI Infrastructure & GPU Orchestration
nexus aidc Dubai
Job Description
Company Description
NEXUS is revolutionizing the data center industry with the first AI-native Data Center Operating System. Addressing the growing complexity of AI-driven workloads and infrastructure, our platform unifies DCIM, APM, FinOps, Kubernetes orchestration, AI workload management, and full-stack observability into one intelligent, real-time system. With cutting-edge predictive intelligence and automated remediation, the platform ensures optimized performance, cost efficiency, and seamless AI deployment. At NEXUS, we are shaping a future with autonomous infrastructure intelligence for smarter, more efficient decisions.
Role Description
This is a full-time hybrid role for a DevOps Engineer specializing in AI Infrastructure and GPU Orchestration. The DevOps Engineer will be responsible for building and maintaining scalable infrastructure, implementing infrastructure as code (IaC), developing automation scripts, streamlining continuous integration workflows, and managing Linux-based systems. The role also involves optimizing GPU clusters, collaborating with software developers, and ensuring high system performance to support innovative AI-driven workloads.
Key Responsibilities
- GPU Workload Orchestration: Design and manage complex Kubernetes environments (EKS, AKS, GKE, or bare metal) specifically tuned for AI/ML workloads, including GPU scheduling, device plugins, and node affinity.
- DCIM Integration: Build and maintain infrastructure pipelines that interface with Data Center Infrastructure Management (DCIM) systems to monitor power, cooling, and hardware health at the rack level.
- Advanced APM & Telemetry: Implement deep Application Performance Monitoring (APM) and observability stacks (Prometheus, Grafana, Datadog) to track GPU utilization, memory bandwidth, and workload latency in real time.
- Infrastructure as Code (IaC): Architect and deploy scalable, multi-cloud and hybrid environments using Terraform or equivalents, ensuring our platform can deploy rapidly into diverse enterprise environments.
- CI/CD for AI Infrastructure: Own the CI/CD pipelines (GitHub Actions, GitLab CI) that deliver our orchestration software, ensuring zero-downtime deployments for mission-critical AI systems.
- Performance Tuning: Work closely with the core engineering team to optimize network routing, storage I/O, and compute resource allocation for heavy AI training and inference workloads.
Qualifications
- Minimum 3-5 years of professional experience in DevOps, SRE, or Infrastructure Engineering, with a strong focus on high-performance computing or AI infrastructure.
- Expert-level skills in Terraform, Ansible, or similar technologies and in CI/CD automation, coupled with strong scripting abilities in Python, Go, or Bash.
- Strong knowledge of Continuous Integration tools (e.g., Jenkins, GitHub Actions, GitLab CI/CD).
- Background in System Administration and expertise in managing multi-OS environments.
- Understanding of GPU clusters and experience handling modern AI workloads.
- Deep, hands-on experience with Kubernetes, specifically managing stateful workloads, custom resource definitions (CRDs), and GPU node provisioning.
- Proven ability to design and implement comprehensive APM and telemetry solutions for complex, distributed systems.
- Understanding of data center operations, including power, thermal management, and hardware-level monitoring.
- Multi-cloud infrastructure experience is a plus.
- Ability to troubleshoot and optimize performance across complex infrastructure.
- Strong problem-solving abilities and a collaborative mindset.