Senior Engineer - HPC Operations

Core42 | Abu Dhabi | 02/06/2026

Job Description

About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally. Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs.

With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The opportunity

We are seeking a highly skilled Senior Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities will extend to collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Your key responsibilities

Lead the daily operational support of HPC infrastructure including compute, storage, networking, and scheduler components (Slurm, Kubernetes, etc.).
Lead efforts to maximize the efficiency and performance of HPC systems, ensuring optimal resource utilization and minimal downtime.
Act as the primary technical escalation point for L2 support teams and ensure prompt resolution of incidents and service requests.
Monitor system health, performance, and utilization using advanced tools (e.g., Prometheus, Grafana, DCGM).
Manage user environments for AI/ML workloads including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
Implement and manage job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure fairness and efficiency.
Lead root cause analysis (RCA) of operational issues and contribute to post-mortem documentation and continuous improvement efforts.
Provide mentorship and guidance to junior engineers and participate in on-call rotation if required.
Ensure compliance with security and operational policies; assist in audits and documentation for change and incident management processes.

Qualifications:

What we're looking for

(a) Required skills / qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or related technical field.
7+ years of experience in HPC operations, systems engineering, or DevOps roles.
Advanced knowledge and expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
Expert knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
Experience with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills (Python, Bash, Ansible, Terraform).
In-depth understanding of Linux (RHEL/CentOS/Ubuntu), networking concepts (RDMA, InfiniBand, RoCE), and storage technologies (NFS, Lustre, Ceph).

What working at Core42 offers

With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative and collaborative environment. Our culture is grounded in trust, accountability and high performance. We are united by our values: Grit, where we overcome challenges with resilience and determination; Passion, which drives us to pursue excellence in everything we do; and Impact, as we aim to inspire progress and create meaningful change.

Our team members thrive in an environment where each person's contributions propel us forward, and together, we commit to achieving extraordinary results.

Competitive Salary: We offer an attractive salary package based on your skills and experience.
Yearly Bonus: In recognition of your contributions, you will receive a performance-based annual bonus.
Exclusive Discount Cards: Access special benefits with Esaad and Fazaa cards, offering discounts across a wide range of services.
Premium Family Insurance: We provide comprehensive health coverage, including dental, vision and life insurance, ensuring the well-being of you and your family.
Learning & Development: We offer access to top-tier learning platforms to help you grow in your career. Learn at your own pace with unlimited access to premium courses.
