DevOps/ Site Reliability Engineer
Job Description
Job title: DevOps/ Site Reliability Engineer
Location: Dubai, UAE
Reporting to: Head of Dev Sec Ops
About noon
We're building an ecosystem of digital products and services that power everyday life across the Middle East: fast, scalable, and deeply customer-centric. Our mission is to deliver to every door every day. We want to redefine what technology can do in this region, and we're looking for a DevOps/ Site Reliability Engineer who can help us move even faster.
noon's mission: Every door, every day.
What you'll do:
Team noon has some of the fastest, smartest, and hardest-working people we've encountered. With a young, aggressive, and talented team, we're driving major missions forward. As a DevOps/ Site Reliability Engineer at noonpayments, you'll be the backbone of infrastructure stability and performance. You will drive automation, reliability, and observability across mission-critical services, primarily in Azure (VMSS) and GCP (MIG), with a strong emphasis on Terraform, Azure DevOps, Shell scripting, and Datadog.
Your toolkit will be code, not manual clicks. Your playground: production. Your mission: eliminate toil and chase the 9s. You will:
Cloud & Linux Infrastructure
- Administer and tune Linux-based VM workloads (Ubuntu/RHEL) across Azure VMSS and GCP MIG.
- Harden, scale, and monitor VMs for critical payment flows and backend services.
- Define and manage infrastructure using Terraform with modular, reusable patterns.
- Own the infrastructure lifecycle from provisioning to teardown with GitOps principles.
- Build and manage Azure DevOps Pipelines for automated provisioning, deployment, and config drift checks.
- Write and maintain Shell scripts for system bootstrapping, diagnostics, log scraping, and ad-hoc ops automation.
- Build and maintain Datadog monitors, dashboards, and traces.
- Define SLOs/SLIs and drive proactive alerting to detect issues before impact.
- Operate and maintain RabbitMQ clusters for high-throughput messaging.
- Tune and monitor MongoDB instances for latency, failover, and capacity.
- Participate in 24/7 on-call with ownership of reliability, fast mitigation, and RCA.
- Run post-mortems, reduce MTTR, and automate fixes.
- Analyze usage patterns and forecast capacity requirements.
- Identify and fix system bottlenecks, memory leaks, I/O contention, or misconfigurations.
- Partner with product, platform, and security teams to roll out resilient architectures.
- Conduct infrastructure reviews, audits, and chaos testing.
- Maintain detailed runbooks, IaC diagrams, and incident playbooks.
What you'll need:
- 6+ years of experience in DevOps / SRE roles with production ownership.
- Advanced Linux administration and troubleshooting skills.
- Mastery of Terraform, with a deep understanding of state, modules, and secrets management.
- Proven delivery of CI/CD pipelines using Azure DevOps, with a YAML-first mindset.
- Shell scripting ninja: can write, debug, and optimize scripts in Bash/Zsh/sh.
- In-depth monitoring and tracing skills using Datadog, including custom metrics and integrations.
- Experience running and tuning RabbitMQ and MongoDB at scale.
- Familiarity with Azure VMSS, GCP MIG, and VM auto-healing strategies.
- Comfortable with 24/7 on-call, SLOs, SLIs, and an incident-driven culture.
- Bonus: Experience in payment systems or financial-grade uptime environments.
- We're looking for people with high standards, who understand that hard work matters.
- You need to be relentlessly resourceful and operate with a deep bias for action.
- We need people with the courage to be fiercely original.
- noon is not for everyone; readiness to adapt, pivot, and learn is essential.