AI / MLOps Engineer (Infrastructure, Monitoring & Deployment)
Job Description
Summary
We are seeking a highly skilled AI / MLOps Engineer to build, deploy, monitor, and manage large-scale AI infrastructure based on HGX H200 nodes. You will play a central role in deploying LLMs, fine-tuning models, automating CI/CD workflows, monitoring model behavior, and maintaining uptime. The role spans infrastructure setup, orchestration, model serving, and operational reliability, and closely supports all aspects of the AI model lifecycle in a production environment.
Key Responsibilities
Operate and manage Kubernetes or OpenShift clusters for multi-node orchestration
Deploy and manage LLMs and other AI models for inference using Triton Inference Server or custom endpoints
Automate CI/CD pipelines for model packaging, serving, retraining, and rollback using GitLab CI or ArgoCD
Set up model and infrastructure monitoring systems (Prometheus, Grafana, NVIDIA DCGM)
Implement model drift detection, performance alerting, and inference logging
Manage model checkpoints, reproducibility controls, and rollback strategies
Track deployed model versions using MLflow or equivalent registry tools
Implement secure access controls for model endpoints and data artifacts
Collaborate with the AI / Data Engineer to integrate and deploy fine-tuning datasets
Ensure high availability, performance, and observability of all AI services in production
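To illustrate the scope of the drift-detection responsibility above, here is a minimal sketch of a Population Stability Index (PSI) check comparing production inference inputs against a training-time baseline. The function names and thresholds are illustrative assumptions, not part of the role's mandated stack.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample (`expected`,
    e.g. training-time feature values) and a live sample (`actual`).
    Values near 0 mean no drift; > 0.2 is a common alerting threshold."""
    # Bin edges derived from the baseline distribution.
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Number of edges at or below x gives the bin index;
            # values beyond the last edge fall into the final bin.
            counts[sum(e <= x for e in edges)] += 1
        # Small epsilon avoids log(0) for empty buckets.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

In practice a check like this would run on a schedule against logged inference inputs, with the PSI value exported to Prometheus so Grafana can alert when it crosses the chosen threshold.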
Required Qualifications
3+ years experience in DevOps, MLOps, or AI/ML infrastructure roles
10+ years of overall experience in solution operations
Proven experience with Kubernetes or OpenShift in production environments; certification preferred
Familiarity with deploying and scaling PyTorch or TensorFlow models for inference
Experience with CI/CD automation tools on OpenShift / Kubernetes
Hands-on experience with model registry systems (e.g., MLflow, Kubeflow)
Experience with monitoring tools (e.g., Prometheus, Grafana) and GPU workload optimization
Strong scripting skills (Python, Bash) and Linux system administration knowledge
Preferred (Bonus) Skills
Experience with Triton Inference Server or NVIDIA AI stack
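For context on the Triton Inference Server item, serving a model there revolves around a per-model `config.pbtxt`. The sketch below shows the general shape for a hypothetical ONNX model; the model name, tensor names, and dimensions are illustrative assumptions, not specifics of this role.

```
# config.pbtxt — hypothetical example model
name: "example_llm"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
# Pin one instance per GPU; tune count for throughput.
instance_group [ { kind: KIND_GPU, count: 1 } ]
```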