I'm Arafath Mohammed — an AI/ML Platform & DevOps Engineer with deep Site Reliability experience, operating GPU fleets, distributed-training infrastructure, and Kubernetes at scale. I work end-to-end across CI/CD, observability, incident response, and SLO design — with a specialty in AI infrastructure: CUDA / PyTorch runtimes, distributed-training observability, checkpoint & recovery systems, and multi-tenant GPU scheduling.
Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies and multi-window multi-burn-rate alerting that gate releases without blocking velocity. Authored 40+ postmortems with 92% of follow-ups closed within 30 days.
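As a rough illustration of the multi-window multi-burn-rate idea mentioned above, here is a minimal Python sketch. The window pairs and thresholds follow the commonly cited pattern for a 30-day SLO window from the Google SRE Workbook; they are an assumption for illustration, not necessarily the values used on the platforms described here. The names `BurnRateRule`, `burn_rate`, and `firing_alerts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that shows the burn is significant
    short_window: str  # window that shows the burn is still happening
    threshold: float   # burn-rate multiple that triggers the rule
    severity: str


# Assumed thresholds (30-day SLO window, SRE Workbook pattern).
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_alerts(error_ratios: dict, slo_target: float) -> list:
    """Return the rules whose long AND short windows both exceed the threshold.

    error_ratios maps a window name (e.g. "1h") to the observed error ratio.
    """
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(f"{rule.severity}:{rule.long_window}/{rule.short_window}")
    return fired
```

Requiring both windows to exceed the threshold is what keeps the alert from paging on a brief spike (long window stays quiet) or from lingering after recovery (short window drops back), which is how this style of alerting can gate releases without constant false alarms.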
Built multi-layer observability across host, NIC/RDMA, NCCL, container, framework, and scheduler using Prometheus, Mimir, Grafana, OpenTelemetry, and DCGM. Reduced GPU and fabric failure MTTD from ~18 minutes to under 90 seconds.
Operated 1,200+ node Kubernetes fleets across EKS and GKE with mixed CPU/GPU workloads — namespace isolation, resource quotas, NetworkPolicies, NVIDIA GPU Operator, MIG partitioning, Karpenter-managed spot scaling.
Led incident command for 65+ Sev-1/Sev-2 events including a multi-region NCCL deadlock that stalled a 1,024-GPU run, etcd corruption affecting a control plane, and an upstream object-storage outage. Drove every one to systemic fixes that prevented recurrence.
Built a checkpoint-and-recovery framework with async sharded writes to S3, cross-region replication, integrity verification, and automatic resume on hardware failure. Recovered a seven-figure number of GPU-hours over 12 months and cut job restart cost by 71%.
Hardened multi-tenant isolation for LoRA-based co-scheduled fine-tuning workloads using Kubernetes namespaces, NVIDIA MIG partitioning, NetworkPolicies, gVisor sandboxing, and per-tenant quota enforcement. Raised GPU utilization from 48% to 79% with zero cross-tenant data incidents.
Production code in Python, Go, and Bash, including a Go-based GPU health-check daemon (DCGM + nvidia-smi + dmesg, auto-cordon, Jira bundle filing) that eliminated ~85% of silent bad-GPU stalls. Also built chaos-engineering toolkits, Prometheus exporters, and Terraform/OPA policy guardrails.
Deep operational experience with EKS, GKE, AKS, EC2, S3, IAM, VPC, Lambda, CloudWatch, plus Terraform, Helm, Argo CD/Workflows, and Karpenter. Comfortable with the unglamorous parts: IAM least-privilege, network isolation, secrets management, compliance audits.
Less focused on merely "using" LLMs and more interested in operating them: routing, observability, cost discipline, and the infrastructure underneath.
Goal: an honest reference architecture for monitoring a large AI training fleet end-to-end — node health → fabric → scheduler → framework.
Same energy applied to system design and AI-infra interview prep — distributed training failure modes, checkpoint design, multi-tenant GPU isolation.
Multi-region NCCL deadlocks, collective-comm failures, fabric flapping. Built preflight collective checks that catch fabric issues before a job starts wasting GPU-hours.
Sharded writes to S3 with cross-region replication, integrity verification, and automatic resume. Recovered a seven-figure number of GPU-hours of training progress over 12 months.
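The integrity-verification and automatic-resume steps can be sketched roughly as follows. This is a simplified illustration, not the framework itself: it writes shards to a local directory instead of S3, runs synchronously, and assumes a hypothetical layout of one zero-padded `step-NNNNNN` directory per checkpoint with a `manifest.json` of SHA-256 digests.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def write_sharded_checkpoint(data: bytes, directory: Path, shard_size: int) -> Path:
    """Split data into shards, then write a manifest of per-shard digests."""
    directory.mkdir(parents=True, exist_ok=True)
    shards = []
    for index, offset in enumerate(range(0, len(data), shard_size)):
        chunk = data[offset:offset + shard_size]
        name = f"shard-{index:05d}.bin"
        (directory / name).write_bytes(chunk)
        shards.append({"name": name, "sha256": hashlib.sha256(chunk).hexdigest()})
    # The manifest is written last, so an interrupted write leaves no manifest
    # and the checkpoint is treated as invalid.
    manifest = directory / "manifest.json"
    manifest.write_text(json.dumps({"shards": shards}))
    return manifest


def verify_checkpoint(directory: Path) -> bool:
    """Return True only if every shard exists and matches its recorded digest."""
    manifest = directory / "manifest.json"
    if not manifest.exists():
        return False
    for shard in json.loads(manifest.read_text())["shards"]:
        path = directory / shard["name"]
        if not path.exists():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != shard["sha256"]:
            return False
    return True


def latest_valid_checkpoint(root: Path) -> Optional[Path]:
    """Pick the newest checkpoint that passes verification, or None.

    Assumes directory names sort chronologically (zero-padded step numbers).
    """
    candidates = sorted((p for p in root.iterdir() if p.is_dir()), reverse=True)
    for candidate in candidates:
        if verify_checkpoint(candidate):
            return candidate
    return None
```

On restart, a job would call `latest_valid_checkpoint` and resume from whatever it returns, so a corrupted or half-written newest checkpoint falls back to the previous good one instead of failing the resume.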
Go daemon scraping DCGM, nvidia-smi, and kernel dmesg for XID faults, ECC double-bit errors, and PCIe link degradation across the CUDA / cuDNN / NCCL stack. Auto-cordons unhealthy nodes — eliminated ~85% of silent stalls.
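The dmesg side of such a health check can be sketched as below. The daemon described above is written in Go; this illustration uses Python and covers only XID parsing. The regular expression targets the `NVRM: Xid (PCI:...): <code>` form that the NVIDIA driver writes to the kernel log, and the set of XIDs treated as cordon-worthy is an assumed policy for illustration, not the daemon's actual rule set.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches e.g. "NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, ..."
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)

# Assumed policy: XIDs that usually indicate a hardware problem.
CORDON_XIDS = {
    48: "double-bit ECC error",
    74: "NVLink error",
    79: "GPU has fallen off the bus",
}


class XidEvent(NamedTuple):
    pci_address: str
    xid: int


def parse_xid(line: str) -> Optional[XidEvent]:
    """Extract the PCI address and XID code from one kernel log line."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(match.group("pci"), int(match.group("xid")))


def cordon_reasons(lines: Iterable[str]) -> dict:
    """Map each affected PCI address to the reason its node should be cordoned."""
    reasons = {}
    for line in lines:
        event = parse_xid(line)
        if event is not None and event.xid in CORDON_XIDS:
            reasons[event.pci_address] = CORDON_XIDS[event.xid]
    return reasons
```

A non-empty result would be the trigger for cordoning the node (for example with `kubectl cordon`) and attaching the matching log lines to a ticket. XIDs outside the policy set are ignored here because many of them are caused by the application rather than the hardware.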
Manage CUDA, cuDNN, and NCCL stack compatibility across compute nodes; support PyTorch and TensorFlow workloads with containerized runtimes (Docker, Singularity). Triage version-mismatch, OOM, and accelerator-aware scheduling issues end-to-end.
Production PyTorch serving with FastAPI, lazy model loading, auto-unload, bfloat16 + greedy decoding. Built API-key auth, HIPAA-aware design, and systemd-managed lifecycle for medical-domain inference on a Tesla T4.
Plugin-based agent architectures across Claude, Gemini, and DeepSeek — each model chosen for what the task actually needs (Sonnet for constraint-following, Pro for context, Flash for high-volume extraction).
RAG pipelines with ChromaDB for grounded Q&A, source attribution, and incremental indexing — chosen over fine-tuning where the corpus grows continuously.
Co-scheduled LoRA workloads on shared GPU pools with MIG partitioning, gVisor sandboxing, and per-tenant quota enforcement. Lifted utilization from 48% to 79%.
Lightweight classifiers that route queries to Haiku vs Sonnet by complexity — keeps latency and cost down on simple lookups while reserving the bigger model for what actually needs it.
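The routing idea can be sketched as follows. The hand-written heuristic here is a stand-in for a trained lightweight classifier, and the marker list, weights, and threshold are all illustrative assumptions. The function returns a tier label ("haiku" or "sonnet") rather than a concrete model identifier.

```python
import re

# Assumed cues that a query needs multi-step reasoning. Plain substring
# matching is used, so this is deliberately crude.
REASONING_MARKERS = (
    "why", "compare", "explain", "design", "trade-off", "debug", "step by step",
)


def complexity_score(query: str) -> float:
    """Score a query from 0.0 (simple lookup) to 1.0 (complex task)."""
    text = query.lower()
    words = re.findall(r"[\w'-]+", text)
    length_signal = min(len(words) / 50, 1.0)
    marker_hits = sum(marker in text for marker in REASONING_MARKERS)
    marker_signal = min(marker_hits / 2, 1.0)
    code_signal = 1.0 if "traceback" in text or "error:" in text else 0.0
    return 0.4 * length_signal + 0.4 * marker_signal + 0.2 * code_signal


def route(query: str, threshold: float = 0.3) -> str:
    """Send complex queries to the larger tier and everything else to the smaller."""
    return "sonnet" if complexity_score(query) >= threshold else "haiku"
```

For example, `route("What is the capital of France?")` returns `"haiku"`, while `route("Explain why NCCL all-reduce can deadlock and compare ring versus tree algorithms")` returns `"sonnet"`. In practice the threshold would be tuned against observed quality and cost rather than fixed by hand.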
Prometheus-based monitoring architecture for ~200K-GPU AI datacenters. Built to demonstrate collection patterns for multi-DC training fleets.
Plugin-based multi-agent pipeline that turns long-form course transcripts into textbook chapters, flashcards, concept maps, and a grounded AI tutor.
MedGemma 4B and Google MedASR served from a Tesla T4 GCP VM via FastAPI + systemd, with API-key auth and HIPAA-aware design.
A pattern I've seen repeatedly: training jobs hang in ways that look like model bugs but are actually network or scheduler problems. Coming soon.
GPU telemetry isn't just "GPU util." A field guide to ECC errors, NVLink saturation, and the metrics that predict thermal throttling before it happens.
Job-completion SLOs and per-step latency SLOs measure different failure modes. How to pick the right one and what each one tells you about your platform.
Open to AI/ML Platform, GPU Infrastructure, HPC DevOps, SRE, and Cloud Platform roles. Particularly interested in research-driven environments — medical AI, scientific computing, distributed training — where reliable infrastructure directly enables real outcomes. Memphis-based, open to relocation.
Email me →