SYS_STATUS: AVAILABLE
v2.6.0 · MEMPHIS, TN · OPEN TO RELOCATION · UPTIME: Since 2019

AI/ML platform &
DevOps engineering for
GPU and HPC systems.

I'm Arafath Mohammed — an AI/ML Platform & DevOps Engineer with deep Site Reliability experience, operating GPU fleets, distributed-training infrastructure, and Kubernetes at scale. I work end-to-end across CI/CD, observability, incident response, and SLO design — with a specialty in AI infrastructure: CUDA / PyTorch runtimes, distributed-training observability, checkpoint & recovery systems, and multi-tenant GPU scheduling.

arafath@platform-prod ~ %
MTTD reduction 18m → 90s
Sev-1/Sev-2 incidents led 65+
GPU utilization lifted 48 → 79%
Job restart cost cut 71%
Status healthy
01 //

Capabilities — what I bring to a platform team

SRE · Platform · DevOps · AI Infrastructure

Reliability is the product underneath the product.

I work at the layer where distributed systems, Kubernetes, GPU fleets, and AI infrastructure all converge — the place where training jobs, inference services, and developer platforms have to actually stay up. Below are the practice areas I specialize in, with the kind of work I've shipped behind each one.

End-to-end reliability · SLOs

SLOs that balance velocity and stability.

Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies and multi-window multi-burn-rate alerting that gate releases without blocking velocity. Authored 40+ postmortems with 92% of follow-ups closed within 30 days.
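
A minimal sketch of the multi-window, multi-burn-rate check described above. The 14.4x and 6x thresholds, and the 1h/5m and 6h/30m window pairs, are the commonly cited defaults from the Google SRE Workbook for a 30-day SLO window; they are assumptions here, not the values from any production policy of mine. `Window`, `burn_rate`, and `should_page` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Error ratio (bad / total requests) observed over one lookback window."""
    name: str
    error_ratio: float


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float) -> bool:
    """Fire only when BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears quickly after recovery.
    """
    return (burn_rate(long_w.error_ratio, slo_target) >= threshold
            and burn_rate(short_w.error_ratio, slo_target) >= threshold)


SLO = 0.999  # assumed 99.9% availability target

# Fast burn: 1h + 5m windows at 14.4x (about 2% of a 30-day budget per hour).
fast = should_page(Window("1h", 0.02), Window("5m", 0.03), SLO, threshold=14.4)
# Slow burn: 6h + 30m windows at 6x; the short window has already recovered.
slow = should_page(Window("6h", 0.008), Window("30m", 0.001), SLO, threshold=6.0)
print(fast, slow)
```

Requiring both windows is what lets an alert like this gate a release without paging on a blip that has already ended.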

Distributed-training observability

Telemetry across the full training path.

Built multi-layer observability across host, NIC/RDMA, NCCL, container, framework, and scheduler using Prometheus, Mimir, Grafana, OpenTelemetry, and DCGM. Reduced GPU and fabric failure MTTD from ~18 minutes to under 90 seconds.

Kubernetes at scale

Heterogeneous GPU clusters in production.

Operated 1,200+ node Kubernetes fleets across EKS and GKE with mixed CPU/GPU workloads — namespace isolation, resource quotas, NetworkPolicies, NVIDIA GPU Operator, MIG partitioning, Karpenter-managed spot scaling.

Incident response · postmortems

Rapid recovery, blameless review, recurrence kill.

Led incident command for 65+ Sev-1/Sev-2 events including a multi-region NCCL deadlock that stalled a 1,024-GPU run, etcd corruption affecting a control plane, and an upstream object-storage outage. Drove every one to systemic fixes that prevented recurrence.

Checkpoint & recovery systems

Long-running jobs that survive hardware failures.

Built a checkpoint-and-recovery framework with async sharded writes to S3, cross-region replication, integrity verification, and automatic resume on hardware failure. Recovered 7-figure GPU-hours over 12 months and cut job restart cost by 71%.
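
The core idea (sharded writes, a manifest published last, hash verification before resume) can be sketched as follows. This is an illustration under stated assumptions, not the production framework: a local directory stands in for S3, a thread pool stands in for writes that would overlap with training steps, and every name (`save_checkpoint`, `verify_checkpoint`, `latest_valid`, the `step-` directory layout) is hypothetical.

```python
import hashlib
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Optional


def _write_shard(directory: Path, index: int, payload: bytes) -> dict:
    """Write one shard and return its manifest entry (name, size, SHA-256)."""
    name = f"shard-{index:05d}.bin"
    (directory / name).write_bytes(payload)
    return {"name": name, "size": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest()}


def save_checkpoint(directory: Path, shards: List[bytes], step: int) -> Path:
    """Write shards in parallel, then publish the manifest last.

    A checkpoint without a manifest is treated as incomplete, so a crash
    mid-write never leaves something that looks resumable.
    """
    directory.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        entries = list(pool.map(lambda item: _write_shard(directory, *item),
                                enumerate(shards)))
    tmp = directory / "manifest.json.tmp"
    tmp.write_text(json.dumps({"step": step, "shards": entries}))
    manifest = directory / "manifest.json"
    tmp.replace(manifest)  # rename within the same directory
    return manifest


def verify_checkpoint(directory: Path) -> bool:
    """True only if every shard listed in the manifest matches its hash."""
    manifest = directory / "manifest.json"
    if not manifest.exists():
        return False
    for entry in json.loads(manifest.read_text())["shards"]:
        path = directory / entry["name"]
        if not path.exists():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != entry["sha256"]:
            return False
    return True


def latest_valid(root: Path) -> Optional[Path]:
    """Newest checkpoint that passes verification (zero-padded names sort)."""
    for candidate in sorted(root.glob("step-*"), reverse=True):
        if verify_checkpoint(candidate):
            return candidate
    return None


root = Path(tempfile.mkdtemp())
save_checkpoint(root / "step-00000100", [b"a" * 10, b"b" * 10], step=100)
save_checkpoint(root / "step-00000200", [b"c" * 10, b"d" * 10], step=200)
# Simulate a corrupted shard in the newest checkpoint.
(root / "step-00000200" / "shard-00001.bin").write_bytes(b"corrupt")
resume_from = latest_valid(root)
print(resume_from.name)
```

Falling back to the newest checkpoint that verifies, rather than simply the newest one, is what keeps a bad write from turning a node failure into a full restart.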

Multi-tenant isolation

LoRA co-scheduling without compromising separation.

Hardened multi-tenant isolation for LoRA-based co-scheduled fine-tuning workloads using Kubernetes namespaces, NVIDIA MIG partitioning, NetworkPolicies, gVisor sandboxing, and per-tenant quota enforcement. Raised GPU utilization from 48% to 79% with zero cross-tenant data incidents.

Tooling & automation

Software written to solve reliability problems.

Production code in Python, Go, and Bash: a Go-based GPU health-check daemon (DCGM + nvidia-smi + dmesg, auto-cordon, JIRA bundle filing) that eliminated ~85% of silent bad-GPU stalls. Plus chaos-engineering toolkits, Prometheus exporters, and Terraform/OPA policy guardrails.

Cloud at scale

Production services across AWS, GCP, Azure.

Deep operational experience with EKS, GKE, AKS, EC2, S3, IAM, VPC, Lambda, CloudWatch, plus Terraform, Helm, Argo CD/Workflows, and Karpenter. Comfortable with the unglamorous parts: IAM least-privilege, network isolation, secrets management, compliance audits.

02 //

Currently — what's on the bench right now

LLMs & AI Engineering

Studying the internals of modern LLMs: transformer architecture, mechanistic interpretability, and fine-tuning experiments on LLMs and SLMs (small language models). Building systems around them both in the cloud and locally, including multi-agent pipelines with Claude, Gemini, and open-weights models.

I explore ways of "using" LLMs, but I'm more interested in operating them: routing, observability, cost discipline, and the infrastructure underneath.

Observability Research

Mapping out what production-grade GPU observability looks like at hyperscale. DCGM exporters, Prometheus sharding strategies, Mimir long-term storage, SNMP / IPMI / Redfish telemetry from the rack up. Inspired by Colossus-scale (~200K GPU) deployments.

Goal: an honest reference architecture for monitoring a large AI training fleet end-to-end — node health → fabric → scheduler → framework.

Sharpening Fundamentals

Working through senior-level Python — async & concurrency for FastAPI, data structures, exception design — alongside K8s, Terraform, and Go refreshers. The goal isn't pass-the-screen; it's being able to explain why a design choice is right.

Same energy applied to system design and AI-infra interview prep — distributed training failure modes, checkpoint design, multi-tenant GPU isolation.

Looking For

Contract or full-time AI/ML platform, GPU infrastructure, or HPC DevOps roles where the work has weight. Memphis-based, open to relocation or hybrid.

03 //

Work history — prod incidents survived

Nov 2023 → present
AI/ML Platform Engineer
Client: FedEx
Reliability ownership across distributed-training infrastructure and platform services.
  • Designed multi-layer observability across the full training path (host → NIC/RDMA → NCCL → container → framework → scheduler); cut MTTD for GPU and fabric failures from ~18 minutes to under 90 seconds.
  • Operated and maintained GPU-accelerated compute infrastructure supporting AI/ML training and inference workloads.
  • Deployed NVIDIA Tesla T4 GPU environments with CUDA, cuDNN, and NCCL stacks, serving containerized AI models (PyTorch, TensorFlow) via FastAPI with API key authentication and systemd service management.
  • Designed and implemented CI/CD pipelines using Jenkins, GitHub Actions, and Ansible for automated infrastructure provisioning.
  • Managed containerized AI runtimes using Docker and Singularity, ensuring reproducible compute environments with pinned dependencies, validated CUDA compatibility, and consistent GPU driver versions across training and inference nodes.
K8s · DCGM · NCCL · Mimir · OpenTelemetry · Go · AWS
40+ Sev-1/2 led
Jun 2023 → Nov 2023
Site Reliability Engineer
Grant Thornton
Production K8s operations across 6 EKS/GKE clusters supporting 300+ microservices and ML inference.
  • Operated 1,200+ node Kubernetes fleet (mixed CPU/GPU) with a 99.95% availability target; drove on-call IC for 25+ Sev-1/Sev-2 incidents.
  • Closed 92% of postmortem follow-ups within 30 days — failovers, cert-expiry outages, autoscaler thrash.
  • Built modular Terraform + Ansible infrastructure-as-code pipelines for automated provisioning of compute, storage, and networking resources, cutting environment setup from 3 days to 4 hours, used by 20+ engineers.
  • Rolled out automated patch cycles and configuration hardening across 200+ Linux hosts using Ansible playbooks triggered through Jenkins.
  • Implemented an observability stack with Prometheus, Grafana, and Loki for centralized metrics, log aggregation, dashboards, and alerting, tracking scheduler activity, resource utilization, and service-health KPIs across distributed infrastructure.
  • Designed multi-region DR with zero data loss and <5min RTO; reduced manual deployment effort by 70% with Terraform + Ansible.
EKS · GKE · Mimir · Karpenter · Terraform · Splunk
1.2K node fleet
Mar 2021 → Aug 2021
Senior ERC Associate
Amazon
Improved the efficiency and user experience of internal HR tools by providing feedback on system usability. Identified gaps in application performance and collaborated with the product and development teams to suggest actionable improvements.
UAT · Internal tooling · Customer support
500 weekly cases
Jan 2019 → Feb 2021
Site Reliability Engineer
Web Initiate
  • Developed Python backend services deployed on AWS EC2 and Elastic Beanstalk, building REST APIs and automation scripts for enterprise applications supporting high-throughput data processing workflows.
  • Stood up the company's first observability stack (Prometheus, Grafana, Loki, Jaeger); golden-signal dashboards became the team standard.
  • Authored 30+ runbooks and led postmortems for outages including a Redis cluster split-brain and an IAM mis-scoping that exposed an internal endpoint — both shipped guardrails (Terraform + OPA, Redis Sentinel quorum config).
  • Built a Python + Boto3 toolchain enforcing CIS benchmarks, IAM least-privilege analysis, and S3 public-access remediation across 30+ AWS accounts.
EKS · Terraform · ArgoCD · Boto3 · OPA
30+ AWS accounts
Jun 2018 → Dec 2018
Cloud Engineer Intern
Zobibox
Java/Spring Boot microservices, AWS integrations (SQS, RDS, Lambda), and hands-on RHEL system administration. Contributed to Kubernetes-based deployment of banking microservices and supported senior engineers on REST API design, database modeling, and Prometheus/Grafana cluster monitoring.
Spring Boot · AWS · RHEL · K8s
1 yr foundation
04 //

AI/ML specialty — infra that knows what it's running

Distributed training · LLM serving · Multi-agent pipelines

Training failures are infra problems wearing ML clothes.

I read NCCL timeouts and gradient explosions the way backend engineers read stack traces. The same engineer who can debug an etcd quorum issue can debug why a 1,024-GPU run is silently stalled — because both turn out to be problems with timeouts, fabric, and quorum. Below: where SRE meets the AI stack in my work.

Distributed Training
NCCL / RDMA debugging

Multi-region NCCL deadlocks, collective-comm failures, fabric flapping. Built preflight collective checks that catch fabric issues before a job starts wasting GPU-hours.

Checkpoint Systems
Async sharded checkpointing

S3 sharded writes with cross-region replication, integrity verification, and automatic resume. Recovered 7-figure GPU-hours of training progress over 12 months.

GPU Telemetry
CUDA stack + DCGM + XID monitoring

Go daemon scraping DCGM, nvidia-smi, and kernel dmesg for XID faults, ECC double-bit errors, and PCIe link degradation across the CUDA / cuDNN / NCCL stack. Auto-cordons unhealthy nodes — eliminated ~85% of silent stalls.
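
To illustrate the dmesg half of that loop only: the daemon itself is written in Go and also reads DCGM and nvidia-smi, so the Python below is a stand-in sketch. The regex follows the usual `NVRM: Xid (PCI:...)` kernel-log shape, and the `CORDON_XIDS` policy (Xid 48, 74, 79) is an assumed example, not the production rule set.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# NVIDIA driver kernel-log lines look roughly like:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=4021, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+)")

# Assumed policy: Xid codes treated as grounds for cordoning the node.
CORDON_XIDS = {
    48: "double-bit ECC error",
    74: "NVLink error",
    79: "GPU fell off the bus",
}


@dataclass(frozen=True)
class XidEvent:
    pci: str
    code: int


def parse_xid_events(dmesg_text: str) -> List[XidEvent]:
    """Extract (PCI address, Xid code) pairs from raw dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(XidEvent(match.group("pci"), int(match.group("code"))))
    return events


def cordon_reasons(events: List[XidEvent]) -> Dict[str, str]:
    """Map PCI address -> reason for each event that matches the policy."""
    return {e.pci: CORDON_XIDS[e.code] for e in events if e.code in CORDON_XIDS}


SAMPLE = """\
[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4021, GPU has fallen off the bus.
[1240.1] NVRM: Xid (PCI:0000:86:00): 13, pid=4100, Graphics Engine Exception
[1250.9] usb 1-1: new high-speed USB device number 2
"""

events = parse_xid_events(SAMPLE)
reasons = cordon_reasons(events)
print(reasons)
```

A non-empty result is the signal that would trigger the cordon and ticket-filing steps; application-level Xids such as 13 are parsed but deliberately left out of the cordon policy.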

CUDA / PyTorch Runtime
NVIDIA stack & framework support

Manage CUDA, cuDNN, and NCCL stack compatibility across compute nodes; support PyTorch and TensorFlow workloads with containerized runtimes (Docker, Singularity). Triage version-mismatch, OOM, and accelerator-aware scheduling issues end-to-end.

LLM Serving
PyTorch + FastAPI GPU inference

Production PyTorch serving with FastAPI, lazy model loading, auto-unload, bfloat16 + greedy decoding. Built API-key auth, HIPAA-aware design, and systemd-managed lifecycle for medical-domain inference on a Tesla T4.

Multi-Agent Pipelines
Routed multi-LLM systems

Plugin-based agent architectures across Claude, Gemini, and DeepSeek — each model chosen for what the task actually needs (Sonnet for constraint-following, Pro for context, Flash for high-volume extraction).

RAG & Embeddings
ChromaDB & production retrieval

RAG pipelines with ChromaDB for grounded Q&A, source attribution, and incremental indexing — chosen over fine-tuning where the corpus grows continuously.

LoRA / PEFT
Multi-tenant fine-tuning

Co-scheduled LoRA workloads on shared GPU pools with MIG partitioning, gVisor sandboxing, and per-tenant quota enforcement. Lifted utilization from 48% to 79%.

Routing & Cost
Complexity-aware model routing

Lightweight classifiers that route queries to Haiku vs Sonnet by complexity — keeps latency and cost down on simple lookups while reserving the bigger model for what actually needs it.
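
A rough sketch of what complexity-aware routing can look like. The scoring here is a hand-written heuristic with assumed weights and an assumed 0.3 threshold; a trained classifier could sit behind the same `route` interface. The tier names are placeholders to be mapped to real model IDs in configuration.

```python
import re
from dataclasses import dataclass

# Hypothetical tier names; map these to real model IDs in configuration.
SMALL, LARGE = "haiku", "sonnet"

REASONING_HINTS = re.compile(
    r"\b(why|compare|derive|prove|explain|trade-?offs?|step[- ]by[- ]step|debug)\b",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Route:
    model: str
    score: float


def complexity_score(query: str) -> float:
    """Cheap heuristic score in [0, 1]; the weights are assumptions."""
    words = query.split()
    length = min(len(words) / 60.0, 1.0)  # long prompts lean complex
    hints = min(len(REASONING_HINTS.findall(query)) / 2.0, 1.0)
    has_code = 1.0 if ("def " in query or "Traceback" in query) else 0.0
    return round(0.4 * length + 0.4 * hints + 0.2 * has_code, 3)


def route(query: str, threshold: float = 0.3) -> Route:
    """Send the query to the larger model only when the score clears the bar."""
    score = complexity_score(query)
    return Route(model=LARGE if score >= threshold else SMALL, score=score)


simple = route("What is the capital of France?")
hard = route("Explain why the NCCL all-reduce deadlocks and "
             "compare two fixes step by step")
print(simple.model, hard.model)
```

Keeping the score on the returned `Route` makes it easy to log routing decisions and tune the threshold against observed cost and quality.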

05 //

Stack — service inventory

Orchestration & Compute

  • Kubernetes (EKS, GKE, AKS) DAILY
  • Helm & Kustomize DAILY
  • Argo CD / Argo Workflows PROD
  • Karpenter PROD
  • Docker / containerd DAILY
  • Linux / systemd DAILY

Observability

  • Prometheus / Mimir / Thanos DAILY
  • Grafana / Loki / Tempo DAILY
  • OpenTelemetry DAILY
  • NVIDIA DCGM Exporter PROD
  • eBPF (Pixie, Parca) PROD
  • PagerDuty / Sentry DAILY

Cloud Platforms

  • AWS (EKS, EC2, S3, IAM, etc.) DAILY
  • GCP (GKE, Compute, Vertex) DAILY
  • Azure (AKS, AAD) PROD
  • Terraform DAILY
  • CloudFormation PROD
  • OPA / policy-as-code PROD

AI / ML Infrastructure

  • PyTorch DAILY
  • CUDA / cuDNN / NCCL PROD
  • Distributed training (RDMA, MPI) PROD
  • Checkpoint & recovery systems PROD
  • LoRA / PEFT / DPO PROD
  • LLM serving (FastAPI, vLLM) DAILY
  • RAG (ChromaDB, embeddings) DAILY
  • Singularity / Docker containers PROD
  • Slurm / LSF schedulers FAMILIAR
  • Multi-LLM routing DAILY

Languages & Tooling

  • Python DAILY
  • Go PROD
  • Bash DAILY
  • SQL DAILY
  • TypeScript / React PROD
  • gRPC / Kafka PROD

CI/CD & Reliability Practice

  • GitHub Actions / GitLab CI DAILY
  • Jenkins / Azure DevOps PROD
  • SLO / SLI design DAILY
  • Incident command DAILY
  • Postmortem authoring DAILY
  • Chaos / failure injection PROD
06 //

Personal projects — code in the open

2025 · OPEN-SOURCE

xAI-Scale Datacenter Metrics

Prometheus-based monitoring architecture for ~200K-GPU AI datacenters. Built to demonstrate collection patterns for multi-DC training fleets.

  • Cell-based sharding — each Prometheus instance scrapes ~200 servers; fleet scales by adding cells.
  • Full-stack telemetry — DCGM (GPU), node_exporter, IPMI/Redfish (BMC), SNMP (PDUs/UPS/switches), and a custom REST→Prometheus cooling-plant exporter.
  • Mimir-backed global store with object-storage long-term retention.
  • Cardinality discipline — strict label controls; UUIDs and PIDs explicitly excluded.
  • Multi-target IPMI/SNMP relabeling, file_sd target generation from inventory CSV, full Docker Compose demo stack.
Prometheus · Mimir · DCGM · SNMP · IPMI · Python · Docker
View repo →
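
The cell-sharding and file_sd generation steps listed for this project can be sketched roughly as below. The CSV column names (`hostname`, `datacenter`, `rack`) and the function names are assumptions for illustration and may not match the repo; port 9400 is the DCGM exporter default, and the output follows Prometheus's file-based service discovery format (a list of `targets` plus `labels` groups).

```python
import csv
import io
import json
from itertools import islice

CELL_SIZE = 200  # servers per Prometheus instance, per the design above


def read_inventory(csv_text):
    """Parse inventory rows; the column names are assumed."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def chunk(rows, size):
    """Yield consecutive batches of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def build_cells(rows, port=9400, cell_size=CELL_SIZE):
    """Return {cell_name: file_sd target groups}, one cell per Prometheus.

    Only low-cardinality labels (datacenter, rack) are attached, in keeping
    with the cardinality rules above.
    """
    ordered = sorted(rows, key=lambda r: (r["datacenter"], r["rack"], r["hostname"]))
    cells = {}
    for i, batch in enumerate(chunk(ordered, cell_size)):
        groups = {}
        for row in batch:
            key = (row["datacenter"], row["rack"])
            groups.setdefault(key, []).append(f'{row["hostname"]}:{port}')
        cells[f"cell-{i:04d}"] = [
            {"targets": targets, "labels": {"datacenter": dc, "rack": rack}}
            for (dc, rack), targets in groups.items()
        ]
    return cells


# Synthetic inventory: 450 hosts, 50 per rack, in one datacenter.
lines = ["hostname,datacenter,rack"]
lines += [f"gpu-{n:04d},dc1,r{n // 50:02d}" for n in range(450)]
cells = build_cells(read_inventory("\n".join(lines)))

sizes = {name: sum(len(g["targets"]) for g in groups)
         for name, groups in cells.items()}
print(sizes)
print(json.dumps(cells["cell-0002"][0]["labels"]))
```

Each cell's list would be written to its own JSON file and referenced from that Prometheus instance's `file_sd_configs`, so growing the fleet means adding files and instances rather than resizing one scraper.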
2025 · MULTI-AGENT LLM

Course Distiller

Plugin-based multi-agent pipeline that turns long-form course transcripts into textbook chapters, flashcards, concept maps, and a grounded AI tutor.

  • Routed multi-LLM architecture — Claude Sonnet for constraint-following grouping, Gemini 2.5 Pro for long-context distillation, Flash for high-volume enrichment, Haiku/Sonnet router for tutoring, DeepSeek for news.
  • 3-batch + merge strategy for chapters that exceed single-prompt context — distill in batches, then run a merge pass that produces a single-author narrative.
  • RAG-grounded tutor with ChromaDB and source attribution; chosen over fine-tuning because the corpus grows incrementally.
  • Cross-device sync via Turso (cloud SQLite) — laptop dev + GCP VM mobile access, no Postgres overhead.
  • First run: 324 lectures (91 hours of video) → 45 structured chapters with flashcards, code challenges, and concept maps.
FastAPI · React · ChromaDB · Turso · Claude · Gemini · GCP
View repo →
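
The batch-then-merge strategy above, reduced to its skeleton. The `LLM` callable type is a hypothetical interface (prompt in, text out) and the merge prompt wording is invented for the example; real model clients and prompts would be plugged in where the stubs are.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def split_batches(segments: List[str], n_batches: int = 3) -> List[List[str]]:
    """Split ordered segments into contiguous, near-equal batches."""
    n = max(1, min(n_batches, len(segments)))
    size, extra = divmod(len(segments), n)
    batches, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        batches.append(segments[start:end])
        start = end
    return batches


def distill_chapter(segments: List[str], distill: LLM, merge: LLM,
                    n_batches: int = 3) -> str:
    """Distill each batch on its own, then merge the drafts in one pass."""
    drafts = [distill("\n\n".join(batch))
              for batch in split_batches(segments, n_batches)]
    if len(drafts) == 1:
        return drafts[0]  # fits in a single prompt; no merge needed
    merge_prompt = ("Merge these partial drafts into one chapter with a "
                    "single voice:\n\n" + "\n\n---\n\n".join(drafts))
    return merge(merge_prompt)


# Stub models so the sketch runs without any API access.
def fake_distill(prompt: str) -> str:
    return f"[draft of {len(prompt.split())} words]"


def fake_merge(prompt: str) -> str:
    return f"[merged {prompt.count('---') + 1} drafts]"


transcript = [f"lecture segment {i}" for i in range(7)]
chapter = distill_chapter(transcript, fake_distill, fake_merge)
print(chapter)
```

Keeping batches contiguous preserves lecture order, which gives the merge pass a coherent sequence to rewrite into one narrative.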
2025 · GPU INFERENCE

Medical ML API

MedGemma 4B and Google MedASR served from a Tesla T4 GCP VM via FastAPI + systemd, with API-key auth and HIPAA-aware design.

  • Production GPU inference — bfloat16 + greedy decoding for deterministic medical Q&A; chat-template formatting per MedGemma's spec.
  • Lazy model loading + auto-unload for memory pressure management on a single-T4 VM.
  • MedASR via Pipeline API (not raw model loading) for speech-to-text transcription.
  • API-key auth and HIPAA-aware design — connects to a Spring Boot backend via SSH tunnel.
  • systemd-managed lifecycle, structured logging, and lifecycle metrics for the on-call rotation.
MedGemma · MedASR · FastAPI · Tesla T4 · GCP · systemd · HIPAA
View repo →
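
The lazy-load and auto-unload behavior can be sketched independently of any model library. `LazyModel`, the 300-second idle window, and the injectable clock are illustrative assumptions; in the real service the loader would build the PyTorch model, and freeing GPU memory would likely also involve `torch.cuda.empty_cache()`.

```python
import threading
import time
from typing import Any, Callable


class LazyModel:
    """Load on first use; drop the reference after an idle period."""

    def __init__(self, loader: Callable[[], Any], idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock            # injectable so tests need no sleeping
        self._lock = threading.Lock()
        self._model = None
        self._last_used = 0.0
        self.load_count = 0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it if needed, and mark it as used."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self) -> bool:
        """Call periodically from a background task; True if unloaded."""
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_seconds:
                return False
            self._model = None  # release the reference so memory can be freed
            return True


now = [0.0]  # fake clock for the demo
model = LazyModel(loader=lambda: object(), idle_seconds=300,
                  clock=lambda: now[0])
model.get()                            # first request triggers the load
now[0] = 100.0
still_warm = model.unload_if_idle()    # only 100s idle: stays loaded
now[0] = 500.0
unloaded = model.unload_if_idle()      # 500s idle: unloaded
model.get()                            # next request reloads
print(still_warm, unloaded, model.load_count)
```

On a single-T4 VM this trades a slower first request after idle time for the ability to keep more than one model deployable on the same card.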
07 //

Writing — notes from the on-call rotation

08 //

Contact — page me

If you're hiring for the layer
where AI systems actually run,
let's talk.

Open to AI/ML Platform, GPU Infrastructure, HPC DevOps, SRE, and Cloud Platform roles. Particularly interested in research-driven environments — medical AI, scientific computing, distributed training — where reliable infrastructure directly enables real outcomes. Memphis-based, open to relocation.

Email me
© 2026 Arafath Mohammed · Built with restraint and uptime in mind.