v2.6.0 · MEMPHIS, TN · OPEN TO RELOCATION · UPTIME: Since 2019

AI/ML platform &
DevOps engineering for
GPU and HPC systems.

I'm Arafath Mohammed — an AI/ML Platform & DevOps Engineer with deep Site Reliability experience, operating GPU fleets, distributed-training infrastructure, and Kubernetes at scale. I work end-to-end across CI/CD, observability, incident response, and SLO design — with a specialty in AI infrastructure: CUDA / PyTorch runtimes, distributed-training observability, checkpoint & recovery systems, and multi-tenant GPU scheduling.

arafath@platform-prod ~ %

MTTD reduction 18m → 90s

Sev-1/Sev-2 incidents led 65+

GPU utilization lifted 48 → 79%

Job restart cost cut 71%

Status healthy

01 //

Capabilities — what I bring to a platform team

core practice areas

End-to-end reliability · SLOs

SLOs that balance velocity and stability.

Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies and multi-window multi-burn-rate alerting that gate releases without blocking velocity. Authored 40+ postmortems with 92% of follow-ups closed within 30 days.
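To illustrate the multi-window multi-burn-rate idea named above, here is a minimal Python sketch. It assumes the commonly cited pairing of a 1h long window and a 5m short window with a 14.4x threshold for a 30-day budget. The names `BurnRateRule` and `should_page`, and the sample error ratios, are hypothetical and not taken from any production system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One alerting rule: a long window confirmed by a short window."""
    long_window: str   # e.g. "1h"
    short_window: str  # e.g. "5m"
    threshold: float   # burn-rate multiple that triggers the alert


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratios: dict, slo_target: float, rule: BurnRateRule) -> bool:
    """Page only when BOTH windows exceed the threshold."""
    long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
    short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
    return long_burn >= rule.threshold and short_burn >= rule.threshold


# Hypothetical rule and measurements for a 99.9% SLO.
FAST_BURN = BurnRateRule(long_window="1h", short_window="5m", threshold=14.4)
ongoing = {"1h": 0.02, "5m": 0.03}       # still burning in both windows
recovered = {"1h": 0.02, "5m": 0.0005}   # short window has already recovered
```

Requiring the short window to agree with the long one is what lets an alert clear quickly once the error rate drops, so a release gate built on it reopens soon after recovery.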

Distributed-training observability

Telemetry across the full training path.

Built multi-layer observability across host, NIC/RDMA, NCCL, container, framework, and scheduler using Prometheus, Mimir, Grafana, OpenTelemetry, and DCGM. Reduced GPU and fabric failure MTTD from ~18 minutes to under 90 seconds.

Kubernetes at scale

Heterogeneous GPU clusters in production.

Operated 1,200+ node Kubernetes fleets across EKS and GKE with mixed CPU/GPU workloads — namespace isolation, resource quotas, NetworkPolicies, NVIDIA GPU Operator, MIG partitioning, Karpenter-managed spot scaling.

Incident response · postmortems

Rapid recovery, blameless review, recurrence kill.

Led incident command for 65+ Sev-1/Sev-2 events including a multi-region NCCL deadlock that stalled a 1,024-GPU run, etcd corruption affecting a control plane, and an upstream object-storage outage. Drove every one to systemic fixes that prevented recurrence.

Checkpoint & recovery systems

Long-running jobs that survive hardware failures.

Built a checkpoint-and-recovery framework with async sharded writes to S3, cross-region replication, integrity verification, and automatic resume on hardware failure. Recovered 7-figure GPU-hours over 12 months and cut job restart cost by 71%.
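The integrity-verification and automatic-resume steps described above can be sketched roughly as follows. This is a simplified, synchronous illustration that writes to a local directory instead of S3. The manifest layout, the function names, and the zero-padded `step-*` directory naming are all assumptions made for the example, not the actual framework.

```python
import hashlib
import json
from pathlib import Path


def write_sharded_checkpoint(state: bytes, out_dir: Path, num_shards: int) -> Path:
    """Split `state` into shards, write each one, then write the manifest last."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_size = max(1, -(-len(state) // num_shards))  # ceiling division
    entries = []
    for i in range(num_shards):
        chunk = state[i * shard_size:(i + 1) * shard_size]
        name = f"shard-{i:05d}.bin"
        (out_dir / name).write_bytes(chunk)
        entries.append({"name": name, "sha256": hashlib.sha256(chunk).hexdigest()})
    manifest = out_dir / "manifest.json"
    # A missing manifest marks the checkpoint as incomplete.
    manifest.write_text(json.dumps({"shards": entries}))
    return manifest


def verify_checkpoint(out_dir: Path) -> bool:
    """Return True only if every shard exists and matches its recorded hash."""
    manifest = out_dir / "manifest.json"
    if not manifest.exists():
        return False
    for entry in json.loads(manifest.read_text())["shards"]:
        shard = out_dir / entry["name"]
        if not shard.exists():
            return False
        if hashlib.sha256(shard.read_bytes()).hexdigest() != entry["sha256"]:
            return False
    return True


def latest_valid_checkpoint(root: Path):
    """Pick the newest step directory that passes verification, or None."""
    # Assumes zero-padded step numbers so lexical order equals numeric order.
    candidates = sorted((p for p in root.glob("step-*") if p.is_dir()), reverse=True)
    for ckpt in candidates:
        if verify_checkpoint(ckpt):
            return ckpt
    return None
```

On resume, a job would call `latest_valid_checkpoint` and fall back to an older step whenever the newest one fails verification, so a write interrupted by a hardware failure costs one checkpoint interval instead of the whole run.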

Multi-tenant isolation

LoRA co-scheduling without compromising separation.

Hardened multi-tenant isolation for LoRA-based co-scheduled fine-tuning workloads using Kubernetes namespaces, NVIDIA MIG partitioning, NetworkPolicies, gVisor sandboxing, and per-tenant quota enforcement. Raised GPU utilization from 48% to 79% with zero cross-tenant data incidents.

Tooling & automation

Software written to solve reliability problems.

Production code in Python, Go, and Bash: a Go-based GPU health-check daemon (DCGM + nvidia-smi + dmesg, auto-cordon, JIRA bundle filing) that eliminated ~85% of silent bad-GPU stalls. Plus chaos-engineering toolkits, Prometheus exporters, and Terraform/OPA policy guardrails.

Cloud at scale

Production services across AWS, GCP, Azure.

Deep operational experience with EKS, GKE, AKS, EC2, S3, IAM, VPC, Lambda, CloudWatch, plus Terraform, Helm, Argo CD/Workflows, and Karpenter. Comfortable with the unglamorous parts: IAM least-privilege, network isolation, secrets management, compliance audits.

02 //

Currently — what's on the bench right now

interests · side work · learning

LLMs & AI Engineering

Studying the internals of modern LLMs, including transformer architecture and mechanistic interpretability. Experimenting with fine-tuning LLMs and SLMs (small language models), and building systems around them both in the cloud and locally. Building multi-agent pipelines with Claude, Gemini, and open-weights models.

Focused on finding ways of "using" LLMs, but even more interested in operating them: routing, observability, cost discipline, and the infrastructure underneath.

Observability Research

Mapping out what production-grade GPU observability looks like at hyperscale. DCGM exporters, Prometheus sharding strategies, Mimir long-term storage, SNMP / IPMI / Redfish telemetry from the rack up. Inspired by Colossus-scale (~200K GPU) deployments.

Goal: an honest reference architecture for monitoring a large AI training fleet end-to-end — node health → fabric → scheduler → framework.

Sharpening Fundamentals

Working through senior-level Python — async & concurrency for FastAPI, data structures, exception design — alongside K8s, Terraform, and Go refreshers. The goal isn't pass-the-screen; it's being able to explain why a design choice is right.

Same energy applied to system design and AI-infra interview prep — distributed training failure modes, checkpoint design, multi-tenant GPU isolation.

Looking For

Contract or full-time AI/ML platform, GPU infrastructure, or HPC DevOps roles where the work has weight. Memphis-based, open to relocation or hybrid.

03 //

Work history — prod incidents survived

2018 → present

Nov 2023 → present

AI/ML Platform Engineer

Client: FedEx

Reliability ownership across distributed-training infrastructure and platform services.

Designed multi-layer observability across the full training path (host → NIC/RDMA → NCCL → container → framework → scheduler); cut MTTD for GPU and fabric failures from ~18 minutes to under 90 seconds.
Operated and maintained GPU-accelerated compute infrastructure supporting AI/ML training and inference workloads.
Deployed NVIDIA Tesla T4 GPU environments with CUDA, cuDNN, and NCCL stacks, serving containerized AI models (PyTorch, TensorFlow) via FastAPI with API key authentication and systemd service management.
Designed and implemented CI/CD pipelines using Jenkins, GitHub Actions, and Ansible for automated infrastructure provisioning.
Managed containerized AI runtimes using Docker and Singularity, ensuring reproducible compute environments with pinned dependencies, validated CUDA compatibility, and consistent GPU driver versions across training and inference nodes.

K8s · DCGM · NCCL · Mimir · OpenTelemetry · Go · AWS

40+ Sev-1/2 led

Jun 2023 → Nov 2023

Site Reliability Engineer

Grant Thornton

Production K8s operations across 6 EKS/GKE clusters supporting 300+ microservices and ML inference.

Operated 1,200+ node Kubernetes fleet (mixed CPU/GPU) with a 99.95% availability target; drove on-call IC for 25+ Sev-1/Sev-2 incidents.
Closed 92% of postmortem follow-ups within 30 days — failovers, cert-expiry outages, autoscaler thrash.
Built modular Terraform + Ansible infrastructure-as-code pipelines for automated provisioning of compute, storage, and networking resources; used by 20+ engineers, they cut environment setup from 3 days to 4 hours.
Rolled out automated patch cycles and configuration hardening across 200+ Linux hosts using Ansible playbooks triggered through Jenkins.
Implemented an observability stack using Prometheus, Grafana, and Loki for centralized metrics, log aggregation, dashboards, and alerting, tracking scheduler activity, resource utilization, and service-health KPIs across distributed infrastructure.
Designed multi-region DR with zero data loss and <5 min RTO; reduced manual deployment effort by 70% with Terraform + Ansible.

EKS · GKE · Mimir · Karpenter · Terraform · Splunk

1.2K node fleet

Mar 2021 → Aug 2021

Senior ERC Associate

Amazon

Improved the efficiency and user experience of internal HR tools by offering feedback on system usability. Identified gaps in application performance and collaborated with the product and development teams to suggest actionable improvements.

UAT · Internal tooling · Customer support

500 weekly cases

Jan 2019 → Feb 2021

Site Reliability Engineer

Web Initiate

Developed Python backend services deployed on AWS EC2 and Elastic Beanstalk, building REST APIs and automation scripts for enterprise applications supporting high-throughput data processing workflows.
Stood up the company's first observability stack (Prometheus, Grafana, Loki, Jaeger); golden-signal dashboards became the team standard.
Authored 30+ runbooks and led postmortems for outages including a Redis cluster split-brain and an IAM mis-scoping that exposed an internal endpoint; both resulted in shipped guardrails (Terraform + OPA, Redis Sentinel quorum config).
Built a Python + Boto3 toolchain enforcing CIS benchmarks, IAM least-privilege analysis, and S3 public-access remediation across 30+ AWS accounts.

EKS · Terraform · ArgoCD · Boto3 · OPA

30+ AWS accounts

Jun 2018 → Dec 2018

Cloud Engineer Intern

Zobibox

Worked on Java/Spring Boot microservices and AWS integrations (SQS, RDS, Lambda), with hands-on RHEL system administration. Contributed to Kubernetes-based deployment of banking microservices and supported senior engineers on REST API design, database modeling, and Prometheus/Grafana cluster monitoring.

Spring Boot · AWS · RHEL · K8s

1 yr foundation

04 //

AI/ML specialty — infra that knows what it's running

specialty area

Distributed Training

NCCL / RDMA debugging

Multi-region NCCL deadlocks, collective-comm failures, fabric flapping. Built preflight collective checks that catch fabric issues before a job starts wasting GPU-hours.

Checkpoint Systems

Async sharded checkpointing

S3 sharded writes with cross-region replication, integrity verification, and automatic resume. Recovered 7-figure GPU-hours of training progress over 12 months.

GPU Telemetry

CUDA stack + DCGM + XID monitoring

Go daemon scraping DCGM, nvidia-smi, and kernel dmesg for XID faults, ECC double-bit errors, and PCIe link degradation across the CUDA / cuDNN / NCCL stack. Auto-cordons unhealthy nodes — eliminated ~85% of silent stalls.
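As a rough illustration of the XID-scanning step, the sketch below parses kernel log lines for NVIDIA XID events and decides whether a node should be cordoned. It is written in Python for brevity and is not the Go daemon itself. The log line shape is assumed from typical NVRM driver output, the sample lines are made up for the example, and the set of codes that trigger a cordon is an illustrative policy choice.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Matches driver log lines shaped like:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")

# Illustrative policy only: which codes warrant a cordon is an assumption.
CORDON_XIDS = {48, 79}  # 48: double-bit ECC error, 79: GPU fallen off the bus


class XidEvent(NamedTuple):
    pci_address: str
    code: int


def parse_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield one XidEvent per log line that contains an XID report."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield XidEvent(pci_address=match.group(1), code=int(match.group(2)))


def should_cordon(events: Iterable[XidEvent]) -> bool:
    """True if any event carries a code from the cordon policy."""
    return any(event.code in CORDON_XIDS for event in events)


# Made-up sample lines in the assumed format.
SAMPLE = [
    "[ 1234.5] NVRM: Xid (PCI:0000:3b:00): 31, pid=4321",
    "[ 1240.1] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[ 1241.0] usb 1-1: new high-speed USB device",
]
events = list(parse_xid_events(SAMPLE))
```

A real health check would combine this signal with DCGM and nvidia-smi readings before cordoning, since a single log source can miss faults or lag behind them.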

CUDA / PyTorch Runtime

NVIDIA stack & framework support

Manage CUDA, cuDNN, and NCCL stack compatibility across compute nodes; support PyTorch and TensorFlow workloads with containerized runtimes (Docker, Singularity). Triage version-mismatch, OOM, and accelerator-aware scheduling issues end-to-end.

LLM Serving

PyTorch + FastAPI GPU inference

Production PyTorch serving with FastAPI, lazy model loading, auto-unload, bfloat16 + greedy decoding. Built API-key auth, HIPAA-aware design, and systemd-managed lifecycle for medical-domain inference on a Tesla T4.

Multi-Agent Pipelines

Routed multi-LLM systems

Plugin-based agent architectures across Claude, Gemini, and DeepSeek — each model chosen for what the task actually needs (Sonnet for constraint-following, Pro for context, Flash for high-volume extraction).

RAG & Embeddings

ChromaDB & production retrieval

RAG pipelines with ChromaDB for grounded Q&A, source attribution, and incremental indexing — chosen over fine-tuning where the corpus grows continuously.

LoRA / PEFT

Multi-tenant fine-tuning

Co-scheduled LoRA workloads on shared GPU pools with MIG partitioning, gVisor sandboxing, and per-tenant quota enforcement. Lifted utilization from 48% to 79%.

Routing & Cost

Complexity-aware model routing

Lightweight classifiers that route queries to Haiku vs Sonnet by complexity — keeps latency and cost down on simple lookups while reserving the bigger model for what actually needs it.
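A toy version of complexity-aware routing might look like the sketch below. It substitutes a keyword-and-length heuristic for a trained classifier, and the tier labels `small` and `large` stand in for the smaller and larger model. The hint words, weights, and threshold are all hypothetical.

```python
import re

# Hypothetical signals that a query needs multi-step reasoning.
COMPLEX_HINTS = ("why", "compare", "design", "debug", "trade-off", "explain")


def complexity_score(query: str) -> float:
    """Crude score in [0, 1] built from query length and reasoning keywords."""
    words = re.findall(r"[\w'-]+", query.lower())
    length_part = min(len(words) / 50.0, 1.0)
    hint_part = min(sum(w in COMPLEX_HINTS for w in words) / 2.0, 1.0)
    return 0.5 * length_part + 0.5 * hint_part


def route(query: str, threshold: float = 0.25) -> str:
    """Send simple lookups to the small tier and everything else to the large one."""
    return "large" if complexity_score(query) >= threshold else "small"
```

For example, `route("What is the pod CIDR?")` lands on the small tier, while a question that asks to explain and compare options crosses the threshold and goes to the large one.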

05 //

Stack — service inventory

production · daily-driver

Orchestration & Compute

Kubernetes (EKS, GKE, AKS) DAILY
Helm & Kustomize DAILY
Argo CD / Argo Workflows PROD
Karpenter PROD
Docker / containerd DAILY
Linux / systemd DAILY

Observability

Prometheus / Mimir / Thanos DAILY
Grafana / Loki / Tempo DAILY
OpenTelemetry DAILY
NVIDIA DCGM Exporter PROD
eBPF (Pixie, Parca) PROD
PagerDuty / Sentry DAILY

Cloud Platforms

AWS (EKS, EC2, S3, IAM, etc.) DAILY
GCP (GKE, Compute, Vertex) DAILY
Azure (AKS, AAD) PROD
Terraform DAILY
CloudFormation PROD
OPA / policy-as-code PROD

AI / ML Infrastructure

PyTorch DAILY
CUDA / cuDNN / NCCL PROD
Distributed training (RDMA, MPI) PROD
Checkpoint & recovery systems PROD
LoRA / PEFT / DPO PROD
LLM serving (FastAPI, vLLM) DAILY
RAG (ChromaDB, embeddings) DAILY
Singularity / Docker containers PROD
Slurm / LSF schedulers FAMILIAR
Multi-LLM routing DAILY

Languages & Tooling

Python DAILY
Go PROD
Bash DAILY
SQL DAILY
TypeScript / React PROD
gRPC / Kafka PROD

CI/CD & Reliability Practice

GitHub Actions / GitLab CI DAILY
Jenkins / Azure DevOps PROD
SLO / SLI design DAILY
Incident command DAILY
Postmortem authoring DAILY
Chaos / failure injection PROD
