Reliability at scale
10,000+ servers and 400,000+ cloud assets across AWS, Azure, GCP and Alibaba — kept observable, kept honest.
I've spent five years keeping production systems calm — across cloud, containers, and observability. My focus is the unglamorous work that makes everything else possible: automating the toil, owning the incident, and shipping the runbook.
Four pillars I keep returning to — the lens I use to evaluate a system, a process, or my own work.
10,000+ servers and 400,000+ cloud assets across AWS, Azure, GCP and Alibaba — kept observable, kept honest.
Terraform, GitLab CI, Helm, Chef. If a human is doing it twice, I'd rather write the script before there's a third time.
Operated five enterprise monitoring and alerting platforms (Datadog, Dynatrace, Zabbix, Site24x7, PagerDuty). Escalation point for critical incidents, with RCA ownership.
I work the seam between engineering and ops, translating customer pain into platform improvements and process changes that actually stick.
The platforms, languages and frameworks I've shipped with in production.
From Tier 1 NOC support to designing AWS infrastructure for global enterprise platforms — the path so far.
DevOps Engineer
DevOps Engineer
NOC Engineer
Network / System Engineer
Ho Chi Minh City
Hiring, collaborating, mentoring, or just curious about an observability stack — I read everything.