Devin Bess

Senior Site Reliability Engineer / Architect

Architect and Senior SRE focused on observability, distributed systems, and developer tooling. I write Python, Bash, and IaC, and I care about keeping things running.


Skills

Languages

Go
Python
HTML/CSS
JavaScript/TypeScript
Bash

SRE & Observability

Prometheus
Grafana
DataDog
Loki
OpenTelemetry
Kubernetes
Docker

Cloud & IaC

AWS
OCI
Terraform
Pulumi
ArgoCD

Automation & CI/CD

Buildkite
GitHub Actions
GitLab
OCI-SCM
Jenkins
Ansible

Databases

MySQL
PostgreSQL
DynamoDB
MongoDB

Systems & Security

Linux/Kernel
Networking (TCP/IP, HTTP/gRPC)
Security (OAuth2)

Projects

The Observability Stack

Loki
Tempo
Grafana
OpenTelemetry
Helm

Fragmented logging and tracing across multiple vendor tools made cross-service debugging slow and expensive.

Outcome › In progress: unifying logs, metrics, and traces under a single Grafana experience for personal infrastructure.

GitHub ↗

OCI Sovereign Cloud Incident Automation

Python
OCI APIs
OCI Alarms
O.C.E.A.N.
FedRAMP/IL5

Manual My Oracle Support (MOS) Service Request triage and stakeholder notifications slowed mean time to resolution for critical Sovereign Cloud incidents.

Outcome › Automated SR creation, severity triage, and notifications via Python + OCI APIs; measurably reduced MTTR across low-side and high-side government realms.

CAB Workflow Automation

Python
OCI APIs
GitHub Actions
Buildkite

Manual Change Advisory Board review created a deployment backlog for OCI service teams, slowing safe release velocity.

Outcome › Engineered automation pipelines that eliminated manual change request processing, cut the review queue, and accelerated deployments for the Sovereign Cloud org.

Twilio Resilience Platform

PagerDuty
Rollbar
Prometheus
Python
Ansible

Production outages required manual triage that extended downtime windows and risked SLA breach across distributed services.

Outcome › Cut downtime by 40% with automated monitoring, runbooks, and recovery tooling; sustained 99.99% availability and reduced manual operational effort by 50%.

Writing

August 23, 2024 · 5 min read

Building a Resilient On-Call Culture ↗

On-call isn't just alert response; it's building a blameless, high-trust team that learns quickly from production incidents.

SRE
On-Call
Incident Management
Engineering Culture
MTTD
MTTR

April 7, 2026 · 7 min read

P99 Latency: What It Means & How to Fix It ↗

P99 latency is the response time that 99% of your requests come in under; only the slowest 1% take longer. It is a better health signal than averages for catching tail latency issues in production.

SRE
Observability
Latency
Performance
Redis

May 29, 2026 · 5 min read

Cutting Cloud Spend Without Hurting Reliability

Which cloud dollars are doing real work, and which ones are just vibing? An SRE's guide to spending less without breaking everything.

SRE
DevOps
Cloud
Automation