Devin Bess

Senior Site Reliability Engineer / Architect

Architect and Senior SRE focused on observability, distributed systems, and developer tooling. I write Python, Bash, and IaC, and I care about keeping things running.

Skills

Languages

  • Go
  • Python
  • HTML/CSS
  • JavaScript/TypeScript
  • Bash

SRE & Observability

  • Prometheus
  • Grafana
  • DataDog
  • Loki
  • OpenTelemetry
  • Kubernetes
  • Docker

Cloud & IaC

  • AWS
  • OCI
  • Terraform
  • Pulumi
  • ArgoCD

Automation & CI/CD

  • Buildkite
  • GitHub Actions
  • GitLab
  • OCI-SCM
  • Jenkins
  • Ansible

Databases

  • MySQL
  • PostgreSQL
  • DynamoDB
  • MongoDB

Systems & Security

  • Linux/Kernel
  • Networking (TCP/IP, HTTP/gRPC)
  • Security (OAuth2)

Projects

The Observability Stack

  • Loki
  • Tempo
  • Grafana
  • OpenTelemetry
  • Helm

Fragmented logging and tracing across multiple vendor tools made cross-service debugging slow and expensive.

Outcome › In progress: unifying logs, metrics, and traces under a single Grafana experience for personal infrastructure.

OCI Sovereign Cloud Incident Automation

  • Python
  • OCI APIs
  • OCI Alarms
  • O.C.E.A.N.
  • FedRAMP/IL5

Manual My Oracle Support (MOS) Service Request triage and stakeholder notifications slowed mean time to resolution for critical Sovereign Cloud incidents.

Outcome › Automated SR creation, severity triage, and notifications via Python + OCI APIs; measurably reduced MTTR across low-side and high-side government realms.

CAB Workflow Automation

  • Python
  • OCI APIs
  • GitHub Actions
  • Buildkite

Manual Change Advisory Board review created a deployment backlog for OCI service teams, slowing safe release velocity.

Outcome › Engineered automation pipelines that eliminated manual change request processing, cut the review queue, and accelerated deployments for the Sovereign Cloud org.

Twilio Resilience Platform

  • PagerDuty
  • Rollbar
  • Prometheus
  • Python
  • Ansible

Production outages required manual triage that extended downtime windows and risked SLA breach across distributed services.

Outcome › Cut downtime 40% with automated monitoring, runbooks, and recovery tooling; sustained 99.99% availability and reduced manual operational effort by 50%.

Writing

Building a Resilient On-Call Culture

On-call isn't just alert response; it's building a blameless, high-trust team that learns quickly from production incidents.

P99 Latency: What It Means & How to Fix It

P99 latency is the response time that 99% of your requests come in under; only the slowest 1% take longer. That makes it a better health signal than averages for catching tail latency issues in production.
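
To make the gap between an average and a tail percentile concrete, here is a minimal Python sketch. It uses the nearest-rank percentile method, which is only one of several common definitions, and the latency values are made-up sample data, not measurements from any real system. The function name `percentile_nearest_rank` is hypothetical.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile of samples, for pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With this sample data the mean and the median both look healthy, while the P99 surfaces the two slow requests that the average smooths over.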

Cutting Cloud Spend Without Hurting Reliability

Which cloud dollars are doing real work, and which ones are just vibing? An SRE's guide to spending less without breaking everything.