Engineering Excellence: the system that makes speed sustainable

Engineering Excellence (EE) isn’t a slogan, a committee, or a “quality week.” It’s the set of practices that lets a team ship reliably at scale—with fewer regressions, lower on-call load, predictable releases, and compounding developer productivity.

At its core, EE is pride in what we build. But pride without a system becomes heroics. EE is the system.

What EE is not: maximizing daily commit counts, rewarding “busy-ness,” or shipping features by taking on invisible risk. If a team measures velocity primarily by commits/day, that’s a vanity metric. A better measure is: how quickly and safely can we deliver a production change?

EE can feel at odds with release velocity in the short term because it avoids shortcuts and invests in guardrails (security, testing, observability, documentation). Over time, though, EE is what removes friction and makes speed sustainable.

Below is a practical checklist you can use to assess where you are today—and where to invest next.


The five pillars of Engineering Excellence

1) Delivery system: make shipping boring

Excellence shows up when production changes are routine, not an “event.”

  • Protected main/release branches (no direct pushes; required reviews; required checks)
  • A predictable release process (one-click deploy, or at least a documented, repeatable process)
  • A deployment dashboard (what’s running where, who deployed, and when)
  • Rollbacks are real (tested, practiced, and fast)
  • At least one production-like pre-prod gate (staging, canary, or equivalent)

If releases are painful, teams will avoid releasing—and risk will pile up.
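A deployment dashboard can start as little more than an append-only record of deploys that answers "what's running where, who deployed it, and when." A minimal sketch of that idea (all names hypothetical, not a reference to any particular tool):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Deploy:
    service: str
    version: str
    environment: str
    deployed_by: str
    deployed_at: datetime


class DeployLog:
    """Append-only deploy record: what's running where, who, and when."""

    def __init__(self):
        self._events = []

    def record(self, service, version, environment, deployed_by):
        self._events.append(Deploy(service, version, environment,
                                   deployed_by, datetime.now(timezone.utc)))

    def running(self, service, environment):
        """Latest deployed version of a service in an environment, or None."""
        for d in reversed(self._events):
            if d.service == service and d.environment == environment:
                return d.version
        return None
```

Even this toy version makes rollback conversations concrete: "roll back prod to the previous `running()` version" is only answerable if the record exists.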


2) Code health: quality is a habit, not a phase

EE is the day-to-day discipline that prevents entropy.

  • Code review that reduces risk without stalling work
    • Clear ownership (e.g., CODEOWNERS / domain owners)
    • Blockers vs nits are distinguished
    • Escalation path when reviewers disagree (sync reviews > comment wars)
  • PRs linked to work items (so “why” is preserved, not just “what”)
  • Trunk-based development or close equivalent
  • A consistent policy on merges
    • Squash merges can reduce noise
    • A linear-ish history improves bisectability and debugging
A note on reviews: in large orgs, “everyone must approve” creates stalls and politics. My preference is: one accountable approver (owner) + optional reviewers, with clear escalation for disagreements. The goal is learning and risk reduction—not gatekeeping.
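The "one accountable approver" model only works if path-to-owner mapping is unambiguous. A toy approximation of CODEOWNERS-style matching, where the last matching pattern wins (patterns and team names hypothetical; real CODEOWNERS semantics are richer than `fnmatch`):

```python
from fnmatch import fnmatch

# Hypothetical ownership rules, ordered least- to most-specific,
# mirroring a CODEOWNERS file where the last matching pattern wins.
OWNERS = [
    ("*", "@platform-team"),
    ("services/payments/*", "@payments-team"),
    ("docs/*", "@docs-team"),
]


def accountable_owner(path):
    """Return the single accountable approver for a changed file."""
    owner = None
    for pattern, team in OWNERS:
        if fnmatch(path, pattern):
            owner = team  # last match wins
    return owner
```

The point of the exercise: if this function can't return exactly one owner for every file in the repo, the review process has a gap before the first PR is even opened.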


3) Testing strategy: invest where failure is expensive

“Test everything” is not a strategy. EE is choosing the right mix:

  • Automated build + tests on every PR
  • Main branch builds are always green (a red main is treated as an emergency)
  • Meaningful integration tests (covering real service boundaries)
  • E2E tests in at least one environment that is “as real as possible”
  • Negative-path testing (your catch blocks and error conditions)
  • Chaos testing (bonus) where the blast radius is understood

On mocking: mocking is powerful and also easy to misuse. I’m skeptical of test suites that are mostly mocks because they often validate behavior that doesn’t exist in production. My rule of thumb: mock at the edges, prefer integration tests for real behavior, and be honest about what’s untested.
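"Mock at the edges" has a concrete shape: keep business logic pure and inject the I/O boundary, so tests exercise real logic against a fake edge instead of mocking internals. A sketch under that assumption (function names hypothetical):

```python
def fetch_price_live(sku):
    """The edge: in production this is a network call (stubbed out here)."""
    raise NotImplementedError("real HTTP call lives here")


def total_with_tax(skus, fetch_price, tax_rate=0.1):
    """Real business logic under test; nothing inside it needs mocking."""
    subtotal = sum(fetch_price(s) for s in skus)
    return round(subtotal * (1 + tax_rate), 2)


# In a test, only the edge is replaced; the arithmetic is the real thing:
fake_prices = {"a": 10.0, "b": 5.0}
assert total_with_tax(["a", "b"], fake_prices.get) == 16.5
```

Contrast this with mocking `total_with_tax` itself, which would "pass" while validating behavior that doesn't exist in production.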


4) Operational excellence: reliability is part of the product

If you can’t see it, you can’t run it.

  • Instrumentation + APM dashboards (traffic, latency, errors, saturation)
  • Alerting on 5xx + unexpected exceptions (with sane thresholds)
  • A runbook that is actually used (and updated as part of on-call)
  • Audit logs (especially around sensitive data access)
  • Rate limiting / abuse controls for public-facing APIs
  • Performance testing as a habit (load tests, synthetic monitoring, or real-user monitoring)
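"Sane thresholds" for 5xx alerting usually means two guards: a rate threshold rather than a raw count, and a minimum traffic floor so low-volume noise doesn't page anyone. A minimal sketch (threshold values are illustrative, not recommendations):

```python
def should_alert(statuses, error_rate_threshold=0.01, min_requests=100):
    """Alert when the 5xx rate over a window exceeds a threshold.

    `statuses` is a window of HTTP status codes. The min_requests floor
    keeps 1 error out of 2 requests from paging someone at 3 a.m.
    """
    total = len(statuses)
    if total < min_requests:
        return False
    errors = sum(1 for status in statuses if status >= 500)
    return errors / total > error_rate_threshold
```

The same shape generalizes to latency and saturation alerts: always ask "rate over what window, above what floor?" before wiring a pager.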

My bias here: tests over performance. Given limited EE time, I would rather invest it in test coverage than in performance tuning, since correctness failures usually hurt users faster than latency does.


5) Security & access: guardrails beat good intentions

EE includes security because security debt compounds faster than code debt.

  • Federated IAM where possible (your app shouldn’t see user passwords)
  • If you do handle passwords: salt + hash with a modern algorithm, and treat auth as a product
  • Service accounts / managed identity for services
  • Secret manager / vault (no secrets in code or source control)
  • Secret rotation (periodic and automated where feasible)
  • RBAC and least privilege (including production DB access)
  • Data protections (masking, backups, DR, geo-replication where needed)
  • Dependency vulnerability checks + static analysis where appropriate
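If you do end up handling passwords, the "salt + hash with a modern algorithm" bullet can be sketched with the standard library's PBKDF2 (a dedicated scheme such as Argon2 via a vetted library is preferable in practice; this is a minimal illustration):

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # illustrative PBKDF2-SHA256 work factor


def hash_password(password, salt=None):
    """Salted PBKDF2-HMAC-SHA256. A per-user random salt means two users
    with the same password get different hashes."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected):
    # compare_digest avoids leaking information via comparison timing.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

Note what's absent: no plaintext storage, no homemade crypto, and the salt travels with the hash rather than being a secret.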

One simple litmus test: do your guardrails prevent a well-meaning engineer from accidentally doing the wrong thing?


A practical maturity check

If you want a lightweight “where do we stand?” assessment, ask:

  1. How long does it take to ship a safe production change?
  2. What’s our change failure rate and mean time to recovery?
  3. What is the on-call burden (pages/engineer/week) trending over time?
  4. Can a new engineer onboard in a day with the README + docs?
  5. Do our systems prevent common failure modes by default?
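Question 2 above can be answered from the same records a deployment dashboard already keeps. A lightweight sketch of change failure rate and mean time to recovery (field names hypothetical; times here are minutes since detection):

```python
def change_failure_rate(deploys):
    """Fraction of deploys that caused a failure (rollback, incident, hotfix)."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d["caused_failure"])
    return failed / len(deploys)


def mean_time_to_recovery(incidents):
    """Average minutes from detection to recovery across incidents."""
    if not incidents:
        return 0.0
    durations = [i["recovered_at"] - i["detected_at"] for i in incidents]
    return sum(durations) / len(durations)
```

Tracking the trend of these two numbers quarter over quarter says more about EE maturity than any point-in-time audit.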

Those answers are more revealing than a thousand lines of policy.


My take on linear history and staying synced to main

I prefer a mostly linear commit history because it’s easier to understand, debug, and audit—especially when you’re using tools like git bisect. I was therefore surprised to read this SO post, where the top-voted answer recommends against it. To me, this is one of those cases where Stack Overflow is not always correct, and you should not blindly accept an opinion based on votes: listen to everyone, but make your own decisions. Before Git, many teams used systems that effectively forced “sync-to-latest” before check-in, which reduced integration surprises.
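The bisect argument is concrete: on a linear history, finding the first bad commit is a binary search, i.e., O(log n) test runs. A toy model of what `git bisect` exploits (history and predicate are hypothetical):

```python
def first_bad_commit(history, is_bad):
    """Binary-search a *linear* history for the first bad commit.

    Assumes the bisect invariant: all good commits precede all bad ones,
    and at least one commit is bad. With a merge-heavy, non-linear history
    the search space is a DAG, and each test run narrows it down less.
    """
    lo, hi = 0, len(history) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid          # the first bad commit is at mid or earlier
        else:
            lo = mid + 1      # everything up to mid is good
    return history[lo]
```

For a 1,000-commit regression window, that's roughly 10 test runs instead of hundreds, which is exactly the property a tangled merge graph erodes.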

There is a real cost: rebasing/syncing frequently adds overhead. At scale, teams often mitigate that with merge queues, protected branches, and automation.

My point isn’t that there’s one universally correct workflow. It’s that a team should consciously choose a workflow that optimizes for low integration pain, high signal history, and fast rollback/debugging—and then enforce it consistently.


How to start without boiling the ocean

If you’re trying to improve EE in a real org with real deadlines:

  • Pick one pillar per quarter
  • Choose the top 3 risks that cause outages, regressions, or slow releases
  • Add one guardrail per sprint
  • Make success measurable (fewer pages, faster releases, lower rollback time)

Engineering Excellence isn’t perfection. It’s compounding advantage.

