The backup jobs were running. They just hadn't been tested. Nobody knew until we tried to restore.
That's the most common thing we find in a Production Risk Audit. Not exotic vulnerabilities or architecture mistakes — just the ordinary gaps that accumulate when a small team is heads-down on product and nobody owns infrastructure full-time.
Here's exactly what the audit covers, what you'll get out of it, and how to know if it's right for you.
What we review
The audit covers four areas:
Architecture and deployment. How your application is deployed, what your rollback story is, whether you have single points of failure, and how changes get from a developer's laptop to production. We're not opinionated about technology choices — we're looking for gaps in process and reliability.
Security posture. Access controls (who can SSH into what, how credentials are managed), open ports and network exposure, patch status, and secrets handling. Most teams have at least one thing in this category that would make a security engineer wince.
Backup and recovery. We verify that backups exist, that they're being tested, and that you could actually restore from them under pressure. "We have backups" and "we can recover from a failure in under an hour" are two very different things.
Monitoring and alerting. What observability you have in place and what's missing. Do you know when your application is down before your users do? Can you debug a production incident at 2am with what you have?
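To make the 2am question concrete, the smallest useful answer is a probe that runs on a schedule and fails loudly when the app stops responding. Here's a minimal sketch in Python; the health-check URL and the alerting hook are placeholders for whatever your stack actually uses.

```python
#!/usr/bin/env python3
"""Minimal uptime probe: exits nonzero when the app stops answering.

A sketch, not a monitoring system. Run it from cron (or any scheduler)
and wire the nonzero exit into whatever pages you. HEALTH_URL is a
placeholder for your own endpoint.
"""
import sys
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/healthz"  # hypothetical health endpoint


def is_up(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Covers connection refused, DNS failure, timeouts, and 4xx/5xx
        # responses (urlopen raises HTTPError, a URLError subclass, for those).
        return False


if __name__ == "__main__":
    if not is_up(HEALTH_URL):
        print(f"ALERT: {HEALTH_URL} is not responding", file=sys.stderr)
        sys.exit(1)
```

Even something this small changes the failure mode from "a user tells you" to "the pager tells you," which is exactly the gap the audit is probing for.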
What you provide
Read access to your infrastructure. That typically means:
- SSH access (read-only is fine for most things)
- Access to your monitoring dashboards, if they exist
- Any relevant configuration files or runbooks
- A 30-minute kickoff call to understand your stack and what you're most worried about
We don't need write access. We won't make any changes to your systems during the audit.
What you get
A written report with:
- a summary of your current state
- a prioritized list of findings, ranked by severity and effort to fix
- specific, actionable recommendations for each finding
- a flag on anything time-sensitive
The report is written for engineers, not executives. It's direct and specific and doesn't pad the word count with disclaimers.
After delivery, we schedule a 30-minute debrief call to walk through the findings and discuss next steps.
What most teams find
A few things come up in almost every audit:
Backups that haven't been tested. Backup jobs are running, but the restore process has never been verified; you find out it's broken when you actually need it. (There's a sketch of a minimal restore drill after this list.)
Overly permissive access. Developer machines with production SSH keys. Service accounts with admin privileges. Credentials that were shared once and never rotated.
Missing alerting. Dashboards exist but there are no alerts attached to them. The team finds out something is wrong when a user reports it.
No rollback plan. If a deploy breaks production right now, what's the procedure? Many teams don't have a clear answer — or the answer is "redeploy and hope."
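On the backups point, "tested" has a concrete meaning: something restores the latest dump into a scratch database on a schedule and checks that the data is actually there. Here's a rough sketch of that drill, assuming Postgres custom-format dumps; the dump path, database name, and sanity query are placeholders for your own setup.

```python
#!/usr/bin/env python3
"""Sketch of a scheduled restore drill for Postgres custom-format dumps.

DUMP_PATH, SCRATCH_DB, and the sanity query are placeholders. The point:
a backup only counts once a restore has actually succeeded somewhere.
"""
import subprocess
import sys

DUMP_PATH = "/backups/latest.dump"  # hypothetical path to the newest dump
SCRATCH_DB = "restore_drill"        # throwaway database, recreated each run


def main() -> None:
    # Start from a clean scratch database on every run.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)

    # Restore the latest dump; check=True turns a failed restore into a loud error.
    subprocess.run(
        ["pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP_PATH], check=True
    )

    # Sanity-check that data came back, not just an empty schema.
    result = subprocess.run(
        ["psql", "-d", SCRATCH_DB, "-tAc", "SELECT count(*) FROM users;"],
        check=True, capture_output=True, text=True,
    )
    if int(result.stdout.strip()) == 0:
        sys.exit("Restore completed but the users table is empty")
    print("Restore drill passed")


if __name__ == "__main__":
    main()
```

Run on a schedule with an alert on failure, this turns "we have backups" into "we restored one this morning."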
None of these are unusual. They're the natural result of moving fast with a small team. The audit surfaces them so you can decide what to fix before something forces your hand.
Is the audit right for you?
It's a good fit if your team has been focused on product and infrastructure has been running on autopilot. It's also useful if you've recently scaled up and want to know whether your setup can handle where you're going, or if an enterprise prospect has started asking security questions you can't fully answer.
It's not a fit if you have a dedicated SRE team already doing this work, or if you're looking for someone to implement changes rather than audit (that's what the retainer is for).
If you're not sure whether it makes sense, reach out and we can talk it through in a few minutes.