Job Description: Lead Resilience Operations (ResOps) Engineer

Location: Remote / Hybrid / HQ 

Reporting to: Chief Risk Officer (CRO) or Chief Operating Officer (COO) 

The Mission: You aren't here just to keep the lights on; you’re here to ensure they stay on when the "unthinkable" happens. As the Lead ResOps Engineer, you will sit at the intersection of Engineering, Security, and Business Continuity, architecting systems that are not just robust, but anti-fragile.


Key Responsibilities

  • Mapping the "Invisible" Infrastructure: You will lead the "Dependency Mapping" initiative, identifying shadow dependencies and Nth-party risks that could trigger cascading failures across our ecosystem.
  • Designing "Severe but Plausible" Chaos: You will move beyond simple failover tests. You’ll design and execute "Game Days" that simulate complex, multi-vector shocks (e.g., a massive cloud outage combined with a ransomware lockdown).
  • Defining Impact Tolerance: You will collaborate with executive leadership to translate "99.9% uptime" into meaningful Business Impact Tolerance metrics. You’ll define the "breaking point" for every critical business service.
  • The Remediation Loop: When a vulnerability is found during a stress test, you won't just write a ticket. You will work with DevOps and ITOps to ensure resilience-debt is prioritized alongside feature-velocity.

The Ideal Hybrid Persona

  • The "Battle-Scarred" Engineer: 7+ years in SRE, SecOps, or Infrastructure. You’ve lived through a "Level 1" outage and know exactly how communication breaks down when the "pager" goes off.
  • The Systems Thinker: You don't see a server; you see a node in a global graph. You understand how a delay in a third-party analytics script can theoretically bring down a checkout flow.
  • The Diplomat: You can explain to a Product Manager why delaying a feature launch to improve "Mean Time to Recovery of Logic" (MTRL) is a win for the customer.

Technical & Theoretical Toolkit

  • Observability & Mapping: Experience with OpenTelemetry, Graph Databases (Neo4j), or Service Mesh (Istio/Linkerd).
  • Chaos Engineering: Proficiency with tools like Gremlin, Chaos Mesh, or AWS Fault Injection Simulator.
  • Frameworks: Familiarity with DORA (Digital Operational Resilience Act) or the NIST Cybersecurity Framework.
  • Crisis Logic: Strong understanding of "Circuit Breaker" patterns and "Graceful Degradation" architecture.
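
For candidates who want a concrete picture of the "Circuit Breaker" plus "Graceful Degradation" pairing, here is a minimal sketch. The class name, the thresholds, and the injectable clock are illustrative assumptions, not a description of our production stack.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: after repeated failures, skip the
    primary call and serve a degraded fallback until a cool-down passes."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: do not touch the failing dependency at all.
                return fallback()
            # Cool-down elapsed (half-open): fall through to one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, `primary` would be the live database read and `fallback` the cached catalog page; the point is that the degraded path is designed up front rather than improvised during the outage.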


Why This Role is Different

Unlike a standard Engineering role, you report into Risk/Operations. Your success is not measured by how many features we ship, but by how "boring" our last major vendor outage was because your systems handled it automatically.


Metrics for Success

1. MTRL: Mean Time to Recovery of Logic

This is the "North Star" of ResOps. While MTTCR (Mean Time to Clean Recovery) measures how long it takes to fix the tech and recover cleanly after a cyber incident, MTRL measures how long it takes for the business to function again.

  • The Scenario: Your database is corrupted. MTTCR is 4 hours (to restore the backup).
  • The MTRL Win: Within 5 minutes, your ResOps "Circuit Breaker" kicks in, allowing users to browse a cached version of the site and place "offline" orders to be processed later.
  • Metric Goal: Minimize the gap between the failure and the "Limited Functionality" state.
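
One way to compute the two numbers side by side is sketched below. The `Incident` record and its field names are assumptions made for illustration; the only idea taken from the text is that MTRL stops the clock at "Limited Functionality" while MTTCR stops it at clean recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative.
    failed_at: datetime
    limited_functionality_at: datetime  # business can operate again, degraded
    clean_recovery_at: datetime         # technology fully and cleanly restored


def mtrl_minutes(incidents):
    """Mean minutes from failure to the 'Limited Functionality' state."""
    return mean(
        (i.limited_functionality_at - i.failed_at).total_seconds() / 60
        for i in incidents
    )


def mttcr_minutes(incidents):
    """Mean minutes from failure to clean technical recovery."""
    return mean(
        (i.clean_recovery_at - i.failed_at).total_seconds() / 60
        for i in incidents
    )


# The database-corruption scenario above: 5 minutes to degraded service,
# 4 hours to a restored backup.
start = datetime(2024, 1, 1, 12, 0)
scenario = [Incident(start, start + timedelta(minutes=5), start + timedelta(hours=4))]
print(mtrl_minutes(scenario), mttcr_minutes(scenario))
```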

2. The "Blast Radius" Ratio

This measures the effectiveness of your Dependency Mapping and Siloing.

  • The Calculation: Impacted Services / Total Services during a single point of failure (SPOF) event.
  • The Goal: If your 3rd-party Payment Gateway goes down, your "Product Search" and "Customer Support Chat" should remain at 0% impact.
  • What Success Looks Like: A decreasing ratio over time, showing that failures are being successfully isolated.
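
As a rough sketch of how the ratio could be derived from a dependency map: the function below walks the graph backwards from the failed component and counts every service that transitively depends on it. The service names and the "hard dependency" map are hypothetical, and a real map would also need to distinguish hard dependencies from ones with a degradation path.

```python
from collections import deque


def blast_radius_ratio(hard_deps, failed):
    """hard_deps maps each of our services to the components it cannot run
    without. Returns impacted services / total services when `failed` goes down."""
    # Invert the map: component -> services that directly depend on it.
    dependents = {}
    for service, deps in hard_deps.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted = set()
    if failed in hard_deps:  # the failed component may itself be one of our services
        impacted.add(failed)

    # Breadth-first walk over everything that transitively depends on `failed`.
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for service in dependents.get(node, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return len(impacted) / len(hard_deps)


# Hypothetical map: only the checkout path hard-depends on the payment gateway.
deps = {
    "checkout": {"payment_gateway"},
    "order_confirmation": {"checkout"},
    "product_search": {"search_index"},
    "support_chat": set(),
}
print(blast_radius_ratio(deps, "payment_gateway"))
```

Here the gateway outage takes out two of four services; isolating "order_confirmation" from "checkout" would be the kind of change that moves the ratio down.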

3. Drift to Danger (The Resilience Decay)

Systems naturally become more fragile over time as new features are added. This metric tracks how often your "known" resilience patterns are bypassed.

  • The Measure: The number of new production services deployed without a verified "Graceful Degradation" path.
  • The Goal: 0. ResOps success is defined by ensuring that "Resilience Debt" doesn't accumulate faster than technical debt.

4. Shadow Dependency Coverage

You cannot protect what you haven't mapped. This measures the "blind spots" in your architecture.

  • The Measure: Percentage of 3rd- and 4th-party vendors (Nth-party risk) identified and stress-tested in the last quarter.
  • Cite-worthy Note: According to Accenture’s 2024 Resilience Report, "hidden" dependencies in the supply chain account for nearly 60% of modern systemic outages.
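
A simple way to track this measure is sketched here, assuming a vendor inventory that records the date of each vendor's last stress test. The vendor names and the 90-day window standing in for "last quarter" are assumptions, and the figure can only cover vendors that have already been identified.

```python
from datetime import date, timedelta


def shadow_dependency_coverage(last_tested, today, window_days=90):
    """Percentage of known vendors stress-tested within the window.
    `last_tested` maps vendor name -> date of last test, or None if never tested."""
    if not last_tested:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    covered = sum(1 for tested in last_tested.values()
                  if tested is not None and tested >= cutoff)
    return 100.0 * covered / len(last_tested)


# Hypothetical inventory of 3rd- and 4th-party vendors.
inventory = {
    "payment_gateway": date(2024, 5, 20),
    "cdn_provider": date(2024, 4, 15),
    "analytics_script": date(2024, 6, 1),
    "gateway_upstream_bank": None,  # 4th party, mapped but never tested
}
print(shadow_dependency_coverage(inventory, today=date(2024, 6, 30)))
```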
