If you want to see what operational discipline looks like in an industry where downtime is anything but abstract, look at retail forex brokerages. They run businesses that never close, depend on more third-party services than most B2B SaaS companies, and lose customers permanently every time their platform goes dark for an hour. They’ve had to figure out how to survive infrastructure failures, vendor outages, cyberattacks, and human error — not because they wanted to, but because their P&L wouldn’t let them ignore it.
Here are the lessons that translate.
1. You don’t know your real tolerance for downtime until you write it down
Most founders have a vague answer to the question “how long can our system be down before it’s a serious problem?” They’ll say something like “a few hours, I guess” — and then never think about it again.
Forex brokerages can’t get away with that vagueness. Their answer is measured in minutes. If a trading platform is offline during the peak London or New York session, traders move to a competitor and don’t come back. So operators make the answer explicit, using two numbers borrowed from the business continuity playbook:
- Recovery Time Objective (RTO): how long can you be down before the damage is unacceptable?
- Recovery Point Objective (RPO): how much recent data can you afford to lose?
These two numbers determine everything else — what infrastructure you need, how much redundancy you build, how often you back up, whether you need automated failover. A company that says “RTO four hours, RPO one hour” makes very different decisions than one that says “RTO fifteen minutes, RPO zero.”
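To make that concrete, here is a minimal sketch of what "writing the numbers down" can look like once they sit next to your real backup schedule and your last measured recovery. The names (`RecoveryTargets`, `gaps`) and the sample figures for backups and drill time are assumptions for illustration, not anything a particular brokerage uses.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # longest tolerable downtime
    rpo: timedelta  # largest tolerable window of lost data


def gaps(targets: RecoveryTargets,
         backup_interval: timedelta,
         measured_recovery: timedelta) -> list:
    """Return the gaps between the stated targets and current reality."""
    found = []
    # Worst case, a failure lands just before the next backup, so the
    # data lost can equal the full interval between backups.
    if backup_interval > targets.rpo:
        found.append("backup interval exceeds RPO")
    if measured_recovery > targets.rto:
        found.append("measured recovery exceeds RTO")
    return found


# The two example companies from the text.
relaxed = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
strict = RecoveryTargets(rto=timedelta(minutes=15), rpo=timedelta(0))

# Hypothetical current state: hourly backups, 45-minute recovery in the last drill.
hourly_backups = timedelta(hours=1)
last_drill = timedelta(minutes=45)

print(gaps(relaxed, hourly_backups, last_drill))
print(gaps(strict, hourly_backups, last_drill))
```

The same setup passes the relaxed targets and fails both strict ones, which is the point: the targets, not the infrastructure, decide whether you have a problem.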
Most startups never have this conversation. They pay for infrastructure that’s much more (or much less) reliable than they actually need, because they never defined what they need.
2. Single points of failure are usually vendors, not servers
When founders think about resilience, they think about their own servers. Multi-region cloud deployments. Database replicas. Auto-scaling.
But for most companies, the real single point of failure is a vendor. Your auth provider. Your payment processor. Your email delivery service. Your CDN. If any one of those goes down, large parts of your product stop working, and there is nothing your own infrastructure can do about it.
Forex brokerages learned this the hard way. A typical brokerage depends on at least a dozen external services: trading platform providers, payment gateways, KYC verification APIs, liquidity providers, SMS gateways for two-factor authentication. A failure in any one of them takes a piece of the business offline.
The mature operators respond by making the most critical dependencies redundant. They integrate two or three payment processors, not one. They maintain a fallback KYC provider. They have documented runbooks for switching from a primary to a secondary vendor in minutes.
You can apply the same thinking to a startup. Ask: which of our vendors, if it went down for twelve hours, would make our product unusable? Then ask: do we have an alternative we have actually tested? If the answer is no, you have a single point of failure that no amount of multi-region infrastructure will fix.
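In code, a tested alternative often comes down to an ordered list of providers behind one function. The sketch below assumes two hypothetical payment processors, represented as plain callables; it does not use any real vendor's SDK, and the outage is simulated.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def charge_with_fallback(
    providers: Sequence[Tuple[str, Callable[[int], str]]],
    amount_cents: int,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, receipt)."""
    errors = []
    for name, charge in providers:
        try:
            return name, charge(amount_cents)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real processor integrations.
def primary_processor(amount_cents: int) -> str:
    raise ConnectionError("simulated outage")


def secondary_processor(amount_cents: int) -> str:
    return f"receipt-{amount_cents}"


used, receipt = charge_with_fallback(
    [("primary", primary_processor), ("secondary", secondary_processor)],
    2500,
)
print(used, receipt)
```

One caveat this sketch ignores: with real payments, a timeout does not prove the charge failed, so a production version would need idempotency keys or reconciliation before retrying elsewhere. That is exactly the kind of gap a tested fallback exposes and an untested one hides.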
3. Failover that hasn’t been tested isn’t failover
There is a particular kind of infrastructure that looks great on a whiteboard. Primary database, secondary database, automated promotion on failure. Diagrams with arrows. “We have failover.”
Then the primary actually fails. The promotion script has a bug. The DNS doesn’t update for thirty minutes. The application servers can’t reconnect to the new primary because of a hard-coded hostname someone forgot about. The “fifteen-minute failover” takes three hours, and your engineers learn about every gap in real time, in production, at three in the morning.
Forex brokerages that survive their first major outage are the ones that have practiced. They run failover drills quarterly. They do tabletop exercises where someone calls out a scenario — “primary database server just died” — and the team walks through every step of the response. They time how long things actually take, compare it to their stated RTO, and close the gaps.
This kind of practice feels unnecessary until you need it, at which point it’s the most valuable investment you ever made. A tabletop exercise costs almost nothing: a one-hour meeting with the right people in the room. A failover drill is a few hours of engineering time per quarter. Neither requires new infrastructure spending, and neither happens at most startups.
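Timing a drill and comparing it to the stated RTO can be as simple as the sketch below. The step names and durations are made up for illustration, loosely following the failure story above; `drill_report` is a hypothetical helper.

```python
from datetime import timedelta


def drill_report(steps, rto):
    """Summarize a failover drill against the stated RTO.

    steps: list of (step name, measured duration) tuples.
    """
    total = sum((duration for _, duration in steps), timedelta())
    slowest_name, _ = max(steps, key=lambda step: step[1])
    return {
        "total": total,
        # Zero if the drill finished within the RTO.
        "over_rto_by": max(total - rto, timedelta()),
        "slowest_step": slowest_name,
    }


# Hypothetical timings recorded during a drill.
steps = [
    ("promote secondary database", timedelta(minutes=10)),
    ("wait for DNS to update", timedelta(minutes=30)),
    ("reconnect application servers", timedelta(minutes=20)),
]

report = drill_report(steps, rto=timedelta(minutes=15))
print(report["total"], report["over_rto_by"], report["slowest_step"])
```

The slowest step tells you where the next few hours of engineering time should go.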
4. Communication is half the recovery
When something breaks, your team’s focus naturally goes to fixing the thing. That’s the obvious priority. But there is a parallel priority that’s almost as important and gets neglected: telling people what’s happening.
Two audiences matter immediately.
Your customers. They will discover the problem on their own within minutes. If they don’t hear from you, they assume the worst — that you’re hiding the issue, that their data is at risk, that you don’t know what you’re doing. The brokerages that handle outages well have pre-written communication templates for the most likely scenarios. They’re ready to deploy through email, in-app banners, and status pages within minutes of the incident starting. The message acknowledges the problem, says what they know, says what they’re doing, and gives a time for the next update.
Your team. Who is the incident commander? Who has the authority to trigger a failover? Who talks to customers? Who escalates to the vendor? In small startups, these roles default to “whoever is awake.” That works for the first few incidents and then breaks down the moment you grow past five engineers or have an incident during off-hours.
The fix is not complicated. Write down an escalation matrix with current phone numbers — not just emails, because your email might be the thing that’s down. Name a backup for every role. Keep the document somewhere that doesn’t depend on the systems that might be failing.
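If the matrix lives in a structured file rather than a wiki page, a few lines of code can flag the usual holes before an incident does. This is a sketch under stated assumptions: the `Contact` and `Role` types, the role titles, and the placeholder names and numbers are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Contact:
    name: str
    phone: str


@dataclass(frozen=True)
class Role:
    title: str
    primary: Contact
    backup: Optional[Contact] = None


def matrix_problems(roles: List[Role]) -> List[str]:
    """List roles with no backup and contacts with no phone number."""
    problems = []
    for role in roles:
        if role.backup is None:
            problems.append(f"{role.title}: no backup named")
        for contact in (role.primary, role.backup):
            if contact is not None and not contact.phone.strip():
                problems.append(f"{role.title}: {contact.name} has no phone number")
    return problems


# Placeholder people and numbers, not real contacts.
roles = [
    Role("Incident commander",
         Contact("Alex", "+1-555-0100"),
         Contact("Sam", "+1-555-0101")),
    Role("Customer communications",
         Contact("Jo", "")),
]

for problem in matrix_problems(roles):
    print(problem)
```

A check like this only confirms the fields are filled in. Whether the numbers are still current is something only a test call during a drill will tell you.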
5. Everything that lives in someone’s head is a liability
There is always one engineer who knows how the deployment really works. One person who set up the monitoring. One person who configured the firewall rules. When that person is on vacation and something breaks, recovery takes far longer than it should.
Forex brokerages treat this as an operational risk and write it out. Every critical system gets a runbook: how to restart it, how to fail over, how to roll back, what to check first. The runbook lives somewhere accessible during an incident — ideally outside the primary infrastructure, because the primary infrastructure might be exactly what’s down.
For startups, the practical version is simpler: pick the three things that would most hurt you if they broke at 3 AM, and write a one-page runbook for each. The author doesn’t have to be a great writer. The point is that someone who isn’t the original engineer can follow it under pressure.
What this actually costs
The instinct, when reading a list like this, is to file it under “things to do when we’re bigger.” That’s a reasonable instinct if you’re imagining the full enterprise-grade version — multi-region active-active, automated failover, dedicated SREs.
But the version that matters for a small company is much cheaper. Two numbers written down. A list of vendors with backups identified. One quarterly tabletop exercise. Three runbooks. An escalation list with current phone numbers. None of this requires a six-figure infrastructure budget. It requires a few hours of thinking and the discipline to test what you write down.
Brokerages that operate this way aren’t necessarily large; many are teams of fifteen to fifty people. They run this discipline because they have seen what happens when they don’t. Most startups learn the same lesson only when their first major outage happens. The window to prepare is before that incident, not after.
For a deeper look at how brokerages structure full disaster recovery and business continuity planning — including the specific infrastructure choices that determine whether an outage takes minutes or days to recover from — Kenmore Design published a detailed operational playbook for forex brokerage disaster recovery that translates well to any operationally intensive business.
The market doesn’t pause while you figure it out. Your competitors don’t wait. The discipline to plan for failure is one of the cheapest competitive advantages a startup can build, and one of the few that compounds.

