We went viral with a broken app

A post-mortem on a weekend of missed opportunity, and a lesson learned about moving fast while making sure you don't break things.

Last Friday, we went viral on Reddit (500K views, 200 comments) with a blog post called "Never ship on Fridays". It was one of those articles our developers write to get their engineering rants out of their system.

It drew the usual clash of opinions, corrections, eye-rolls and gotcha! moments. We loved the whole pandemonium. We are grateful to everyone who read the post and contributed, grateful for the tons of insightful discussions, and even for the smart-asses.

Some stayed to check out our product. They didn’t get very far, though. The setup process for the first UI tests was broken. Ouch. 

What (probably) went wrong

During the user sign-up process we generate real e2e tests, so there are multiple browsers running in the cloud that discover those test cases by parsing the HTML and passing it to our LLM-backed AI agent.

The more people sign up, the more load this puts on our backend servers running in the cloud. We have auto-scaling in place, but at some point the good old "Reddit hug of death" caught up with us and brought our backend to its knees.

This should not have affected our sign-up process, of course, but the second step of the sign-up (taking a screenshot of your actual app in a real browser) currently runs on the same service. We knew it was going to cause scaling problems eventually, but in a startup, you make trade-offs.
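To make the coupling problem concrete, here is a minimal sketch of one common way to isolate two workloads that share a service: give each its own concurrency budget, so a flood of test-discovery jobs cannot use up the capacity the sign-up screenshot needs. This is an illustration under assumptions, not our actual implementation. The class name Bulkhead, the pool names and the limits are all hypothetical.

```typescript
// Hypothetical sketch: separate concurrency budgets per workload.
// None of these names or limits come from the real system.
class Bulkhead {
  private inFlight = 0;

  constructor(
    readonly name: string,
    private readonly maxConcurrent: number,
  ) {}

  /** Reserve a slot; returns false when this pool is saturated. */
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxConcurrent) return false;
    this.inFlight += 1;
    return true;
  }

  /** Free a slot once the job has finished, successfully or not. */
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  get active(): number {
    return this.inFlight;
  }
}

// Assumed limits, chosen only for the demonstration.
const discoveryPool = new Bulkhead("test-discovery", 2);
const screenshotPool = new Bulkhead("signup-screenshot", 2);

// Simulate a traffic spike: five discovery jobs arrive at once.
const acceptedDiscoveryJobs = [1, 2, 3, 4, 5].filter(() =>
  discoveryPool.tryAcquire(),
).length;

// The sign-up screenshot draws from its own pool, so it is unaffected.
const screenshotStillAvailable = screenshotPool.tryAcquire();
```

Rejected discovery jobs would typically be queued or retried later. The point of the sketch is only that the sign-up step keeps its own capacity while the heavier workload is saturated.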

When end-to-end tests on production make sense

While we have great test coverage from unit and e2e tests (as a testing product company 🙂), we didn't run them regularly on production. We have since remedied this and are dogfooding our own product continuously, not just in our development process but also in production. We will be alerted when the sign-up breaks again and can scale our backend manually.
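As a rough illustration of what "being alerted when the sign-up breaks" can look like, the sketch below takes the results of one scheduled production run of a sign-up test and decides whether to raise an alert. It is a simplified assumption, not our real setup: the step names, the threshold of two consecutive failures and the 30-second step budget are made up for the example, and the browser run that produces the results is left out.

```typescript
// Hypothetical alert policy for a scheduled production sign-up check.
type StepResult = { step: string; ok: boolean; durationMs: number };
type MonitorState = { consecutiveFailures: number };
type Decision = { state: MonitorState; alert: string | null };

// Assumed values: alert after two failed runs in a row, and treat any
// step slower than 30 seconds as a failure.
const FAILURE_THRESHOLD = 2;
const MAX_STEP_DURATION_MS = 30000;

function evaluateRun(previous: MonitorState, results: StepResult[]): Decision {
  // A step counts as failed if it errored or blew its time budget.
  const failed = results.find(
    (r) => !r.ok || r.durationMs > MAX_STEP_DURATION_MS,
  );

  // A healthy run resets the failure counter.
  if (!failed) {
    return { state: { consecutiveFailures: 0 }, alert: null };
  }

  const consecutiveFailures = previous.consecutiveFailures + 1;
  // Only alert once the threshold is reached, to avoid paging on a blip.
  const alert =
    consecutiveFailures >= FAILURE_THRESHOLD
      ? `Sign-up check failing at step "${failed.step}" (${consecutiveFailures} runs in a row)`
      : null;

  return { state: { consecutiveFailures }, alert };
}

// Example run results (made-up data).
const healthyRun: StepResult[] = [
  { step: "create-account", ok: true, durationMs: 800 },
  { step: "screenshot", ok: true, durationMs: 4000 },
];
const brokenRun: StepResult[] = [
  { step: "create-account", ok: true, durationMs: 900 },
  { step: "screenshot", ok: false, durationMs: 30000 },
];

const firstFailure = evaluateRun({ consecutiveFailures: 0 }, brokenRun);
const secondFailure = evaluateRun(firstFailure.state, brokenRun);
const recovered = evaluateRun(secondFailure.state, healthyRun);
```

Requiring consecutive failures is a trade-off: it cuts down on noise from one-off flakiness, but delays the alert by one run interval.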

Moreover, we will start disentangling our scaling and infrastructure sooner than originally planned. Which brings us to:

We try not to release on Fridays. Maybe we shouldn’t post blogs either.

May you and your web app stay hydrated!

Octoneers
