Please fix the AWS Free Tier before somebody gets hurt

Let engineers learn cloud without jeopardizing their personal well-being.

Read what the community is saying about this post: Hacker News, r/aws, r/programming

Read news coverage spawned by this post: InfoQ, The Register

I try not to write this newsletter in rant mode. And I’m trying not to rant right now.

After all, the AWS Free Tier has been broken for 10+ years. How urgent of a problem can it be?

But I’ve been shaken all day by this message that appeared in the A Cloud Guru Discord server:

Before anybody worries: this student is fine, they have lots of support now, AWS is on the case.

But all I can think of is that horrible story that appeared during the worst of the pandemic, about the young man who died believing he’d lost hundreds of thousands of dollars on the stock trading app Robinhood.

And I keep thinking: what if this student hadn’t reached out to a developer community? What if AWS Support hadn’t been nudged on Twitter, and had taken a few days to get back? What if the costs (and the panic) had kept spiraling?

Am I being melodramatic? I can hear the objections now.

“It’s the student’s responsibility to know what they’re deploying.”

With all due respect, get out of here with that. Even highly experienced engineers rack up “bills heard round the world”, but at least they’re usually doing it on company credit cards. Students trying to break into the cloud have no financial buffer, and shouldn’t be penalized for learning. I’m not saying learning should be free! Just that it shouldn’t be a game of resource whack-a-mole.

“It was ‘just’ $200, that’s not the end of the world.”

Sure - this time. What if the student had, say, accidentally written a Lambda function that PUTs and GETs the same object to S3 in an infinite loop? How would they have known? They could easily rack up tens of thousands in costs before the billing console even refreshed.
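
For readers who haven’t seen this failure mode in the wild, here’s a minimal sketch of how innocent it looks. The code is my invention, not the student’s, but the shape is the classic S3-to-Lambda recursion:

```python
# Hypothetical sketch (my invention, not anyone's real code) of the classic
# S3 -> Lambda recursion. An ObjectCreated event triggers this function,
# which writes back to the same bucket, which emits another event, which
# triggers the function again -- forever, with every loop billed.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # GET the object that triggered us...
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # ...and PUT a "processed" copy back into the same bucket. The new
        # object fires another ObjectCreated notification, re-invoking the
        # function in an endless loop of Lambda invocations and S3 requests.
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=body)
```

The standard guardrails (write output to a different bucket, or scope the trigger to a prefix the function never writes to) are obvious once you know to look for them, and invisible if you don’t.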

“AWS Support is great about refunding these types of claims, there’s no reason to be alarmed.”

You know that. I know that. The 20-year-old student staring at an unexpected $200 charge didn’t know that. How could they? It’s not a documented resolution path.

Anyway, credit to the student who wrote that message in Discord - despite their panic, they’ve laid out exactly the two worst problems with AWS’s current “free/not free/just kidding/good luck” approach to free accounts:

  • Unexpected charges

  • Inability to find what’s causing the charges

AWS is the only cloud provider that creates these problems. Azure, GCP, even Oracle all give you ways to set billing limits and/or delete a project and feel sure that it’s totally deleted.

On the other hand, I’ve personally got a dormant AWS account that’s charging me cents every month, and I bet you do too. I’m not at all confident that I could figure out where those charges are coming from, and I’m an “AWS Hero”. It would be easier just to destroy the account.
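
If you’re in the same boat, the closest thing AWS gives you to an answer is a Cost Explorer query grouped by service. A quick boto3 sketch; the dates are placeholders, and note that the Cost Explorer API itself bills $0.01 per request, which feels thematically appropriate:

```python
# Sketch: ask Cost Explorer which services those stray cents are coming from.
# Dates are placeholders; the Cost Explorer API charges $0.01 per request.
import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-02-01", "End": "2021-03-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 0:
        print(f"{service}: ${amount:.2f}")
```

That tells you which service is nibbling at the card. It still doesn’t tell you which forgotten resource inside that service is responsible, which is rather the point.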

And, come on, if the only sure solution to closing out the tab on your AWS project is “cancel your credit card and nuke the account”, that’s not a great way to keep customers, is it?

Is there a solution here?

Corey Quinn, your first and last stop for any question that touches AWS billing, has called for an updated free tier that treats “personal learning” AWS accounts differently from “new corporate” accounts, and sets hard billing limits that you can’t exceed.

We could also consider time-limited sandbox accounts that automatically shut themselves down after a period of time; this is the solution A Cloud Guru/Linux Academy has used, with success, for their popular Cloud Playgrounds. But as an ACG employee I’m happy to tell you that feature should be in the AWS console; ACG shouldn’t have had to build it.

Updated 10 PM ET 3/4 - Some have pointed out the existence of AWS Educate Starter Accounts, which give no-credit-card access to a limited but useful subset of AWS services. The problem is that you can only get access to these accounts through student affiliation with a participating educational institution like a high school or university.

It might be more feasible to expand this program, say to any applicant who demonstrates some reasonable threshold of non-bot-ness, than to re-engineer the normal free tier.

In the meantime, if none of this is feasible - AWS, the least you can do is stop talking about training 29 million new engineers on your platform by the year 2025. Until those people have a safe way to learn without jeopardizing their personal well-being, that promise isn’t merely unachievable; it’s irresponsible.

Announcing a new thing

If you love Black Mirror, you might like this.

Hello Cloud Irregular regulars,

Some of you know that in addition to my day job in cloud and my overly colorful attempts at nonfiction, I’m also an SFWA-qualified science fiction writer. My work tends to focus on the near future, and in particular the uncertain impact of AI and ubiquitous computing on our individual and collective well-being. Basically, Black Mirror-type stuff.

My work up till now has mostly been published in traditional sorts of places; in fact, my recent Abyss & Apex story Russian Rhapsody was just nominated for the 2021 Eugie Foster Award.

But today I’m launching a new fiction project in a new format: a science fiction novel that unfolds as a weekly email newsletter.

Eternity Hacks is both a parody of Peak Newsletter culture and a deep dive into the social eternity: a closer-than-you-think future where anyone can live forever … as long as your idea of “live forever” is “Big Tech simulates you as a chatbot using your lifetime of online activity as training data.”

To be clear, Eternity Hacks will not affect or overlap with Cloud Irregular. The newsletter you are reading right now will continue to be the home for my cloud ramblings, and I will not spam this list with updates about the fiction project. So this is your chance to check out Eternity Hacks and subscribe (it’s free, of course!) at newsletter.eternityhacks.com.

The first two bite-size installments are up now, but this is going to get way stranger than you think.

Links and events

The good people of NetApp asked me (probably on the strength of this nonsense) to join a collegial debate on serverless vs containers. Somewhere along the way, Bruce Buffer got involved. In the end I got ten minutes to explain why FaaS is dead and serverless is winning anyway. That video contains probably the most concise summary of everything I’ve been trying to say about serverless for the last 5 years.

On Wednesday I’m joining a panel with about 80 combined years of IT experience to roleplay our way through a challenging cloud migration scenario. The twist: you, the audience, will throw technical and organizational wrenches into the works as we go along. It should be an entertaining mess at the very least.

And two weeks after that, impostor syndrome and all, I’m hosting a group of incredible people including Donovan Brown, Jessica Kerr, Kesha Williams, and Matt Stratton to discuss the future of DevOps in the post-COVID world.

Finally, I interviewed Brent Ozar about what’s up these days with Microsoft SQL Server, the most beloved database your startup will never use.

Just for fun

I recorded this back in 2018 but the recent OVH data center fire story convinced me against my better judgment to bring it out of mothballs. I’m so, so sorry for The Disaster Recovery Song:

You're thinking about docs all wrong

"RTFM" is not the full story

The career-changing art of reading the docs

Hello! Usually I include the longform essay right here in the email, but today I’m linking it from ACG: https://acloudguru.com/blog/engineering/the-career-changing-art-of-reading-the-docs

If you read and internalize only one thing I’ve written in the past year or so, I hope it’s this piece. I’ve seen the strategy outlined here change a lot of careers, including mine.

Links and events

Join me and Microsoft’s Scott Hanselman tomorrow for what’s sure to be an enlightening hour, as we learn how Scott has migrated some seriously legacy web apps to Azure.

Here’s a fun and wide-ranging interview I did with High Scalability’s Todd Hoff about The Read Aloud Cloud, choosing a traditional publisher rather than self-publishing, the greatest cloud service of all time, and much more.

And another interview, me asking the questions this time, with Marianne Bellotti. App modernization, lies we believe about mainframes, and why COBOL will be with us forever.

Finally, I joined The Internet Report to talk cloud resilience in the wake of The Kinesis Incident. To reiterate: everybody has downtime, and multi-region/multi-cloud is expensive; it’s your responsibility to know how much redundancy you should pay for!

Just for fun (probably)

The Year in Cloud: 2020

Kiss 2020 goodbye forever. But first, read these top posts

One side effect of being stuck at home for 9 months: I spent more time writing than I had planned. Between A Cloud Guru and this newsletter, I wrote something like 50,000 words about the cloud in 2020, the length of a reasonably sized book. And that’s not including the actual book.

Again, very little planning went into this - I was mostly just writing about whatever seemed interesting at the time - but looking back over it all, some clear themes emerge.

Here are links to some of my most popular pieces from 2020, arranged around the big ideas they explore (and a few of my favorite cartoons from this year as well).

Lift-and-shift: a great start, a dangerous stop

Dead tree, donut, asteroid: where cloud migrations get stuck

The lift-and-shift shot clock (also adapted into a couple of conference talks, including this one from OSCON)

The central cloud team: a pervasive, often doomed pattern

How your org predicts your CI/CD pipeline

Why central cloud teams fail (and how to save yours)

The strange case of AWS Proton: Conway’s Law-as-a-Service

Multi-cloud: inevitable or fantasy? (Yes)

AWS hearts multi-cloud? It’s gonna happen

The cold reality of the Kinesis Incident

AWS just went multi-cloud … and it’s only the beginning

Serverless: maturing into the mainstream

Code-wise, cloud-foolish (also delivered as a talk at ServerlessDays)

In praise of S3, the greatest cloud service of all time

You’re not ready for feature flags

Building aggregations with DynamoDB Streams

AWS Lambda is winning, but first it had to die

Building the post-COVID cloud…

Tech in the time of COVID-19

A war story about COVID, cloud, and cost

COVID and cost optimization: lessons from leaders

Trends to watch from re:Invent 2020

…and its people

The Cloud Resume Challenge (which lives on as a Discord community and a monthly challenge from A Cloud Guru)

How many certifications do I need to get a cloud job?

The best way to find a cloud job

Interesting interviews

Scaling HEY on AWS and Kubernetes (with Basecamp’s Blake Stoddard)

Is cloud security too easy to screw up? (with HaveIBeenPwned’s Troy Hunt)

Optimizing a multimillion-dollar cloud bill (with ACG’s John McKim)

and finally … Viral Nonsense

The resilience of the AWS Services Song remains baffling to me. It’s been viewed millions of times across various platforms. Most people who watch it have no idea what it is about. And yet even now, every once in a while I’ll get a bunch of notifications from Colombia or somewhere and it’s fun to watch another community discover it for the first time.

In 2021, I resolve to record more silly songs.

Links and events: what’s next?

I think that is enough links for a while. So let’s talk about the future!

This mailing list has more than doubled in size since January 2020 - there are well over 2,000 of you now! - which is cool but also a bit intimidating.

I have no motivation to do Cloud Irregular other than that I like to write about the cloud, and no plans to do anything obnoxious like set up a paywall. I have had some people ask about sponsorships, and I’m not opposed to that, but it would have to be something useful to this community. If you have marketing budget and want to talk to a couple thousand cloud people - reach out, I guess.

In 2021 I do plan to keep the cloud essays going, but I’d also like to explore some different formats - maybe even an option to subscribe just to cartoons. What sounds interesting to you? Feel free to comment or reply and let me know if you have any suggestions or feedback.

Just for fun

2020 wasn’t all bad … we got strong consistency for S3! So I had to update the S3 Ballad for the occasion. Happy holidays. Can’t wait to see you all in person again.

The cold reality of the Kinesis Incident

It was a systemic failure, not a random event. AWS must do better.

This holiday season, give your non-technical friends and family the gift of finally understanding what you do for a living. The Read Aloud Cloud is suitable for all ages and is 20% off at Amazon right now!

One of my pet hobbies is collecting examples of complex systems that people spend lots of time making resilient to random error, but which are instead brought down by systematic error that nobody saw coming.

Systematic error - as opposed to a random, isolated fault - is error that infects and skews every aspect of the system. In a laboratory, it could be a calibration mistake in the equipment itself, making all observations useless.

Outside the laboratory, systematic error can mean the difference between life and death, sometimes literally. The hull of the Titanic comprised sixteen “watertight” compartments that were supposed to seal off a breach, preventing individual failures from spreading. That’s why the shipbuilders bragged that the Titanic was unsinkable. Instead, systemic design error let water spill from one compartment to the next. The moment an iceberg compromised one section of the hull, the whole ship was doomed.

Or take election forecasting. In 2016, pretty much every poll predicted a landslide win for Hillary Clinton — a chorus of consensus that seemed safely beyond any mistake in an individual poll’s methodology. But it turned out that pollsters systematically underestimated enthusiasm for Donald Trump in key states — maybe because of shy Trump voters or industry bias, nobody really knows. If they had known, then maybe every poll wouldn’t have made the same mistakes. But they did. Four years later, nobody trusts the election modeling industry anymore.

That brings us to Wednesday. Several AWS services in us-east-1 took a daylong Thanksgiving break due to what shall be henceforth known around my house as The Kinesis Incident — which sounds like a novel in an airport bookstore, if Clive Cussler wrote thrillers about ulimit.

Please read AWS’s excellent blow-by-blow explanation for the full postmortem, but to sum up quickly: on Wednesday afternoon, AWS rolled out some new capacity to the Kinesis Data Streams control plane, which breached an operating system thread limit; because that part of KDS was not sufficiently architected for high availability, it went down hard and took a long time to come back up; and in the meantime several other AWS services that depend on Kinesis took baths of varying temperature and duration.
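
The arithmetic behind that thread limit is brutally simple: per the postmortem, each front-end server keeps a thread open to every other front-end server, so threads per process grow in lockstep with fleet size until they cross an OS-configured ceiling. A back-of-the-envelope sketch, with made-up numbers:

```python
# Back-of-the-envelope model of the failure mode: one thread per peer means
# per-process thread count tracks fleet size until it crosses an OS limit.
# All numbers here are invented for illustration, not AWS's real figures.
import resource

OS_THREAD_CEILING = 10_000  # hypothetical per-process cap

for fleet_size in (1_000, 5_000, 9_000, 10_500):
    threads_needed = fleet_size - 1  # one thread per peer, all-to-all
    verdict = "fine" if threads_needed < OS_THREAD_CEILING else "over the limit"
    print(f"fleet={fleet_size:>6}  threads/process={threads_needed:>6}  {verdict}")

# The analogous knob on a Linux box (threads count against this limit):
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC here: soft={soft}, hard={hard}")
```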

Hot takes vs cold reality

Plenty of hot takes have been swirling around AWS Twitter over the long weekend, just as they did after the great S3 outage of 2017. Depending on who you listen to, The Kinesis Incident is …

  • A morality tale about OS configs! (I mean, sure, but that’s not the interesting part of this.)

  • Yet another argument for multi-cloud! (No, it isn’t; multi-cloud at the workload level remains expensive nonsense. Please see the previous Cloud Irregular for an explanation of when multi-cloud makes sense at the organization level.)

  • An argument for building multi-region applications! (This sounds superficially more reasonable, but probably isn’t. Multi-region architectures — and I’ve built a few! — are expensive, have lots of moving parts, and limit your service options almost as much as multi-cloud. Multi-region is multi-cloud’s creepy little brother. Don’t babysit it unless you have to.)

  • An argument for *AWS’s internal service architectures* being multi-region! (I have no idea how this would work compliance-wise. And I think it would just make everything worse, weirder, and more confusing for everyone.)

Forget the hot takes. Here’s the cold reality: The Kinesis Incident is not a story of independent, random error. It’s not a one-off event that we can put behind us with a config update or an architectural choice.

It’s a story of systemic failure.

The cascade of doom

Reading between the lines of the AWS postmortem, Scott Piper has attempted to map out the internal dependency tree of last week’s affected services:

The graphic in Scott’s tweet actually understates the problem — for example, no Kinesis also means no AWS IoT, which in turn meant a bad night for Ben Kehoe and his army of serverless Roombas, not to mention malfunctioning doorbells and ovens and who knows what else.

Now, IoT teams understand that their workloads are deeply intertwined with Kinesis streams. But who would have expected a Kinesis malfunction to wipe out AWS Cognito, a critical but seemingly unrelated service? The Cognito-Kinesis integration happens under the hood; the Cognito team apparently uses KDS to analyze API usage patterns. There’s no reason a customer would ever need to know that … until someone has to explain why Kinesis took down Cognito.

But it gets worse. According to the postmortem, the Cognito team actually had some caching in place to guard against Kinesis disappearing; it just didn’t work quite right in practice. So these individual service teams are rolling their own fault-tolerance systems to mitigate unexpected behavior from upstream dependencies that they may not fully understand. What do you want to bet Cognito isn’t the only service whose failsafes aren’t quite perfect?
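
For what it’s worth, the pattern Cognito was presumably reaching for is easy to describe and fiddly to get right: serve stale cached data when the upstream dependency vanishes. A toy sketch, emphatically not AWS’s actual code:

```python
# Toy sketch of the "serve stale data when the dependency is down" pattern.
# Not AWS's implementation -- just the shape of the idea, and a hint at
# where the edge cases live.
import time

class StaleCacheFallback:
    def __init__(self, fetch_fn, ttl_seconds=60):
        self._fetch = fetch_fn      # call out to the upstream dependency
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self):
        fresh = (time.time() - self._fetched_at) < self._ttl
        if fresh and self._value is not None:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = time.time()
        except Exception:
            # Upstream is down. Serve whatever we have, however stale,
            # rather than failing the caller. The hard cases: an empty
            # cache, or an outage that outlasts every freshness assumption
            # the rest of the system quietly makes.
            if self._value is None:
                raise
        return self._value
```

The happy path is ten lines; the unhappy path is where “didn’t work quite right in practice” lives.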

(This is not a story of random error, this is a story of systemic failure.)

The more, the scarier

The edges in AWS’s internal service graph are increasing at a geometric rate as new higher-level services appear, often directly consuming core services such as Kinesis and DynamoDB. Some bricks in this Jenga tower of dependencies will be legible to customers, like IoT’s white-labeling of Kinesis; others will use internal connectors and middleware that nobody sees until the next outage.

Cognito depends on Kinesis. AppSync integrates with Cognito. Future high-level services will no doubt use AppSync under the hood. Fixing one config file, hardening one failure mode, doesn’t shore up the entire tower.
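
You can model that tower as a dependency graph and watch the blast radius of a core-service failure grow as the layers stack up. A toy sketch: the first three edges come from the discussion above, and the fourth is a hypothetical stand-in for whatever ships next.

```python
# Toy model of cascading failure in a service dependency graph. The first
# three edges are taken from the discussion above; the last is a made-up
# stand-in for the next higher-level service.
from collections import defaultdict, deque

DEPENDS_ON = {
    "Cognito": ["Kinesis"],
    "IoT": ["Kinesis"],
    "AppSync": ["Cognito"],
    "FutureHighLevelService": ["AppSync", "IoT"],  # hypothetical
}

def blast_radius(failed_service):
    """Return every service transitively downstream of a failure."""
    dependents = defaultdict(list)
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    affected, queue = set(), deque([failed_service])
    while queue:
        current = queue.popleft()
        for downstream in dependents[current]:
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

print(sorted(blast_radius("Kinesis")))
# ['AppSync', 'Cognito', 'FutureHighLevelService', 'IoT']
```

Every layer added to the graph can only grow that set; shipping a new high-level service never shrinks it.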

The only conclusion is that we should expect future Kinesis Incidents, and we should expect them to be progressively bigger in scope and harder to resolve.

What’s the systemic failure here? Two-pizza teams. “Two is better than zero.” A “worse is better” product strategy that prioritizes shipping new features over cross-functional collaboration. These are the principles that helped AWS eat the cloud. They create services highly resilient to independent failures. But it’s not clear that they are a recipe for systemic resilience across all of AWS. And over time, while errors in core services become less likely, the probability builds that a single error in a core service will have snowballing, Jenga-collapsing implications.

Really, the astonishing thing is that these cascading outages don’t happen twice a week, and that’s a testament to the outstanding engineering discipline at AWS as a whole.

But still, as the explosion of new, higher-level AWS services continues (‘tis the season — we’re about to meet a few dozen more at re:Invent!) and that dependency graph becomes more complex, more fragile, we should only expect cascading failures to increase. It’s inherent in the system.

Unless?

AWS’s own postmortem, when it’s not promising more vigilance around OS thread hygiene, does allude to ongoing efforts to “cellularize” critical services to limit blast radius. I don’t fully understand how that protects against bad assumptions made by dependent services, and I’d be willing to bet that plenty of AWS PMs don’t either. But it’s time for AWS to build some trust with customers about exactly what to expect.

I’ve called for AWS to release a full, public audit of their internal dependencies on their own services, as well as their plan to isolate customers from failures of services the customer is not using. Maybe everything’s fine. Maybe the Kinesis Incident was an anomaly, and AWS won’t suffer another outage of this magnitude for years. But right now I don’t see a reason to believe that, and I’m sure I’m not alone.

Links and events

On that note, re:Invent starts on Monday! It’s three weeks long! It’s free, it’s virtual, it’s just. so. much. I’m going to try to send out a short executive summary of each day via email at A Cloud Guru. Make sure you follow the blog over there for lots of analysis and new feature deep dives from me, other AWS Heroes, and even some special AWS service team guests. I’ll probably pop up in a few other places as well.

Irish Tech News has a nice review out of The Read Aloud Cloud. “For the most part, the rhyming works”, they concede. I’ll take it!

Just for fun
