Almost every software developer out there loves a greenfield project. That's when life is fun because you're just building stuff without a lot of worry. You're making a thing. It's only later, when you've deployed and the money starts rolling in, that it starts, slowly but surely, to become about something else. And I'm not talking simply about the field changing colors from green to brown. I'm talking about the enterprise-y concern of risk mitigation. I mean, wow. It even sounds boring. At some point, your project goes from one of limitless possibilities in the pursuit of creation to one of digging in to defend your turf against slipped deadlines, busted budgets, and irate users. And it's not just that drudgery happens here. So too do stress and existential job worry. Risk mitigation is a management concern. And management concerns can suddenly become developer concerns in profound, mortifying ways.
A Tale of Slight Database Migration
Let's look at a problem that doesn't happen in the early days of a greenfield project. But it does happen once you've released your software into the wild and set about defending your turf. Say you've put greenfields in your rearview mirror and you have responsibility for a production e-commerce web app. In this app, not surprisingly, you have customers and they have customer data. Back in the heady days of the greenfields, you made a regrettable decision to have an enumeration stored in your database called "CustomerType," and it wound up keying some other helper tables that amount to so much cruft in your schema. After living with mounting technical debt along these lines for a while, you said "enough!" Time to bite the bullet and remove this construct during the course of a fairly major release. You won't need that enumeration column or any of its awkward related tables any longer. Instead, you've reconsidered your object graph, realizing that you can infer the customer type in your application code, using other stored data. And so you queue up a major release with an accompanying database migration script. You announce an outage to your users, and you prepare your SQL migration scripts alongside your application code deployment.
The Migration Gone Wrong
The release time window comes, and you push the code to the server. Then you point the schema migration scripts at the database and let 'er rip. You've made backups, of course. As the lights come back on and you do some quick internal sanity testing, everything seems good. So you make the server publicly accessible once again, feeling reasonably confident. Users start to trickle and then pour back into the system and data begins to flow. People are buying stuff, updating their records, and things are going according to plan. Er, well, mostly according to plan. Celebratory beer in hand, you pause before cracking it open because you notice something... odd. Receipts aren't processing yet. You fire up your query browser and look in the database with a mounting feeling of cold dread. Receipts aren't just late in processing -- they don't exist. Frantic querying, scrambling, and double-checking in the source code confirm your worst fears. The receipt generation module retains a vestigial dependency on the now-nonexistent customer type field. You're selling stuff to people. You're shipping them items. But you're not generating or storing receipt information. You have no record of the purchases being made. What do you do now?
Feature Flags as Risk Mitigation
At this point in your deployment, you face essentially two terrible options. You can roll forward, scrambling to patch the software, test it (okay, who are you kidding), and deploy it, fixing the problem. Or you can roll back. But rollback involves restoring the old version of the database and application, effectively wiping out all transactions since go-live. You'll have to hunt down all of those new customers and transactions, manually refunding them and recreating their data in your old schema. These are the sorts of (intense) growing pains that invariably test developers and organizations at some point in their lives. And all sorts of strategies exist for mitigating this type of risk. But I'd like to propose one that may not seem immediately obvious: the feature flag. You probably think of feature flags in the context of simple, visual items (and the code supporting them). Deploy the new look and feel for the customer admin screen, but with a toggle allowing you to revert to the old one. But feature flags can exist at all levels of your application, guarding all kinds of conceptual features. Imagine a world in which you approached the "customer type" problem a bit differently. Instead of excising it from code and schema in one fell swoop, you added the new processing logic, guarded by a feature flag. You could then have made a minor deployment of the application without any schema changes. Then you could have turned the new processing on for a single customer and audited the results with an extremely small amount of risk. After that, turn it on for 10, and then 100, and then everyone if it went well. And only then would you migrate the database schema, dropping the cruft that no one uses anymore anyway.
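To make the "customer type" rollout concrete, here's a minimal sketch in Python. Everything in it is hypothetical: the `FeatureFlag` class, the dictionary-shaped customer records, and especially the inference rule (100 or more lifetime orders means "wholesale") are stand-ins I made up for illustration. A real app would probably pull its flags from configuration or a flag management service rather than holding them in memory.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag that can be on for some customers or for all."""
    enabled_customer_ids: Set[int] = field(default_factory=set)
    enabled_for_everyone: bool = False

    def is_on(self, customer_id: int) -> bool:
        return self.enabled_for_everyone or customer_id in self.enabled_customer_ids


def customer_type_from_column(customer: dict) -> str:
    # Old path: read the stored enumeration column.
    return customer["customer_type"]


def customer_type_inferred(customer: dict) -> str:
    # New path: infer the type from other stored data.
    # The threshold rule is an invented example, not a recommendation.
    return "wholesale" if customer["lifetime_orders"] >= 100 else "retail"


def customer_type(customer: dict, flag: FeatureFlag) -> str:
    # The flag decides, per customer, which path runs. The old column
    # stays in the schema until the new path has proven itself.
    if flag.is_on(customer["id"]):
        return customer_type_inferred(customer)
    return customer_type_from_column(customer)


# Start with a single customer, audit the results, then widen the set.
infer_type_flag = FeatureFlag(enabled_customer_ids={42})
```

Widening the rollout means adding IDs to `enabled_customer_ids` and, eventually, setting `enabled_for_everyone`. Backing out means clearing them again, with no schema restore involved.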
Thinking Bigger: the Forklift Upgrade
As long as you're now thinking of feature flags as more than just cosmetic and visually oriented, let's think even bigger. As a consultant who frequently helps enterprise programs run more smoothly, I see an incredibly common flavor of mistake. I'm talking about the forklift upgrade. The reasoning is understandable. You've got an application that relies on some aging and arcane database technology and you decide to migrate to something modern so that you can actually hire people with appropriate skill sets. What do you do? Well, you undertake a massive project to update everything in your codebase that refers to the old database and have it refer to the new one. This takes years, gigantic migration plans, whiteboards full of ETL strategy, blood, sweat, and tears. When the day finally comes, you take a deep breath, deploy the software, flip a gigantic switch, and pray. I get it. But please, please don't do it this way. It's a massive risk. And if you think the rollback from the last story was bad, this one will be somewhere between catastrophic and impossible.
Feature Flags in the Data Access Code
With a migration, I recommend that you embrace the agile concept of small, thin slices. Deploy a small version of the new database with only a customers table, for instance. And then deploy the next version of your application with feature flags around writes to the customer table. You'll have a flag for the new database and a flag for the old so that you can write to either or both. Now, you can start a parallel writing scheme, gradually populating the new database alongside the old one. This lets you test for integrity in the new database as you go, with very little risk. If anything goes wrong, you can simply turn off the new flag and have your old system back. And it lets you gradually build the new schema, table by table, write by write, until you have a parallel database in full. Once you've got that, you can start flagging reads as well. Incrementally, over months or years, you can phase in the new database while phasing out the old. The forklift upgrade method defers literally all of the risk to the moment that you throw that terrifying switch. The flagging method spreads the risk out in a tiny, thin layer over everything you do, rendering it pretty inconsequential.
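Here's a rough sketch of what that parallel writing scheme might look like in the data access code. It's Python, with plain dictionaries standing in for the old and new databases. The names (`MigrationFlags`, `CustomerRepository`, `mismatched_ids`) are hypothetical, and a real implementation would also have to deal with partial failures and with backfilling rows written before the new flag went on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MigrationFlags:
    """Hypothetical flags: one per write target, plus one for reads."""
    write_old: bool = True
    write_new: bool = False
    read_new: bool = False


@dataclass
class CustomerRepository:
    # Dictionaries stand in for the old and new customer tables.
    old_db: Dict[int, dict] = field(default_factory=dict)
    new_db: Dict[int, dict] = field(default_factory=dict)
    flags: MigrationFlags = field(default_factory=MigrationFlags)

    def save(self, customer_id: int, record: dict) -> None:
        # Write to either store, or both, depending on the flags.
        if self.flags.write_old:
            self.old_db[customer_id] = dict(record)
        if self.flags.write_new:
            self.new_db[customer_id] = dict(record)

    def load(self, customer_id: int) -> Optional[dict]:
        # Reads move over only after the parallel writes have earned trust.
        source = self.new_db if self.flags.read_new else self.old_db
        return source.get(customer_id)

    def mismatched_ids(self) -> List[int]:
        # Integrity check: IDs whose records differ between the two stores.
        ids = set(self.old_db) | set(self.new_db)
        return sorted(i for i in ids if self.old_db.get(i) != self.new_db.get(i))
```

With both write flags on, `mismatched_ids()` gives you a cheap way to audit the new store while the old one keeps serving reads. Turning `write_new` back off leaves the application running against the old database as if nothing had happened.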
Feature Flags as Architectural Strategy
The main, broader takeaway here has to do with how you conceive of feature flags. As I mentioned before, many people think of them just in terms of turning on and off bits of user-facing functionality. And, indeed, that makes for an excellent use case. But you should contemplate their use as part of your architectural strategy as well, including with various flavors of database migrations. You need to take care to avoid complexity and technical debt with this strategy but, done right, feature flags will give you a much more powerful and flexible playbook. They'll help you mitigate risk to make everyone happy, and they'll help developers get back to the joy of building a thing and not worrying about deployment horror stories.