Feature Flags
Feature Toggles (often also refered to as Feature Flags) are a powerful technique, allowing teams to modify system behavior without changing code. They fall into various usage categories, and it's important to take that categorization into account when implementing and managing toggles. Toggles introduce complexity. We can keep that complexity in check by using smart toggle implementation practices and appropriate tools to manage our toggle configuration, but we should also aim to constrain the number of toggles in our system.
Basic Feature Flag Birth
- Added some boolean (or other type) toggle value + conditional logic based on this value.
- Make the toggle value configurable and dynamic in order to change the behavior of the system without changing the code.
- Prepare for release, make manual testing on Pre-Production environment and test both code paths.
- Canary releasing
- A/B testing
- Release
Categorization
Feature Flags can be used in various ways in various contexts. It can be tempting to lump all feature toggles into the same bucket, but this is a dangerous path. The design forces at play for different categories of toggles are quite different and managing them all in the same way can lead to pain down the road.
Feature toggles can be categorized across two major dimensions:
- How long the feature toggle will live.
- How dynamic the toggling decision must be.
Release Toggles
Abilities:
- Used to enable trunk-based development for teams practicing Continuous Delivery.
- Allow incomplete and un-tested codepaths to be shipped to production as latent code which may never be turned on.
- Allow product-managers to not show a partly or fully ready feature to end users (for many reasons).
Characteristics:
- Lives form Days to Weeks - generally not stick around much longer than a week or two (although product-centric toggles may need to remain in place for a longer period).
- Changes with deployment - every toggling decision for a given release version will be the same, and changing that toggling decision by rolling out a new release with a toggle configuration change is usually perfectly acceptable.
Experiment Toggles
Abilities:
- Used to perform Multivariate or A/B testing - each user of the system is placed into a cohort and at runtime the Toggle Router will consistently send a given user down one codepath or the other, based upon which cohort they are in.
Characteristics:
- Lives from Weeks to Months - needs to remain in place with the same configuration long enough to generate statistically significant results. (Longer is unlikely to be useful, as other changes to the system risk invalidating the results of the experiment).
- Changes with each request, highly dynamic - each incoming request is likely on behalf of a different user and thus might be routed differently than the last.
Ops Toggles
Abilities:
- Used to control operational aspects of our system's behavior - we might introduce an Ops Toggle when rolling out a new feature which has unclear performance implications so that system operators can disable or degrade that feature quickly in production if needed.
Characteristics:
- Lives from Moths to Years - Most Ops Toggles will be relatively short-lived - once confidence is gained in the operational aspects of a new feature the flag should be retired. However it's not uncommon for systems to have a small number of long-lived "Kill Switches" which allow operators of production environments to gracefully degrade non-vital system functionality when the system is enduring unusually high load.
- Chanes with runtime re-configuration - Since the purpose of these flags is to allow operators to quickly react to production issues they need to be re-configured extremely quickly - needing to roll out a new release in order to flip an Ops Toggle is unlikely to make an Operations person happy.
Permissioning Toggles
Abilities:
- Used to change the features or product experience that certain users receive. - Turning on new features for a set of internal users [is a] Champagne Brunch - an early opportunity to drink your own champagne.
- A Champagne Brunch is similar in many ways to a Canary Release. The distinction between the two is that a Canary Released feature is exposed to a randomly selected cohort of users while a Champagne Brunch feature is exposed to a specific set of users.
Characteristics:
- Lives from Years - When used as a way to manage a feature which is only exposed to premium users a Permissioning Toggle may be very-long lived compared to other categories of Feature Toggles - at the scale of multiple years.
- Changes with each request - Since permissions are user-specific the toggling decision for a Permissioning Toggle will always be per-request, making this a very dynamic toggle.
Managing different categories of toggles
Static vs Dynamic:
- Toggles which are making runtime routing decisions necessarily need more sophisticated Toggle Routers, along with more complex configuration for those routers.
- For simple static routing decisions a toggle configuration can be a simple On or Off for each feature with a toggle router which is just responsible for relaying that static on/off state to the Toggle Point.
Transient vs Long Lived:
- If we're adding a Release Toggle which will be removed in a few days time then we can probably get away with a Toggle Point which does a simple if/else check on a Toggle Router.
- However if we're creating a new Permissioning Toggle with Toggle Points which we expect to stick around for a very long time then we certainly don't want to implement those Toggle Points by sprinkling if/else checks around indiscriminately. We'll need to use more maintainable implementation techniques.
Implementation Techniques
Toggles Configuration
Dynamic routing vs dynamic configuration
It's important to note that there are two ways in which a flag's decisions might change at runtime. Firstly, something like a Ops Toggle might be dynamically re-configured from On to Off in response to a system outage. Secondly, some categories of toggles such as Permissioning Toggles and Experiment Toggles make a dynamic routing decision for each request based on some request context such as which user is making the request. The former is dynamic via re-configuration, while the later is inherently dynamic. These inherently dynamic toggles may make highly dynamic decisions but still have a configuration which is quite static, perhaps only changeable via re-deployment.
Note: on the previous chart we could see all the part of Feature Toggles. Let's understand what each of them means:
- Feature Toggle - a boolean (or any other type) value which is used to determine whether a feature is enabled or disabled.
- Toggle Point - the place where a toggling decision is made.
- Toggle Router - the logic behind the decision.
- Toggle Context - teaches the Toggle Router how to make dynamic, per-request toggling decisions.
- Static Configuration - a configuration which is quite static, perhaps only changeable via re-deployment.
- Dynamic Routing - a dynamic routing decision for each request based on some request context such as which user is making the request.
- Dynamic Configuration - a configuration which is quite dynamic, perhaps changeable via re-configuration.
Prefer static configuration
- Managing toggle configuration via source control and re-deployments is preferable, if the nature of the feature flag allows it.
- Managing toggle configuration via source is like infrastructure as code - allows toggle configuration to live alongside the codebase being toggled, which provides a really big win: toggle configuration will move through your Continuous Delivery pipeline in the exact same way as a code change or an infrastructure change would.
- This enables the full the benefits of CD - repeatable builds which are verified in a consistent way across environments.
- It also greatly reduces the testing burden of feature flags. There is less need to verify how the release will perform with both a toggle Off and On, since that state is baked into the release and won't be changed (for less dynamic flags at least).
- We can easily see the state of the toggle in previous releases, and easily recreate previous releases if needed.
Approaches for managing toggle configuration
Hardcoded Toggle Configuration
The most basic technique - perhaps so basic as to not be considered a Feature Flag - is to simply comment or uncomment blocks of code. For example:
function reticulateSplines(){
//return oldFashionedSplineReticulation();
return enhancedSplineReticulation();
}
Slightly more sophisticated than the commenting approach is the use of a preprocessor's #ifdef feature, where available.
Because this type of hardcoding doesn't allow dynamic re-configuration of a toggle it is only suitable for feature flags where we're willing to follow a pattern of deploying code in order to re-configure the flag.
Parameterized Toggle Configuration
The build-time configuration provided by hardcoded configuration isn't flexible enough for many use cases, including a lot of testing scenarios. A simple approach which at least allows feature flags to be re-configured without re-building an app or service is to specify Toggle Configuration via command-line arguments or environment variables. This is a simple and time-honored approach to toggling which has been around since well before anyone referred to the technique as Feature Toggling or Feature Flagging. However it comes with limitations. It can become unwieldy to coordinate configuration across a large number of processes, and changes to a toggle's configuration require either a re-deploy or at the very least a process restart (and probably privileged access to servers by the person re-configuring the toggle too).
Toggle Configuration File
Another option is to read Toggle Configuration from some sort of structured file. It's quite common for this approach to Toggle Configuration to begin life as one part of a more general application configuration file.
With a Toggle Configuration file you can now re-configure a feature flag by simply changing that file rather than re-building application code itself. However, although you don't need to re-build your app to toggle a feature in most cases you'll probably still need to perform a re-deploy in order to re-configure a flag.
Toggle Configuration in App DB
Using static files to manage toggle configuration can become cumbersome once you reach a certain scale. Modifying configuration via files is relatively fiddly. Ensuring consistency across a fleet of servers becomes a challenge, making changes consistently even more so. In response to this many organizations move Toggle Configuration into some type of centralized store, often an existing application DB. This is usually accompanied by the build-out of some form of admin UI which allows system operators, testers and product managers to view and modify Features Flags and their configuration.
Distributed Toggle Configuration
Using a general purpose DB which is already part of the system architecture to store toggle configuration is very common; it's an obvious place to go once Feature Flags are introduced and start to gain traction. However nowadays there are a breed of special-purpose hierarchical key-value stores which are a better fit for managing application configuration - services like Growthbook, Launchdarkly, or Posthog. Configuration can be modified dynamically whenever required, and all nodes in the cluster are automatically informed of the change - a very handy bonus feature. Managing Toggle Configuration using these systems means we can have Toggle Routers on each and every node in a fleet making decisions based on Toggle Configuration which is coordinated across the entire fleet.
Working with feature-flagged systems
Expose current feature toggle configuration
It's always been a helpful practice to embed build/version numbers into a deployed artifact and expose that metadata somewhere so that a dev, tester or operator can find out what specific code is running in a given environment. The same idea should be applied with feature flags. Any system using feature flags should expose some way for an operator to discover the current state of the toggle configuration.
Take advantage of structured Toggle Configuration files
It's typical to store base Toggle Configuration in some sort of structured, human-readable file (often in YAML format) managed via source-control. There are some additional benefits we can derive from this file. Including a human-readable description for each toggle is surprisingly useful, particularly for toggles managed by folks other than the core delivery team. What would you prefer to see when trying to decide whether to enable an Ops toggle during a production outage event: basic-rec-algo or “Use a simplistic recommendation algorithm. This is fast and produces less load on backend systems, but is way less accurate than our standard algorithm.”? Some teams also opt to include additional metadata in their toggle configuration files such as a creation date, a primary developer contact, or even an expiration date for toggles which are intended to be short lived.
Manage different toggles differently
As discussed earlier, there are various categories of Feature Toggles with different characteristics. These differences should be embraced, and different toggles managed in different ways, even if all the various toggles might be controlled using the same technical machinery.
Let's revisit our previous example of an ecommerce site which has a Recommended Products section on the homepage. Initially we might have placed that section behind a Release Toggle while it was under development. We might then have moved it to being behind an Experiment Toggle to validate that it was helping drive revenue. Finally we might move it behind an Ops Toggle so that we can turn it off when we're under extreme load. If we've followed the earlier advice around de-coupling decision logic from Toggle Points then these differences in toggle category should have had no impact on the Toggle Point code at all.
However from a feature flag management perspective these transitions absolutely should have an impact. As part of transitioning from Release Toggle to an Experiment Toggle the way the toggle is configured will change, and likely move to a different area - perhaps into an Admin UI rather than a yaml file in source control. Product folks will likely now manage the configuration rather than developers. Likewise, the transition from Experiment Toggle to Ops Toggle will mean another change in how the toggle is configured, where that configuration lives, and who manages the configuration.
Feature Toggles introduce validation complexity
- With feature-flagged systems our Continuous Delivery process becomes more complex, particularly in regard to testing.
- While a feature-flagged release candidate does need testing with a few toggle configurations, it is not necessary to test every possible combination. Most feature flags will not interact with each other, and most releases will not involve a change to the configuration of more than one feature flag.
- It's most important to test the toggle configuration which you expect to become live in production, which means the current production toggle configuration plus any toggles which you intend to release flipped On.
- It's also wise to test the fall-back configuration where those toggles you intend to release are also flipped Off.
- To avoid any surprise regressions in a future release many teams also perform some tests with all toggles flipped On.
- A good convention is to enable existing or legacy behavior when a Feature Flag is Off and new or future behavior when it's On.
Where to place your toggle
Toggles at the edge
- For categories of toggle which need per-request context (Experiment Toggles, Permissioning Toggles) it makes sense to place Toggle Points in the edge services of your system - i.e. the publicly exposed web apps that present functionality to end users.
- A side-benefit of placing Toggle Points at the edge of your system is that it keeps fiddly conditional toggling logic out of the core of your system.
- Placing Toggle Points at the edges also makes sense when you are controlling access to new user-facing features which aren't yet ready for launch.
- It's interesting to note that with some of these types of feature flag the bulk of the unreleased functionality itself might actually be publicly exposed, but sitting at a url which is not discoverable by users.
Toggles in the core
- There are other types of lower-level toggle which must be placed deeper within your architecture.
- These toggles are usually technical in nature, and control how some functionality is implemented internally.