- YEAR 2016. CONFERENCE ROOM AT A LARGE FINANCIAL SERVICES ORGANIZATION -
- PROFESSIONAL SERVICES ENGINEER PULLING HIS HAIR OUT -
(Yelling)
Noooooooooooooo!
(room becomes silent, all turn to look at the engineer)
AND SO IT BEGINS:
Soon - on a CloudBees Flow server in a data center near you!
OK, enough with the cheesy script.
Let’s talk business. ACL business.
You may not want to (rarely does anyone wake up in the morning thinking, ‘this is a beautiful day for a deep-dive into Access Control’), but trust me: you REALLY do need to talk ACL. Like, now.
The CloudBees Professional Services team recently wrapped up a long-term engagement at a major US financial institution where, as part of an extensive implementation, we were asked to implement extremely fine-grained access restrictions on multiple objects in CloudBees Flow, such as:
procedures to run
pipelines to modify and run
data
environments
and many more…
One of the main requirements was to make sure that those with access to the DEV environment did not have access to PROD. This separation of roles and environments protects against possible breaches: if someone inserts malicious code in DEV, that same person cannot push it to PROD. Promoting changes to higher environments requires a separate tier of stakeholders, who serve as a checkpoint/gateway.
Our experience has taught us that properly defining ACLs (Access Control Lists) for large-scale organizations is not a simple task. Because ACLs are so powerful, and their restrictions and permissions reach into every part of your pipelines and procedures, they require thoughtful planning. Without careful consideration, changing ACLs midway can make things very complicated, particularly for large enterprises with fine-grained permission levels, separation of duties, and approval processes.
Thinking about ACLs early in your implementation and pipeline development will help you design and maintain an Access Control policy that:
Provides the right level of control
Is maintainable
Limits complexity
Here are some things to keep in mind when designing your ACL policy:
Role-based access controls: It’s advisable to limit the number of Roles and map a finite list of approved actions to each Role. To manage role-based access controls, you can use groups, such as Local or LDAP groups.
‘Deny’ takes priority over ‘Allow’. Therefore, we recommend using ‘Deny’ as sparingly as possible when setting up ACLs. Remember: if you belong to one group with ‘Deny’ and another group with ‘Allow’, you are effectively denied.
Inheritance: if no ACL is defined on an object itself, the permission check goes to the object’s parent. This is a great feature, but you need to be aware of the side effects of changes, especially when you update the ACLs of objects higher in the hierarchy, since their settings trickle down. The obvious example: any change to the "server" object (the top-level object) has system-wide repercussions.
Objects vs. Properties: CloudBees Flow allows you to control access to all its objects (projects, procedures, environments, releases, pipelines, resources, workspaces...). While we do not restrict access to individual Properties, many of our customers create them in a property sheet and assign ACL access rights to protect that sheet.
Test often: Be sure to test your access control settings thoroughly with every new change.
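To make the ‘Deny beats Allow’ and inheritance rules concrete, here is a minimal Python sketch. It is a simplified model written for illustration, not the CloudBees Flow API: `FlowObject`, `is_allowed`, the group names and the privilege strings are all hypothetical, and the product’s real evaluation rules may differ in detail. The check walks up from an object to its parents until it finds an ACL entry for one of the user’s groups, and at that level a single ‘Deny’ wins.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

ALLOW, DENY = "allow", "deny"


@dataclass
class FlowObject:
    """Hypothetical stand-in for a Flow object (server, project, procedure...)."""
    name: str
    parent: Optional["FlowObject"] = None
    # group name -> {privilege: ALLOW or DENY}; empty means "no ACL defined here"
    acl: Dict[str, Dict[str, str]] = field(default_factory=dict)


def is_allowed(obj: Optional[FlowObject], groups: Set[str], privilege: str) -> bool:
    """Simplified check: inherit from parents, and let DENY beat ALLOW."""
    while obj is not None:
        # Collect the settings any of the user's groups has on this object.
        settings = {
            obj.acl[g][privilege]
            for g in groups
            if privilege in obj.acl.get(g, {})
        }
        if settings:
            # Belonging to one denied group is enough to be denied.
            return DENY not in settings
        # Nothing defined for these groups here: fall back to the parent.
        obj = obj.parent
    return False  # no matching entry anywhere in the chain


# Hypothetical hierarchy: server -> project -> procedure
server = FlowObject("server", acl={"Everyone": {"read": ALLOW}})
project = FlowObject(
    "Payments",
    parent=server,
    acl={"Payments-Dev": {"execute": ALLOW}, "Contractors": {"execute": DENY}},
)
procedure = FlowObject("DeployToQA1", parent=project)  # no ACL of its own

print(is_allowed(procedure, {"Payments-Dev"}, "execute"))
print(is_allowed(procedure, {"Payments-Dev", "Contractors"}, "execute"))
print(is_allowed(procedure, {"Everyone", "Contractors"}, "read"))
```

The second call is denied even though one of the user’s groups is allowed, and the third is answered by the `server` object, because neither the procedure nor the project says anything about `read`. Flip that one `Everyone` entry on `server` to `DENY` and the answer changes for every object that inherits from it, which is exactly the trickle-down effect to watch for.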
Best practices for implementing your ACL policy:
Decide on your groups: When we designed our ACL matrix for this financial institution, we decided to define 3 Roles: Release Managers, Deployment Managers, and Integration Engineers. Each Role had a different set of access controls depending on whether it applied to the Production environment or to Pre-Production environments, such as Dev, QA1, QA2, etc.
Identify and document all objects and levels to control access to: We used an Excel spreadsheet as one-page documentation of the ACL policy that we could refer to easily. This is an important step to take before you start coding your ACLs into the system, to make sure you have visibility into the entire ACL matrix.
Use Factory Procedures and CloudBees Flow DSL to easily add new applications or projects. We created a couple of DSL scripts to be invoked inside Procedure steps:
One for the ACLs on top-level objects, discretionary projects, and plugins
One to set ACLs per Division (and a top-level step to loop through the Divisions)
Procedures make it easy for the Administrator to allow controlled actions and to reproduce ACLs across all environments. For large-scale DSL implementations, make sure to use group-naming conventions, which make your ACLs much easier to maintain and script in DSL.
Test your implementation: Sometimes you don’t foresee the impact of a small change, so be sure to test your implementation. In our case, we even had a separate installation of the CloudBees Flow server in each environment (Dev, QA1, QA2 and Prod), which allowed us to test our automation code the same way you would test application code as it gets promoted through the pipeline.
Don’t compromise ACL fidelity across all environments: As is often the case, there were some use cases we had not planned for, or that were not defined in the requirements, so some groups of people were locked out of certain actions or environments.
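The matrix-plus-naming-convention approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DSL scripts we used: the Division name, the `<Division>-<Role>-<tier>` group convention, the object paths and the privileges assigned to each Role are all hypothetical, and the step that actually writes each entry to a Flow server (your DSL procedure step) is not shown. The point is that the whole policy becomes plain data generated from one matrix, so every environment gets the same result and no ‘Deny’ is needed.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple

# Roles and environments from the engagement described above.
ROLES = ["ReleaseManagers", "DeploymentManagers", "IntegrationEngineers"]
ENVIRONMENTS = {"DEV": "preprod", "QA1": "preprod", "QA2": "preprod", "PROD": "prod"}

# Hypothetical permission matrix: (role, tier) -> privileges. In practice this
# would be loaded from the documented policy spreadsheet, not hard-coded.
MATRIX: Dict[Tuple[str, str], List[str]] = {
    ("ReleaseManagers", "preprod"): ["read", "modify", "execute"],
    ("ReleaseManagers", "prod"): ["read", "execute"],
    ("DeploymentManagers", "preprod"): ["read", "execute"],
    ("DeploymentManagers", "prod"): ["read", "execute"],
    ("IntegrationEngineers", "preprod"): ["read", "modify", "execute"],
    ("IntegrationEngineers", "prod"): ["read"],
}

Entry = Tuple[str, str, str, str]  # (object path, group, privilege, setting)


def group_name(division: str, role: str, tier: str) -> str:
    """Hypothetical naming convention: <Division>-<Role>-<tier>."""
    return f"{division}-{role}-{tier}"


def acl_entries(divisions: Iterable[str]) -> List[Entry]:
    """Expand the matrix for every Division, in a deterministic order."""
    entries: List[Entry] = []
    for division, env, role in product(sorted(divisions), sorted(ENVIRONMENTS), ROLES):
        tier = ENVIRONMENTS[env]
        for privilege in MATRIX[(role, tier)]:
            entries.append(
                (f"{division}/{env}", group_name(division, role, tier), privilege, "allow")
            )
    return entries


for entry in acl_entries(["Payments"])[:3]:
    print(entry)
```

Because Pre-Production and Production use different groups, a `-preprod` group never shows up on a PROD object, which is the DEV/PROD separation requirement expressed as data. Adding a new Division is one more name in the list passed to `acl_entries`.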
How it can go wrong:
Against our recommendation, the customer decided to override permissions manually in the QA1 environment, as a quick fix so that testing could resume.
While this may seem like a good way to save time, remember that the gains are very short-lived: temporarily overwriting the ACL definitions in the system, or having inconsistent definitions across environments, not only compromises the rest of your ACL testing (since permission settings have been manually overwritten) but may also compromise your auditability and security altogether.
As we expected, because the configuration on QA1 had been overwritten, we ended up finding additional permission settings that needed fine-tuning in the higher environments, QA2 and even PROD, which we had to fix later in the process.
The lax ACLs also enabled some team members to ‘take advantage’ and bypass some of the mandatory environments, tests, or gateways along the pipeline. For example, a team that was given temporary permission to bypass DEV and QA1 suddenly decided to develop a new feature directly on QA2, without validating the code in lower environments and integration tests. The sporadic changes to different environments along the pipeline also caused considerable configuration drift and code drift, so code that had been developed on DEV could not reliably run in the higher environments.
Cutting Corners is never a good long-term strategy
Remember: your developers and Ops folks all have good intentions. Truly. But it’s also human nature to try to bypass some of the rules and checkpoints in place if we feel it might make our job even slightly faster. What ends up happening is that these decisions to bypass automation for ACLs (or other security validations) slow down your entire release pipeline, and can severely jeopardize your product quality, production environment and service continuity.
As you can see, the few hours we may have saved by taking a shortcut on QA1 were paid back many times over later.
The Director’s Cut:
So, after these ACL wars, we’ve all gained a few more grey hairs than we had at the start of 2016. Here are some lessons we learned:
Always use an automatic mechanism to set up your ACLs. It is very easy to do with DSL nowadays. This will make your security policy reproducible in all your environments!
Stick to your automatic way!
Do not set up permissions manually in the UI. It will bite you later in QA or PROD. We knew, but we caved under pressure, and it made everyone’s experience worse afterwards.
Shortcuts, when it comes to security, are very painful and risky. They will cost you far more than you save.
Document your ACL policy, your permission matrix, and the reasons why you defined things the way you did. In three months, you will not remember why project X has ‘Write’ permissions on your Artifact Repository or why Joe is allowed to run this particular hidden procedure.
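Sticking to the automated way also makes manual overrides easy to catch. Here is a minimal Python sketch of a drift check, assuming you can export each server’s current ACL entries into the same `(object, group or user, privilege, setting)` tuples your policy generates; how you export them is outside this sketch, and the entries, paths and user `joe` below are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str, str, str]  # (object path, group or user, privilege, setting)


def acl_drift(desired: Iterable[Entry], actual: Iterable[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Compare the ACLs generated from the policy with those found on a server.

    Returns (missing, unexpected): entries the policy requires but the server
    lacks, and entries on the server the policy never asked for (for example,
    permissions added by hand in the UI).
    """
    desired_set: Set[Entry] = set(desired)
    actual_set: Set[Entry] = set(actual)
    return sorted(desired_set - actual_set), sorted(actual_set - desired_set)


desired = [
    ("Payments/QA1", "Payments-IntegrationEngineers-preprod", "execute", "allow"),
    ("Payments/QA1", "Payments-ReleaseManagers-preprod", "modify", "allow"),
]
# Hypothetical export from QA1 after a "quick fix" in the UI.
actual_qa1 = [
    ("Payments/QA1", "Payments-IntegrationEngineers-preprod", "execute", "allow"),
    ("Payments/QA1", "joe", "modify", "allow"),
]

missing, unexpected = acl_drift(desired, actual_qa1)
print("missing:", missing)
print("unexpected:", unexpected)
```

Running a check like this against every environment, for instance as a gate in the pipeline that fails on any non-empty result, turns "we caved under pressure" into something that is visible the same day rather than months later in PROD.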
Feel free to share your own ACL horror stories in the comments below. After all, if we don't learn from our history, we are bound to repeat our mistakes.
And going into 2017, my wish for all of us is that the next time we implement ACLs, it flows a bit less like a war movie or a nail-biting thriller, and more like a romantic comedy: we hit one or two minor glitches midway, recover quickly, laugh about it, and end with a “happily ever after”!
Here’s to that! :)