I asked myself this question at my first job at Scalable Capital, 4 years ago. I had started at a FinTech startup/scaleup with somewhere between 50 and 100 engineers, enthusiastic about DevOps and a "you build it, you run it" mindset, which I had learned about during my software engineering studies at university.
But reality hit hard. You can't simply give every engineer full production access and justify lax credential management with "We want to give people ownership and trust."
Yes, you want to hire smart people as a leader, and you need to trust your engineers for them to be productive. But even if your hiring is perfect and you never hire anyone who is malicious, even the greatest minds make mistakes, and giving every developer access to production environments will become an issue eventually.
This doesn't have to be leaking a credential or leaving their MacBook open at a Starbucks. It can be as simple as running a DROP TABLE statement on the production database instead of the development environment.
And I am not making these scenarios up: this happens in the real world, even to companies like GitLab.
Why do developers need access to production?
In some organizations I've heard the argument that developers don't need access to production at all. Actually, no one would need it. Or a phrase like: "Developers should just write the code and then the operations team will deploy it." But in my opinion this is a very outdated way of thinking. In a modern DevOps-style "you build it, you run it" culture, devs need access to the production environment. How else are they supposed to do their job of "running it" or feel any sense of ownership? You don't have to ask someone else for the key to your own car, do you? If you're not convinced yet, here are 3 simple example scenarios that you can probably relate to:
- If production is down, e.g. a service is not reachable after a migration failed
- A bug has to be investigated, e.g. a user reports that a payment is not being processed
- Someone from business wants a report that the BI system doesn't cover yet, e.g. some tax report
In all these cases someone has to log in to prod, probably connect to the database, and run some queries or statements. And usually the best person to solve a problem or help with the ad hoc request is the one who actually wrote the code in the first place and knows how the schema is structured and used.
Why giving everyone access to production is a bad idea
But as I mentioned earlier, giving everyone access to production systems is a bad idea. And it's not just a security risk, it's also a risk to the stability of the system. The main risks with production access are, in order of likelihood:
- Innocent mistakes: Running a query on the wrong database system, or even the wrong server, or running an inefficient query that locks the database and brings down a part of the system. Developers execute DROP TABLE on the wrong database all the time or accidentally run their test suite against the production database...
- Accidental leaks: Developers might accidentally leak credentials by sending them via Slack or other messengers. This is a common issue, and as soon as the wrong person gets their hands on the credentials, the system is compromised. This can also happen by leaving the laptop open at a Starbucks or maybe even by simply getting your luggage stolen at an airport.
- Malicious intent: This is the least likely scenario, but it is a possibility and a disastrous one if it happens. If a developer is disgruntled or even just wants to make a quick buck, they could do a lot of damage. This is especially dangerous if the developer has access to the production database, as they could steal sensitive data, delete user data, or hold the company hostage via ransomware. The latter two of course require write access.
If you start searching for these scenarios you will find articles for every single one, so I am not making this up. You need to have a sound setup around your production environment, otherwise one of these will eventually happen.
How not to do it
A common approach is to have an operations team handle such topics, or to have the so-called SRE or DevOps team do it. These teams don't do normal software development but are instead responsible for infrastructure. But burdening them with database access management usually slows down the resolution of problems, since information first has to travel from one team to the other, and it's impossible for one team to understand the whole system well enough to be operationally responsible. It's also very stressful work to be the one who has to fix things all the time, and it's not very rewarding either. Firefighting will usually kick up your adrenaline and cortisol levels, but it's not a good way to spend all of your working life.
But something is even worse in this setup: it splits the ownership of a function of your software. And this goes very much against the core of a modern DevOps culture (or, more recently, platform engineering). Splitting ownership causes a variety of problems, most notably:
- Blame shifting: If something goes wrong, it's the other team's fault
- Lack of understanding: If you don't have to fix it, you don't have to understand it
- Slow feedback loop: If you have to wait for another team to fix it, you can't learn from your mistakes
- Lack of responsibility: If you're not responsible for it, you don't care about it
- Lack of motivation: If you're not responsible for it, you don't get the satisfaction of fixing it
I personally deem the slow feedback loop the main issue. If a team doesn't have to operate their software, they can't learn from their mistakes. And if you can't learn from your mistakes, you can't improve your software engineering practices and skills. This means teams will ship code that is potentially slow, hard on the database, causes inconsistencies in the data, or even introduces security vulnerabilities, without ever being held accountable for it.
The "SRE/DevOps team solves it" approach has another issue as well: if a whole team gets willy-nilly access to production, that is still a security risk. If one of the engineers' credentials gets leaked, the whole system is compromised. Or they accidentally install a malicious package, or even worse, they are malicious themselves. This happened at LastPass, as the company reported earlier in 2023:
Incident 2 Summary: The threat actor targeted a senior DevOps engineer by exploiting vulnerable third-party software. The threat actor leveraged the vulnerability to deliver malware, bypass existing controls, and ultimately gain unauthorized access to cloud backups. The data accessed from those backups included system configuration data, API secrets, third-party integration secrets, and encrypted and unencrypted LastPass customer data.
So what's the solution?
Do you have to decide between a modern, agile workflow and security? Can you not have both?
To a degree, it is in fact often a tradeoff that is made: introducing more processes to at least achieve compliance and, on the flip side, dealing with slower development. But these processes seldom make the system more secure, resulting in a sort of security theater.
Sadly the alternative is a bit expensive, but it usually makes a lot of sense to invest in more sensible processes around developer access, so hear me out: it involves building up an internal toolstack that allows engineers to safely access production directly. This is in fact what larger companies often end up doing.
I talked to engineers at AWS and Azure, and both have fully fleshed-out internal applications that allow engineers to shadow each other's sessions when they access production resources, so they can pull the plug on each other in case someone attempts something shady. As long as developers work together, they have rather free rein over what to do in prod, but because 2 or more people are involved in every action, the risk of something going wrong is drastically reduced.
The core features of a solution like that should be:
- SSO login: So no credentials are shared. Every modern cloud-native tool supports this.
- Extensive audit trail: So if something still goes wrong, you at least have a trace.
- 4-eyes principle: This is by far the best measure to actually prevent mistakes and malicious attacks.
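To make the audit trail point a bit more concrete, here is a minimal sketch in Python. The AuditEvent fields and the AuditLog class are names I made up for illustration, not the API of any existing tool, and the in-memory list stands in for whatever durable, append-only storage you would actually use. The one assumption that matters is that the actor is the personal identity coming from your SSO provider, never a shared account.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable record of who did what, where, and when (hypothetical schema)."""
    actor: str       # personal identity from the SSO provider, never a shared account
    action: str      # e.g. "request", "approve", "execute"
    target: str      # e.g. the name of the database the statement is meant for
    statement: str   # the exact statement, so the trail can be reviewed later
    timestamp: str   # UTC time in ISO 8601 format


class AuditLog:
    """Append-only log kept in memory; a real one would write to durable storage."""

    def __init__(self):
        self._events = []

    def record(self, actor, action, target, statement):
        event = AuditEvent(
            actor, action, target, statement,
            datetime.now(timezone.utc).isoformat(),
        )
        self._events.append(event)
        return event

    def by_actor(self, actor):
        """Everything a given person did, in the order it happened."""
        return [e for e in self._events if e.actor == actor]

    def export(self):
        """One JSON document per line, e.g. for shipping to a log system."""
        return "\n".join(json.dumps(asdict(e)) for e in self._events)


log = AuditLog()
log.record("alice@example.com", "request", "prod-db", "UPDATE payments SET state = 'retry'")
log.record("bob@example.com", "approve", "prod-db", "UPDATE payments SET state = 'retry'")
```

Because every event carries a personal identity, the question "who ran this on prod?" always has an answer, which is exactly what shared credentials take away.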
But this sounds quite generic, why is there no prebuilt solution out there for this task?
Well, there are a few, most notably Teleport, which at least in the enterprise version could tick all the boxes. But it is also incredibly expensive as well as complex to set up.
Four Eyes Principle
I want to expand a bit on the 4-eyes principle since it is a very powerful tool in security. It is a principle that requires that at least two people approve an action before it can be executed. It makes it much harder for a single person to do something malicious. It is also a very powerful tool in preventing mistakes, as it makes it much more likely that someone will catch a mistake before it is executed.
We use the principle in software engineering all the time: code reviews are a form of the 4-eyes principle. But it is also used in other fields, e.g. in the military, where two people have to turn a key at the same time to launch a missile. It is also used in banking, where two people have to approve a transaction above a certain amount. So why not use it for production access?
You can even expand on the 4-eyes principle, e.g. by requiring that the two people have to be from different teams, or that they have to be from different departments. This makes it even harder for a single person to do something malicious, as they would have to convince someone from another team to help them. Or you require CTO approval, or approval of 3 people, or 4 people, or 5 people. The possibilities are endless, each step making it harder for a single person to do something malicious, without losing the agility of individuals having direct access to production.
The best part is that you do not end up with a weak link in your system, as you would with a single person having access to production. If one person's credentials get leaked, the system is still secure, as they would need another person to approve their actions. And if one person makes a mistake, the other person can catch it before it is executed.
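Here is a rough sketch of how such an approval gate could look in Python. All names in it (ExecutionRequest, approve, execute, the team labels) are hypothetical and only serve the illustration; this is not how Kviklet or any other tool implements it. It assumes a request knows who created it, how many approvals it needs, and whether approvers have to come from another team.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionRequest:
    """A statement someone wants to run on production (hypothetical model)."""
    requester: str
    requester_team: str
    statement: str
    required_approvals: int = 1       # 1 approval + the requester = four eyes
    require_other_team: bool = False  # stricter variant: approver from another team
    approvals: dict = field(default_factory=dict)  # reviewer -> reviewer's team

    def approve(self, reviewer, reviewer_team):
        if reviewer == self.requester:
            raise PermissionError("requesters cannot approve their own request")
        if self.require_other_team and reviewer_team == self.requester_team:
            raise PermissionError("approval must come from a different team")
        # A dict keyed by reviewer means the same person approving twice counts once.
        self.approvals[reviewer] = reviewer_team

    @property
    def is_approved(self):
        return len(self.approvals) >= self.required_approvals


def execute(request, run):
    """Call run(statement) only once enough other people have approved."""
    if not request.is_approved:
        raise PermissionError("not enough approvals yet")
    return run(request.statement)


request = ExecutionRequest("alice", "payments", "DELETE FROM sessions WHERE expired")
request.approve("bob", "payments")
result = execute(request, run=lambda sql: f"executed: {sql}")
```

With a leaked credential, an attacker can create a request as alice but cannot approve it, so execute keeps refusing. Raising required_approvals or setting require_other_team tightens the gate without changing anything else in the workflow.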
For a startup, I would usually suggest giving every engineer read access and only requiring a second person for writes, but the principle can be expanded to any level of security that you actually need.
So if you're just as baffled as I am that a tool implementing this principle doesn't exist yet and that we are left with suboptimal or even insecure practices, feel free to have a look at Kviklet, which is my attempt at solving this problem once and for all.
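As a small sketch of that read/write split, a policy function could decide whether a statement needs a second person. The function name and the keyword list are my own assumptions, and checking the first keyword is deliberately naive: it is no substitute for enforcing read-only access with a proper database role. When in doubt, the sketch errs on the side of asking for a review.

```python
# Statements starting with these keywords are treated as reads (assumed list).
# EXPLAIN is left out on purpose: in PostgreSQL, EXPLAIN ANALYZE actually
# executes the statement it explains.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW"}


def needs_second_person(statement):
    """Return True if a statement should wait for approval before running."""
    stripped = statement.strip()
    if not stripped:
        return True  # nothing recognizable, so be conservative
    # A semicolon anywhere but the very end hints at several statements in one
    # request, e.g. "SELECT 1; DROP TABLE users", so ask for a review.
    if ";" in stripped.rstrip(";"):
        return True
    first_word = stripped.split(None, 1)[0].upper()
    return first_word not in READ_ONLY_KEYWORDS


for sql in ["SELECT * FROM payments", "DROP TABLE payments;"]:
    print(sql, "->", "needs approval" if needs_second_person(sql) else "run directly")
```

Anything the function does not recognize as a plain read, including statements like WITH ... DELETE, falls through to the approval path, which is the safer default.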