Published on 18 May 2024

Contrary to common belief, the persistent "blame game" between Dev and Ops is not a personality conflict; it’s a direct result of broken organisational systems and misaligned incentives.

  • Shared goals are meaningless if individual KPIs and performance reviews still reward siloed behaviour (e.g., developers rewarded for features, operations for stability).
  • Technical tools (CI/CD) alone cannot fix cultural problems. True collaboration requires redesigning team structures, knowledge sharing protocols, and on-call policies.

Recommendation: Stop trying to fix the people and start re-engineering the systems they work in. This guide shows you how to dismantle the structures that create conflict and build a foundation for genuine collaboration.

As a Head of Engineering, you’ve likely witnessed the "standoff meeting" after a production incident. The development team points to infrastructure fragility, while the operations team highlights untested code pushed at the end of the day. This cycle of finger-pointing, the "blame game," is more than just frustrating; it’s a significant drag on productivity, a source of engineer burnout, and a clear sign that your DevOps initiative is merely a label, not a culture.

Many leaders try to solve this with familiar platitudes: "We need better communication," "Let’s buy a new observability tool," or "Everyone needs to own quality." Yet, the friction remains. This is because these are surface-level treatments for a deep, systemic issue. The conflict you’re seeing isn’t a failure of individual attitudes; it’s a predictable outcome of how your teams are structured, measured, and incentivised.

But what if the key wasn’t forcing people to "collaborate" more, but redesigning the environment so that collaboration becomes the path of least resistance? The true path to uniting your teams lies in dismantling the systemic friction points—the conflicting KPIs, the knowledge silos, the punishing on-call schedules—that pit them against each other. This article provides a strategic framework for you, as a leader, to move beyond the blame and engineer a culture of shared responsibility and continuous improvement.

This guide will walk you through the systemic changes required to dismantle the silos and build a truly unified team. We will explore how to diagnose hidden risks, align incentives, structure your teams for collaboration, and implement processes that foster psychological safety and growth.

Why is your "Bus Factor" dangerously low in the operations team?

The "Bus Factor" is a stark but effective measure of risk: how many key people could be hit by a bus before your project or system grinds to a halt? In many organisations, this number is terrifyingly low, especially within the operations team. You have one "database guru" or a single engineer who understands the legacy deployment scripts. This isn’t a sign of their genius; it’s a critical point of systemic failure waiting to happen. These knowledge silos are not just a risk; they are a primary source of friction. When only one person can solve a problem, they become a bottleneck, and any issue in their domain automatically creates a dependency that frustrates other teams.

This problem is widespread. Recent surveys reveal that 52.9% of teams struggle with reaching team members who possess specialized knowledge. This isn’t just an inconvenience; it’s a direct inhibitor of flow and a catalyst for blame. Developers, blocked by an unavailable specialist, grow frustrated. The specialist, overwhelmed with requests, becomes a gatekeeper. To fix this, you must increase "knowledge liquidity"—the ability for crucial information to flow freely throughout the team.

This involves moving away from reliance on individual heroes and building a system of shared understanding. Key strategies include paired programming on infrastructure tasks, mandatory documentation for any new service, and rotating on-call responsibilities. The goal is to make knowledge a shared asset, not a private possession. By systematically de-risking your team from individual dependencies, you not only improve resilience but also reduce a major source of inter-team tension. The "guru" can finally take a vacation, and the rest of the team is empowered to solve problems independently.
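The bus-factor risk described above can be made measurable. Here is a minimal Python sketch (the system names and knowledge map are hypothetical) that flags every system only one person understands:

```python
def bus_factor(ownership: dict[str, set[str]]) -> dict[str, int]:
    """For each system, count how many people understand it.

    A count of 1 means a single departure (or vacation) makes the
    system unmaintainable -- a bus factor of one.
    """
    return {system: len(people) for system, people in ownership.items()}

# Hypothetical knowledge map: which engineers understand which system.
ownership = {
    "payments-db": {"alice"},                    # single point of failure
    "deploy-scripts": {"alice", "bob"},
    "api-gateway": {"bob", "carol", "dmitri"},
}

factors = bus_factor(ownership)
at_risk = [system for system, n in factors.items() if n < 2]
print(at_risk)  # ['payments-db']
```

Feeding such a map from code-review history or documentation ownership makes the silos visible before an incident does.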

How to align KPIs so developers care about uptime?

One of the most powerful drivers of the blame game is the inherent conflict in traditional Key Performance Indicators (KPIs). Your development team is likely measured on velocity, feature shipment, and story points completed. Their incentives push them to move fast and introduce change. Meanwhile, your operations team is measured on uptime, stability, and mean time to resolution (MTTR). Their incentives push them to resist change and maintain stability. When you reward two groups for opposing goals, conflict is not just likely; it is guaranteed.

The solution is not to ask developers to "care more" about uptime. The solution is to make uptime a part of their success metrics. This is where concepts like Service Level Objectives (SLOs) and Error Budgets become transformative. An SLO is a specific, measurable target for reliability (e.g., 99.9% availability for the login service). The Error Budget is the inverse: the 0.1% of acceptable downtime. This budget is a shared resource. As long as the service is operating within its SLO, the development team has the "budget" to experiment, innovate, and deploy new features rapidly. However, if an incident causes the service to dip below its SLO, the error budget is "spent."

When the budget is exhausted, a pre-agreed policy kicks in: all new feature development halts, and the entire team (Devs and Ops) swarms on reliability and stability work until the budget is replenished. Suddenly, developers have a direct, quantifiable incentive to write reliable code and test it thoroughly. Uptime is no longer "Ops’ problem"; it’s a shared constraint that governs the pace of innovation. This shifts the conversation from "your code broke my server" to "our deployment consumed our error budget; how do we fix the system to earn it back?"
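The error-budget arithmetic is simple enough to sketch. This illustrative Python snippet (not any particular SRE tool) computes how much downtime a 99.9% SLO allows over a 30-day window, and what fraction of that budget an incident left unspent:

```python
def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Total allowed downtime in the window for a given SLO (e.g. 0.999)."""
    return (1 - slo) * window_minutes

def budget_remaining(slo: float, window_minutes: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = error_budget_minutes(slo, window_minutes)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over a 30-day window allows roughly 43.2 minutes of downtime.
window = 30 * 24 * 60  # 43,200 minutes
print(round(error_budget_minutes(0.999, window), 1))    # 43.2
# After a 30-minute outage, about 31% of the budget remains.
print(round(budget_remaining(0.999, window, 30.0), 2))  # 0.31
```

A pre-agreed policy then only needs one trigger: when `budget_remaining` hits zero, feature work stops and reliability work begins.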

Generalist or Specialist: Who drives better DevOps adoption?

The traditional model of IT departments, built on hyper-specialisation, is another pillar supporting the blame culture. You have a networking team, a database team, and a security team, each with deep expertise in one area and little visibility into others. When a problem occurs, it triggers a slow, sequential handoff between silos, with each team proving it’s not their fault before passing the ticket along. This structure is inherently inefficient and confrontational.

The DevOps model champions a different kind of engineer: the "T-shaped" professional. A T-shaped individual possesses deep expertise in one primary discipline (the vertical bar of the "T") but also maintains a broad, working knowledge of adjacent domains (the horizontal bar). A T-shaped developer, for instance, not only writes application code but also understands infrastructure as code, monitoring principles, and security basics. This doesn’t mean every developer needs to be a Kubernetes expert, but they must be "operations-aware." This structure is far more effective at fostering collaboration than a team of siloed specialists.

The challenge, as you know, is that these professionals are in high demand. Data shows that 37% of IT leaders cite a lack of DevOps and DevSecOps skills as a top technical gap. You cannot simply hire a team of T-shaped engineers; you must actively cultivate them. This means investing in cross-training, creating opportunities for engineers to rotate through different roles, and rewarding individuals who broaden their skill sets, not just deepen them. By valuing and building T-shaped capabilities, you create a team that can diagnose and solve problems holistically, rather than tossing them over a wall.

This table illustrates the fundamental impact on team dynamics.

T-Shaped vs Traditional Team Skills Distribution

Skill Model            | Primary Expertise        | Secondary Skills               | Team Impact
Traditional Specialist | Deep in one area (100%)  | Minimal (10-20%)               | Creates bottlenecks
T-Shaped Professional  | Deep in one area (80%)   | Broad in adjacent areas (60%)  | Enables collaboration
Pure Generalist        | Surface-level (40%)      | Wide coverage (40%)            | Lacks depth for complex issues

The scheduling error that causes your best engineers to quit within a year

Few things erode an engineer’s goodwill faster than a poorly designed on-call schedule. It’s a systemic issue often overlooked, but it is a primary driver of burnout and a significant contributor to the "us vs. them" mentality. When developers are repeatedly paged in the middle of the night for issues they cannot fix, or when the same operations engineer is the single point of failure for every critical alert, you are actively burning out your most valuable assets. The cognitive load of being constantly on-edge, context-switching, and sleep-deprived is immense.

This isn’t just about inconvenience; it’s about respect for your engineers’ time and mental health. A system that generates a high volume of non-actionable alerts or pages the wrong person demonstrates a fundamental lack of respect for their expertise. It creates resentment. The developer being woken up for a database issue they can’t diagnose feels helpless and angry. The operations engineer bearing the brunt of every alert feels unsupported and overwhelmed. Both are prime candidates for attrition.

As Kyle from the Chaos and Reliability Engineering Blog points out, the human cost is the real metric to watch. His reflection on the culture of being on-call highlights a crucial truth about engineer satisfaction.

Although the pay was handsome, the toll it takes on your mental health and physical health can sometimes be more demanding. The happiest SREs/DevOps/Platform Engineers are the ones that A.) Never get paged B.) get paged rarely C.) Working in a blameless culture and getting paged just means an interesting problem to solve

– Kyle (Chaos and Reliability Engineering Blog), Building a Blameless Culture & Managing Mental Health

Fixing this requires a systemic approach. First, ruthlessly audit your alerts. Every alert should be actionable, have a clear owner, and include a runbook. Implement "follow-the-sun" rotations if you have a global team. Ensure that on-call duties are fairly distributed and that engineers get adequate, uninterrupted time off to recover. A humane on-call schedule isn’t a perk; it’s a critical piece of infrastructure for a healthy engineering culture.
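The alert audit described above can be automated rather than done by hand. A minimal sketch, assuming a hypothetical inventory of alert definitions, might flag every alert that is non-actionable or is missing an owner or a runbook link:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    name: str
    owner: Optional[str]        # team accountable for responding
    runbook_url: Optional[str]  # documented steps for the responder
    actionable: bool            # does paging a human lead to a concrete action?

def audit(alerts: list) -> list:
    """Return names of alerts failing the three-question audit:
    is it actionable, does it have an owner, does it link a runbook?"""
    return [a.name for a in alerts
            if not (a.actionable and a.owner and a.runbook_url)]

# Hypothetical alert inventory.
alerts = [
    Alert("disk-90-percent", "platform-team",
          "https://wiki.example/runbooks/disk", True),
    Alert("cpu-spike-info", None, None, False),  # pure noise: page nobody
]
print(audit(alerts))  # ['cpu-spike-info']
```

Running a check like this in CI against your alerting config turns "every alert must be actionable" from a slogan into an enforced invariant.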

How to run a post-incident review that actually fixes the root cause?

The single most powerful tool for dismantling a blame culture is the blameless post-incident review (often called a post-mortem). However, most organisations get this wrong. Their reviews devolve into a search for a scapegoat, focusing on "who made the error" rather than "why did the system allow this error to happen?" A truly blameless review operates on a fundamental assumption: everyone involved had good intentions and made the best decision they could with the information and tools available at the time. The focus shifts from individual failure to systemic weakness.

This approach is a cornerstone of Site Reliability Engineering (SRE), famously pioneered by Google. It’s not about being "soft" on mistakes; it’s a pragmatic recognition that human error is inevitable. A robust system should be designed to tolerate it. Therefore, the goal of the review is to identify and fix the contributing factors in the system—be it a confusing user interface, a gap in monitoring, an ambiguous process, or a lack of safeguards.

Case Study: Google’s Blameless Post-Mortem Culture in SRE

To institutionalise a culture of learning from failure, Google’s SRE practice treats blamelessness as a non-negotiable principle. When an incident occurs, the resulting post-mortem report explicitly avoids naming individuals in a negative light. The analysis focuses on a timeline of events, the impact, the actions taken, and most importantly, the contributing factors across the entire system. The output is a list of concrete, prioritised action items aimed at improving the system’s resilience, not punishing a person. This practice creates the psychological safety necessary for engineers to be open and honest about what happened, ensuring the organisation learns the right lessons and becomes stronger.

To run an effective review, you must enforce a few key rules. Ban phrases like "should have" and "human error." Instead, ask "Why did this make sense at the time?" and "How could our tooling have made the right action easier?" Dig for multiple contributing factors rather than a single "root cause." A single cause is a comforting fiction; real-world failures are almost always a chain of small, interconnected events. By focusing on the system, you turn incidents from moments of blame into invaluable opportunities for collective learning and improvement.

How to restructure departments into squads without causing chaos?

If your organisational chart still shows separate "Development" and "Operations" departments, you are structurally reinforcing the silo you want to break down. The physical and managerial separation of these teams creates a boundary that fosters an "us vs. them" mentality. To truly unite them, you must restructure for shared ownership. The most effective model for this is the cross-functional "squad" or "product team."

A squad is a small, autonomous, and long-lived team that contains all the skills necessary to deliver and operate a specific product or service. This typically includes developers, an operations engineer (or an SRE), a QA analyst, and a product owner. This team is not a temporary project group; they are a durable unit responsible for the full lifecycle of their service, from "you build it" to "you run it." This structure inherently aligns incentives. When the same team that writes the code is also woken up by alerts when it breaks in production, they naturally start building more resilient and operable software.

However, a "big bang" reorganisation can be disruptive and create chaos. The key is to start small, with a pilot program. Select a single, high-impact but non-critical service and form your first cross-functional squad. Give them a clear mission, the autonomy to make decisions, and the support to succeed. This pilot serves as a social-proof engine for the rest of the organisation. When other teams see the pilot squad moving faster, shipping more reliable code, and having a higher morale, they will want to adopt the model themselves. This creates a pull for change, rather than a top-down push that meets resistance.

Your Action Plan: Pilot Squad Implementation Roadmap

  1. Project Selection (Weeks 1-2): Choose one high-impact, low-risk project as a pilot. Ideally, it should be customer-facing but not mission-critical to the entire business.
  2. Team Formation (Weeks 3-4): Assemble a cross-functional squad with volunteers. Ensure it includes representation from development, operations, QA, and product leadership.
  3. Define the "Team API" (Weeks 5-8): Task the squad with documenting their mission, primary communication channels (e.g., dedicated Slack channel), key success metrics (SLOs), and protocols for how other teams should engage with them.
  4. Execute and Iterate (Weeks 9-12): Let the pilot run for a full quarter. Mandate weekly retrospectives to allow the team to adjust its own processes and overcome initial friction.
  5. Showcase and Scale (Week 13): At the end of the pilot, have the squad present their results, metrics, and lessons learned to senior leadership and other teams to build momentum for a broader rollout.

How to stop your retrospectives from becoming "moaning sessions"?

Retrospectives are a cornerstone of any agile or DevOps culture, intended to be a regular opportunity for a team to reflect and improve. Yet, for many teams, they become dreaded, unproductive "moaning sessions." The same unresolved complaints are raised sprint after sprint, no meaningful actions are taken, and the team leaves feeling more cynical than energised. When your primary feedback loop is broken, continuous improvement is impossible, and frustration festers, feeding the blame culture.

The primary cause of a failed retrospective is a lack of structure and a failure to focus on actionable outcomes. A vague prompt like "What went wrong?" invites complaining. To fix this, you must introduce structured formats that guide the conversation toward constructive analysis and concrete actions. It’s also crucial to ensure psychological safety. If team members fear being blamed for bringing up a difficult topic, they will remain silent, and the most important issues will never be addressed.

To break the monotony and encourage different perspectives, rotate through various retrospective formats. This prevents the meeting from becoming stale and forces the team to think about their work in new ways. Here are five formats to introduce:

  • Sailboat Retro: The team identifies "winds" (what propels us forward), "anchors" (what holds us back), "rocks" (potential risks we see ahead), and the "island" (our ultimate goal or destination).
  • 4 Ls Retro: A simple and effective format where participants document what they Liked, Learned, Lacked, and Longed For during the sprint.
  • Timeline Retro: The team collaboratively builds a visual timeline of the sprint, plotting key events (deployments, incidents, meetings) and marking their emotional highs and lows. This helps identify patterns.
  • Starfish Retro: Participants categorise potential actions into five areas: Keep Doing, More Of, Less Of, Stop Doing, and Start Doing. This is highly action-oriented.
  • Appreciation Retro: Occasionally, dedicate the first half of the session purely to recognising and appreciating the contributions of teammates. This builds trust and positive momentum before tackling challenges.

Critically, every retrospective must end with a small number (1-3) of clear, assigned, and time-boxed action items. These items should be treated with the same seriousness as any other task in the next sprint backlog. By demonstrating that feedback leads to tangible change, you restore faith in the process and turn moaning sessions into engines of improvement.

Key Takeaways

  • The "blame game" is a systemic problem, not a personal one. Fix the system, not the people.
  • Aligning incentives with shared metrics like Service Level Objectives (SLOs) and Error Budgets is more effective than asking people to "collaborate".
  • A blameless culture is not about avoiding accountability; it’s about shifting focus from individual error to systemic weakness to enable organisational learning.

How to implement iterative DevOps cycles for continuous improvement?

You’ve diagnosed the knowledge silos, realigned KPIs, and restructured into squads. You’re running blameless post-mortems and productive retrospectives. The final piece of the puzzle is to embed this into a virtuous cycle of continuous improvement. The end of the blame game is not a static destination; it is a dynamic state maintained by a relentless focus on learning and iteration. A true DevOps culture is never "done." It is a continuous process of sensing, responding, and evolving.

This means formalising the feedback loops you’ve created. Action items from post-mortems and retrospectives must be fed directly back into the development backlog. They should be prioritised alongside new features. If a systemic issue is identified, fixing it should be considered as valuable—or more valuable—than shipping the next product enhancement. This commitment must be visible and championed by you, the engineering leader. When your teams see that reliability and process improvement work are treated as first-class citizens, the culture shifts permanently.

Mature organisations take this a step further, moving from a reactive to a proactive stance. Instead of just learning from failures, they actively seek to uncover weaknesses before they impact customers. This is the domain of practices like Chaos Engineering, where teams intentionally inject controlled failures into their systems to test resilience and discover unforeseen dependencies. This is the ultimate expression of a blameless, learning culture—where failure is not something to be feared, but a tool to be wielded for improvement. The impact is profound; organisations implementing continuous improvement cycles report seeing up to 200 times faster lead times for changes, demonstrating a clear link between cultural health and business performance.

Case Study: Netflix’s Proactive Improvement via Chaos Engineering

Netflix popularised the concept of "GameDays," where engineering teams run planned resilience tests to identify bottlenecks and availability problems through controlled load and failure injection. This practice embodies a proactive approach to improvement. As Jesse Robbins, a key figure in the movement, noted: "you cannot choose whether or not to have failures, they will happen no matter what you do… you can choose when you will learn the lessons." By choosing to learn during business hours in a controlled manner, Netflix avoids learning those same lessons at 3 AM during a critical outage, solidifying a culture of proactive resilience.
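The core idea behind controlled failure injection can be illustrated in a few lines. This toy Python sketch (not Netflix’s actual tooling, and the function names are illustrative) wraps a dependency so it fails with a configurable probability, letting you verify that your retry logic actually copes:

```python
import random

def inject_failures(fn, failure_rate=0.2, seed=None):
    """Wrap fn so it raises a simulated fault with the given probability --
    a toy stand-in for a controlled chaos experiment."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("chaos: injected fault")
        return fn(*args, **kwargs)
    return wrapped

def call_with_retries(fn, attempts=3):
    """Client code under test: retries transient faults before giving up."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise

# Verify that the retry logic survives a 50% injected failure rate.
flaky_fetch = inject_failures(lambda: "ok", failure_rate=0.5, seed=42)
print(call_with_retries(flaky_fetch))
```

Running such an experiment in a staging environment, during business hours, is precisely the "choose when you learn the lesson" trade the quote describes.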

As the Head of Engineering, you are the chief architect of your organisation’s operating system. By systematically dismantling the structures that breed conflict and thoughtfully designing new ones that reward collaboration, you can permanently end the blame game. The next step is to identify the single biggest point of systemic friction in your own team and start the conversation about redesigning it.

Written by Sarah Jenkins. Sarah is a seasoned Digital Transformation Director specializing in organizational agility and hybrid workforce management. Holding an MBA from the London School of Economics, she has guided FTSE 250 companies through complex restructuring phases. With over 15 years of experience, she helps leaders navigate the shift from strict hierarchies to autonomous, high-performing squads.