Scaling Reliability: DevOps Insights from Caio Pessoa

What does it take to keep modern platforms running reliably at scale? We sat down with one of our Site Reliability/DevOps consultants, Caio Pessoa, to discuss his journey from network engineering to platform engineering, the challenges of supporting large-scale environments, and the growing importance of observability in today's technology landscape.
From designing monitoring platforms and reducing operational costs to fostering collaboration and shared ownership across teams, he shares his perspectives on reliability, engineering culture, and the future of DevOps.
Can you share a bit about your journey before joining agap2 and how it prepared you for your current role?
My career started in network engineering, across some pretty different industries: high-frequency trading (HFT), streaming media, and online retail. Each one had its own flavor of "reliability is non-negotiable." In HFT, milliseconds matter, and downtime is simply not an option. In streaming, you're dealing with scale and real-time delivery to massive audiences. In retail, it's about consistency and uptime across distributed systems. That range of environments pushed me to think deeply about observability and system health before I even had the SRE job title. Transitioning into DevOps felt like a natural evolution: I was already asking the same questions, just with different tools.
As a consultant, what does your day-to-day work usually involve when supporting a project, and how have you applied that approach while working with our company?
Day-to-day, it's a mix of platform engineering, documentation, stakeholder communication, and hands-on technical work. Honestly, that variety is one of the things I enjoy most. What stays constant is the mindset: understand the problem first, build something that lasts, and always leave clear documentation behind. The consultant mindset helps me stay structured: I always ask, "Who is this for, what problem does it solve, and how do we document it so it outlives my presence on the project?"
In your current project, where do you see the biggest impact of your work as a Site Reliability/DevOps Engineer?
Probably the observability platform itself. We're building and maintaining a centralized platform based on the LGTM stack (Loki, Grafana, Tempo, Mimir) that multiple business teams across the organization rely on. When the platform works well, every team that ships software benefits, even if they never think about it. I'm also involved in a concrete cost-reduction effort: migrating AKS monitoring from Log Analytics to Azure Managed Prometheus. It's the kind of work that doesn't always get the spotlight, but it has a real impact on both performance and budget. I also invest heavily in documentation: well-maintained Confluence pages mean knowledge doesn't leave with the engineer.
DevOps often involves balancing speed and stability—how do you approach that in practice?
I think observability is key to avoiding the need to choose between them. If you can see what's happening in your system in real time, you can move fast with confidence.
My approach is to treat monitoring as part of the feature, not something you add after. Before something goes to production, I want to know how we'll know if it's broken.
That mindset — "instrument first, ship second" — lets the team move quickly without flying blind. I also believe in progressive rollouts and clear runbooks — speed is fine as long as you know how to roll back. Stability doesn't mean slow; it means deliberate.
What kind of challenges do you encounter most often in your current environment?
Honestly, a lot of them are at the intersection of technology and communication. The technical challenges are fun — debugging an authentication issue in Grafana, replacing a deprecated tool without breaking pipelines, and figuring out why a metric isn't showing up.
But the trickier part is translating that complexity for stakeholders who just want to know: "Is my service healthy?"
Getting that bridge right between deep technical work and clear, useful communication is something I work on constantly.
What do you think makes a DevOps culture successful within a team or organization?
Shared ownership and honesty. DevOps breaks down when people treat it as a handoff process rather than a shared responsibility. The teams I've seen work best are the ones where everyone cares about reliability, not just the ops side. I also think psychological safety matters a lot — people need to feel comfortable raising problems early, before they become incidents. And good documentation is underrated as a cultural signal: it shows the team respects the people who come after them.
If you look ahead a few years, how would you like your role or expertise to evolve?
I'd like to continue growing on the technical side, but increasingly in a direction that lets me also influence how teams and organizations think about reliability and platform engineering as a whole. I'm genuinely excited about where the observability space is heading, and I want to be someone who helps shape that direction — not just execute on it. Whether that means mentoring engineers, driving architectural decisions, or contributing to engineering culture, I see leadership as a natural next step alongside the technical depth I'm continuing to build.





