Spotlight on SIG Architecture: Production Readiness

By Frederico Muñoz (SAS Institute) | Thursday, November 02, 2023

This is the second interview of a SIG Architecture Spotlight series that will cover the different subprojects. In this blog, we will cover the SIG Architecture: Production Readiness subproject .

In this SIG Architecture spotlight, we talked with Wojciech Tyczynski (Google), lead of the Production Readiness subproject.

About SIG Architecture and the Production Readiness subproject

Frederico (FSM): Hello Wojciech, could you tell us a bit about yourself, your role and how you got involved in Kubernetes?

Wojciech Tyczynski (WT): I started contributing to Kubernetes in January 2015. At that time, Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office (in addition to already existing teams in California and Seattle). I was lucky enough to be one of the seeding engineers for that team.

After two months of onboarding and helping with different tasks across the project towards 1.0 launch, I took ownership of the scalability area and I was leading Kubernetes to support clusters with 5000 nodes. I’m still involved in SIG Scalability as its Technical Lead. That was the start of a journey since scalability is such a cross-cutting topic, and I started contributing to many other areas including, over time, to SIG Architecture.

FSM: In SIG Architecture, why specifically the Production Readiness subproject? Was it something you had in mind from the start, or was it an unexpected consequence of your initial involvement in scalability?

WT: After reaching that milestone of Kubernetes supporting 5000-node clusters , one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While non-scalable implementation is always fixable, designing non-scalable APIs or contracts is problematic. I was looking for a way to ensure that people are thinking about scalability when they create new features and capabilities without introducing too much overhead.

This is when I joined forces with John Belamaric and David Eads and created a Production Readiness subproject within SIG Architecture. While setting the bar for scalability was only one of a few motivations for it, it ended up fitting quite well. At the same time, I was already involved in the overall reliability of the system internally, so other goals of Production Readiness were also close to my heart.

FSM: To anyone new to how SIG Architecture works, how would you describe the main goals and areas of intervention of the Production Readiness subproject?

WT: The goal of the Production Readiness subproject is to ensure that any feature that is added to Kubernetes can be reliably used in production clusters. This primarily means that those features are observable, scalable, supportable, can always be safely enabled and in case of production issues also disabled.

Production readiness and the Kubernetes project

FSM: Architectural consistency being one of the goals of the SIG, is this made more challenging by the distributed and open nature of Kubernetes ? Do you feel this impacts the approach that Production Readiness has to take?

WT: The distributed nature of Kubernetes certainly impacts Production Readiness, because it makes thinking about aspects like enablement/disablement or scalability more challenging. To be more precise, when enabling or disabling features that span multiple components you need to think about version skew between them and design for it. For scalability, changes in one component may actually result in problems for a completely different one, so it requires a good understanding of the whole system, not just individual components. But it’s also what makes this project so interesting.

FSM: Those running Kubernetes in production will have their own perspective on things, how do you capture this feedback?

WT: Fortunately, we aren’t talking about “them” here, we’re talking about “us”: all of us are working for companies that are managing large fleets of Kubernetes clusters and we’re involved in that too, so we suffer from those problems ourselves.

So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely reveals completely new problems - it rather shows the scale of them. And we try to react to it - changes like “Beta APIs off by default” happen in reaction to the data that we observe.

FSM: On the topic of reaction, that made me think of how the Kubernetes Enhancement Proposal (KEP) template has a Production Readiness Review (PRR) section, which is tied to the graduation process. Was this something born out of identified insufficiencies? How would you describe the results?

WT: As mentioned above, the overall goal of the Production Readiness subproject is to ensure that every newly added feature can be reliably used in production. It’s not possible to enforce that by a central team - we need to make it everyone’s problem.

To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe enablement, scalability, observability, supportability, etc. from the very beginning. Which means not when the implementation starts, but rather during the design. Given that KEPs are effectively Kubernetes design docs, making it part of the KEP template was the way to achieve the goal.

FSM: So, in a way making sure that feature owners have thought about the implications of their proposal.

WT: Exactly. We already observed that just by forcing feature owners to think through the PRR aspects (via forcing them to fill in the PRR questionnaire) many of the original issues are going away. Sure - as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are better now than they used to be a couple of years ago in what concerns thinking about productionisation aspects, which is exactly what we wanted to achieve - spreading the culture of thinking about reliability in its widest possible meaning.

FSM: We’ve been talking about the PRR process, could you describe it for our readers?

WT: The PRR process is fairly simple - we just want to ensure that you think through the productionisation aspects of your feature early enough. If you do your job, it’s just a matter of answering some questions in the KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you didn’t think about those aspects earlier, it may require spending more time and potentially revising some decisions, but that’s exactly what we need to make the Kubernetes project reliable.

Helping with Production Readiness

FSM: Production Readiness seems to be one area where a good deal of prior exposure is required in order to be an effective contributor. Are there also ways for someone newer to the project to contribute?

WT: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch potential issues. Kubernetes is such a large project now with so many nuances that people who are new to the project can simply miss the context, no matter how senior they are.

That said, there are many ways that you may implicitly help. Increasing the reliability of particular areas of the project by improving its observability and debuggability, increasing test coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note that the PRR subproject is focused on keeping the bar at the design level, but we should also care equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so having people there who are aware of productionisation aspects, and who deeply care about it, will help the project a lot.

FSM: Thank you! Any final comments you would like to share with our readers?

WT: I would like to highlight and thank all contributors for their cooperation. While the PRR adds some additional work for them, we see that people care about it, and what’s even more encouraging is that with every release the quality of the answers improves, and questions “do I really need a metric reflecting if my feature works” or “is downgrade really that important” don’t really appear anymore.