October 25, 2023

Launching Chkk Operational Safety Platform

Written by

Awais Nemat

Today, my team and I are excited to publicly launch the Chkk Operational Safety Platform. We want to thank the early customers and design partners who have worked closely with us since February, when we opened our waitlist to anyone interested in proactively addressing infrastructure errors, disruptions, and failures.

I'm humbled that our customers love Chkk, which is already used by enterprises across various industry verticals. Thank you for helping us validate our thesis and develop our product.

Our core thesis

While working at AWS, we observed a recurring pattern: different enterprises, at different points in time, experienced the same errors, failures, and disruptions due to the same root causes. Every company reactively responded to the same set of issues that other companies had already dealt with. There was no easy way for any of them to find out, a priori, about known Operational Risks lurking in their infrastructure that can trigger incidents leading to downtime. We realized that there was an opportunity to help our future customers.

Our thesis was:

Customers care about availability and want to proactively prevent errors and not wait until after the impact, which wastes time and effort, and risks reputation and credibility.
If an Operational Risk has already materialized into a disruption somewhere in the world, it is highly likely that it will materialize over and over again in many enterprises, and cause operational pain and loss.
Customers want to learn and not repeat a mistake that has already caused others harm, but there is no simple, automated, and trusted way for them to learn from each other and avoid known risks.

That's where Chkk comes in.

We took inspiration from cybersecurity, where security vulnerabilities are reported publicly, and came up with this simple idea: if an error, failure, or disruption has happened anywhere in the world, we will learn about it. We’ll convert it into a Risk Signature, similar to a virus signature, and stream it to all our customers, whose environments are then scanned against it. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause disruptions, much like antivirus software detects and removes viruses before they cause harm.
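To make the antivirus analogy concrete, here is a minimal sketch of matching signatures against an environment. It is an illustration under stated assumptions, not Chkk's implementation: the `RiskSignature` class, the `scan` function, the signature ID, and the component names and versions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RiskSignature:
    """Hypothetical signature: an ID, a description, and a predicate over one component."""
    sig_id: str
    description: str
    matches: Callable[[dict], bool]


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.8.4' into (1, 8, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def scan(components: list[dict], signatures: list[RiskSignature]) -> list[tuple[str, str]]:
    """Return (component name, signature ID) for every signature that matches a component."""
    return [
        (component["name"], signature.sig_id)
        for component in components
        for signature in signatures
        if signature.matches(component)
    ]


# Made-up signature: flags an imaginary add-on below an imaginary fixed version.
SIGNATURES = [
    RiskSignature(
        sig_id="RSIG-EXAMPLE-1",
        description="Illustrative only: 'dns-addon' older than 1.9.0",
        matches=lambda c: c["name"] == "dns-addon"
        and version_tuple(c["version"]) < (1, 9, 0),
    ),
]

# Made-up snapshot of what is running in one environment.
COMPONENTS = [
    {"name": "dns-addon", "version": "1.8.4"},
    {"name": "ingress-addon", "version": "2.1.0"},
]

findings = scan(COMPONENTS, SIGNATURES)
```

In this sketch a signature is just data plus a predicate, so new signatures can be streamed to every environment without changing the scanner itself.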

With Chkk, our customers learn about Operational Risks from an authoritative source and proactively prevent these incidents from happening altogether.

How the Chkk Operational Safety Platform works

Our first product is a SaaS offering designed for organizations running mission-critical applications on Kubernetes infrastructure. We help them reduce Operational Risks, prevent errors and disruptions, and operate Kubernetes safely and efficiently. Beyond identifying and prioritizing risks, we provide Preverified Upgrade Plans, so customers can cut weeks of preparation down to days and safely remediate these risks without worrying about the intricate interdependencies involved in fixing them.

There are three distinct modules in the Chkk Operational Safety Platform.

Upgrade Copilot is especially valuable for Platform, DevOps, and SRE Engineers responsible for planning and executing infrastructure upgrades. We provide Preverified Upgrade Plans containing a detailed sequence of steps that need to be executed for remediation. We can then optionally execute that sequence on a digital twin of the customer's infrastructure to validate that the plan works as expected. This significantly reduces the time and effort required to plan these upgrades and derisks the execution of this critical task for our customers.
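The idea of running a plan against a digital twin before touching the real system can be sketched as follows. This is a toy model under loud assumptions: the "twin" is just a deep copy of a dictionary of component versions, and `Step`, `preverify`, `set_version`, and all component names and versions are hypothetical rather than part of Chkk's product.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """Hypothetical plan step: a name, a precondition check, and a state change."""
    name: str
    precondition: Callable[[dict], bool]
    apply: Callable[[dict], None]


def preverify(plan: list[Step], live_state: dict) -> tuple[bool, Optional[str]]:
    """Run the plan on a copy of the state; report the first step whose precondition fails."""
    twin = copy.deepcopy(live_state)  # the live state is never modified
    for step in plan:
        if not step.precondition(twin):
            return False, step.name
        step.apply(twin)
    return True, None


def set_version(component: str, version: str) -> Callable[[dict], None]:
    """Build a state change that sets one component to a given version."""
    def _apply(state: dict) -> None:
        state[component] = version
    return _apply


# Made-up environment and plan: the add-on must be upgraded before the control plane.
live = {"control-plane": "1.27", "dns-addon": "1.8"}
plan = [
    Step("upgrade dns-addon",
         lambda s: s["control-plane"] == "1.27",
         set_version("dns-addon", "1.9")),
    Step("upgrade control-plane",
         lambda s: s["dns-addon"] == "1.9",
         set_version("control-plane", "1.28")),
]

ok, failed_step = preverify(plan, live)
```

Running the same steps in the wrong order fails at the first step in this sketch, which is the kind of ordering mistake a dry run on a twin is meant to surface before execution.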

Artifact Register maintains an inventory of all components, container images, repositories, and tools across multiple clusters and clouds. It gives our customers visibility into what exists where, replacing the manual, error-prone tracking in spreadsheets and scripts that many of them rely on today.
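As a rough illustration of what "visibility into what exists where" can look like as data, the sketch below groups container images by repository and tag across clusters. The cluster names, registry host, and the `split_image` and `build_inventory` helpers are all hypothetical, and image digests (`@sha256:...`) are deliberately not handled.

```python
from collections import defaultdict


def split_image(image: str) -> tuple[str, str]:
    """Split 'repo:tag' into (repo, tag); untagged images default to 'latest'."""
    repo, sep, tag = image.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port, not a tag.
    if not sep or "/" in tag:
        return image, "latest"
    return repo, tag


def build_inventory(clusters: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    """Map repository -> tag -> sorted list of clusters running that image."""
    grouped = defaultdict(lambda: defaultdict(set))
    for cluster, images in clusters.items():
        for image in images:
            repo, tag = split_image(image)
            grouped[repo][tag].add(cluster)
    return {
        repo: {tag: sorted(names) for tag, names in tags.items()}
        for repo, tags in grouped.items()
    }


# Made-up fleet: two clusters running overlapping images.
clusters = {
    "prod-us": ["registry.example/app:1.2", "registry.example/proxy:3.0"],
    "prod-eu": ["registry.example/app:1.2", "registry.example/app:1.3"],
}

inventory = build_inventory(clusters)
```

With this shape, a question such as "which clusters still run `app:1.2`?" becomes a dictionary lookup, `inventory["registry.example/app"]["1.2"]`, instead of a spreadsheet search.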

Risk Ledger is similar to security risk ledgers, but tailored specifically towards identifying contextualized Operational Risks within Kubernetes infrastructures. It enables our customers to become proactive in addressing potential failures before they happen.

All modules of Chkk seamlessly integrate with existing workflows and tools (IaC, packaging, deployment, monitoring, ticketing, and alerting) and simplify existing operational processes.

Powered by Collective Learning

Many of our customers ask us: how do you learn about all the issues and Operational Risks? How do you make sure that your remediations and Upgrade Plans are safe to execute? How do you manage these intractable problems? What’s the magic?

The magic is our Collective Learning Technology.

At the heart of Collective Learning is the Risk Signature Database, or RSig DB. Think of it as a CVE database for Availability Risks, along with a Knowledge Graph that captures all the relationships across different artifacts – issues, release notes, and any and all breaking changes.
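One simple way to picture a graph that links artifacts, release notes, breaking changes, and issues is an adjacency map with a traversal over it. The sketch below is only that picture: the node names, the `EDGES` map, and the `related` function are hypothetical and say nothing about how the RSig DB or Knowledge Graph are actually stored.

```python
from collections import deque

# Hypothetical edges: each node points at the nodes it is directly related to.
EDGES = {
    "component:dns-addon@1.8": ["release-note:dns-addon-1.9"],
    "release-note:dns-addon-1.9": ["breaking-change:config-key-renamed"],
    "breaking-change:config-key-renamed": ["issue:example-123"],
    "component:ingress-addon@2.1": [],
}


def related(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first traversal: every node reachable from `start`, in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


context = related("component:dns-addon@1.8", EDGES)
```

Starting from a component version, the traversal gathers the release note, the breaking change it announces, and the issue tied to that change, which is the kind of context a risk finding needs.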

On the backend, our technology continuously sources and populates this RSig DB and Knowledge Graph from multiple sources. First and foremost, we mine the internet for publicly available information – incidents, reports, tickets, issues, and discussions on internet forums. We scour everything where we can find a signal. Our research team then validates these candidates and converts them into programmatic signatures that can later be scanned and contextualized against a customer’s infrastructure.

We also ingest release notes, breaking changes, and bug report feeds from Kubernetes add-on vendors and open-source projects into our RSig DB and Knowledge Graph. And of course, we also learn from our users. We continuously add these learnings to the RSig DB and Knowledge Graph, which become more valuable to our customers over time.

All Chkk modules use the RSig DB and Knowledge Graph to identify and prioritize risks, and to locate them with pinpoint accuracy within a Kubernetes fleet. We also use them to create and preverify the Upgrade Plans that our customers use to remediate these issues.

We’ve taken time and care to build and fine-tune the technology to prioritize and address the right risks. Our customers appreciate that we offer concise, actionable plans to resolve the most critical risks, rather than burdening them with an exhaustive list of unnecessary ones.

A bright future ahead

In order to build a future powered by Collective Learning, Chkk has raised $5.2 million in seed funding from angels and VCs led by Sequoia Capital. We are grateful that Sequoia believes in our mission and is joining us in democratizing the wisdom of operating software at scale.

We have built the Chkk Operational Safety Platform for our customers running mission-critical apps on Kubernetes infrastructure. It helps Platform, DevOps, and SRE teams proactively manage and remediate risks, execute safe upgrades, eliminate wasted effort, and accomplish more with fewer resources.

The Chkk Operational Safety Platform is available today – it installs in minutes and integrates into your existing tools and workflows. Please sign up to get started.
