We’ve worked together for more than a decade—at AWS and on the team that built Amazon Elastic Kubernetes Service (EKS)—and have seen firsthand the challenges Platform, SRE, and DevOps teams face running Cloud Native infrastructure at scale. Chkk was born out of those experiences.
At AWS, while operating Amazon EKS, Fawad Khaliq noticed the same classes of errors, disruptions, and failures repeating across customers, especially in customer-owned platform layers, with no practical way for teams to learn from one another. Meanwhile, Ali Khayam and Awais Nemat were operating network infrastructure services with “always available, no downtime” expectations, where a proactive approach was essential to keep mission-critical systems running across one of the largest infrastructure footprints on the planet.
We saw a recurring challenge: modern Cloud Native environments are deeply interconnected and fragile. Running hundreds of services and applications with tight interdependencies means even the smallest change can ripple into major disruptions. Most teams are forced into a reactive posture, firefighting after issues arise. Automation to make changes safely is limited, and there’s no reliable way to validate upgrades or configuration changes ahead of time. Engineers end up spending weeks—or even months—researching, planning, and rehearsing just to minimize the risks of what should be routine infrastructure updates.
This inspired us to start Chkk and make the collective wisdom of operating infrastructure at scale available to everyone. The internet holds a goldmine of knowledge about how modern infrastructure really works, hidden in official documentation, source code, changelogs, issues, blogs, and forums. Yet this knowledge isn’t organized for machines to act on, and humans can’t keep up with the flood of updates and discussions. Too often, vital lessons remain buried, leaving teams to repeat the same mistakes.
Our mission is to enable engineering teams to ship infrastructure changes safely and with confidence, without repeating known mistakes or reintroducing known risks.