This episode we speak with Kelsey Hightower. Kelsey is a Principal Developer Advocate at Google. Topics include: Promise Theory, is Kubernetes hard, running databases on Kubernetes, the meat cloud, empathy sessions, how Kubernetes has helped standardize Ops practices, learning from failure at scale at Google, and the importance of the Inclusion part of D&I.
This episode we speak with Kolton Andrus, the CEO and co-founder of Gremlin. Topics include: The role of a Call Leader in incidents, using Chaos Engineering as runtime validation, FIT and application level fault injection, Jesse Robbins and early experiments at Amazon, oncall training, Lineage Driven Fault Injection (LDFI), the value of looking at real traffic instead of synthetic transactions, and the challenges people face when starting to do Chaos Engineering.
This episode we speak with Haley Tucker. Haley is a Senior Software Engineer on the Resilience Engineering team at Netflix. Topics include: Running Chaos Engineering experiments as A/B tests, testing dependencies, fallbacks, testing in production, and why Chaos Monkey is less interesting at Netflix now.
This episode we speak with Matthew Simons. Matthew is a Senior Product Development Manager at Workiva and he leads the Quality Assessment team there. Topics include: Supporting and encouraging reliability at Workiva, why Workiva moved from App Engine to EKS, how to tighten the customer feedback loop, how Chaos Engineering can help folks who are oncall, and fatal optimism and the asteroid that may hit the Earth in the year 2181 (it’s real y’all).
This episode we speak with Subbu Allamaraju. Subbu is a Senior Technologist at the Expedia Group. Topics include: Learning from incidents, changing culture, Why Complex Systems Fail, drifting into failure, forming a hypothesis, showing value from your reliability work, and the importance of understanding how your business makes money.
This episode we speak with Adrian Hornsby, a Senior Tech Evangelist at Amazon Web Services. Topics include: Curiosity and breaking things, the cost of downtime, Jesse Robbins and early failure injection at Amazon, making the case to management for Chaos Engineering, forming a hypothesis, and random experiments vs Game Days.
In this episode we speak with Caroline Dickey, a Site Reliability Engineer at Mailchimp. Topics include: Having customer empathy, rolling out a Chaos Engineering program, and some of the Game Days that the Mailchimp team has conducted, including experiments on Kubernetes and etcd.
This episode we speak with Paul Osman, who leads the Site Reliability Engineering team at Under Armour. Topics include: Paul’s beginnings in Chaos Engineering at 500px, Reliability Engineering at PagerDuty, bootstrapping the Chaos Engineering program at Under Armour, patterns for reliable distributed systems, and tips for spinning up a Chaos Engineering program.
This episode we speak with Michael Kehoe, a Staff Site Reliability Engineer at LinkedIn. Topics include: Site Reliability Engineering, building satellites at NASA, LinkedIn’s Chaos Engineering project called Waterbear, using Chaos Engineering to test autoscaling, running Chaos Engineering experiments as regression tests in a release pipeline, and tips for starting a Chaos Engineering practice at your company.
Welcome to our first episode of Break Things On Purpose, a podcast about Chaos Engineering. Our guests are Tammy Butow, Principal SRE at Gremlin, and Ana Medina, a Chaos Engineer at Gremlin. Topics include: What is Chaos Engineering? Planning an experiment and Blast Radius. Using Chaos Engineering to improve monitoring and onboarding. Game Days and sharing knowledge. Using Chaos Engineering to learn about your systems. Driving adoption of Chaos Engineering in your company.