Grid Bugs

Oh, no! We were several hours into a major system outage and there was still no clue as to what was broken. The webservers were running at full load and the applications were pumping a constant stream of error logs to disk. Systems and application engineers were frantically looking through the dizzying logs for clues as to the cause. Of course, looking at the logs, you would assume everything was broken, and it was. But even when the application worked, the logs were full of indecipherable errors. Everyone knew that most of the “errors” in the logs weren’t really errors, but untidy notices that developers had created long ago as part of a debugging exercise. As one engineer observed in some degree of frustration, “It’s like the log file that cried wolf!” After a while, nobody notices the errors.

The teams restarted services, rebooted systems, stopped and restarted load balancers. Nothing helped. Network engineers dug into the configuration of the routers and switches to make sure nothing was amiss. Except for the occasional keyboard typing sounds, dogs barking or children crying in the background, the intense investigation had produced an uncanny silence on the call. Operation center specialists were quickly crafting their communication updates and discussing with the incident commander how to update the many clients impacted by this outage. Company leaders and members of the board of directors were calling in to get updates. Stress was high. Would we ever find the cause or should we just shut down the company now and start over? Fatigue was setting in. Tempers were starting to show. Discussion ensued on the conference call to explore all mitigation options and next steps.

“I found it!” The discussion on the call stopped. Everyone perked up, anxious to hear the discovery. “What did you find?” the commander asked in a hopeful way. The giddy engineer took center stage on the call, eager to tell the news. “It’s the inventory service! The server at the fulfillment center seems to be intermittently timing out. Transactions are getting stuck in the queue.” The engineer paused, clearly typing away at some commands on his computer. “I think we have a routing problem. I try to trace it but it seems to bounce around and disappear. Sometimes it works, but to complete the transaction, multiple calls are required and too many of them are failing. I’m chatting with the fulfillment center and they report the inventory system is running.”

The engineer sent the traceroute to the network engineer who started investigating and then asked, “Can you send me the list of all the addresses used by the inventory system?”  After some back and forth, the conclusion came, “I found the problem! There are two paths to the fulfillment center, one of which goes through another datacenter. That datacenter link looks up but it is clearly not passing traffic.” After more typing, the conclusion, “Ah, it seems the telco made a routing change. I’m getting them to reverse it now.” Soon the change was reversed and transactions were flowing again. The dashboards cleared and “green” lights came back on. Everyone on the bridge quietly, and sometimes not so silently, celebrated and felt an incredible emotional relief. Sure, there would be more questions, incident review and learning, but solving the problem was exhilarating.

How many of you can relate to a story like that? How many of you have been on that call?

A friend of mine, Dr. Steven Spear at MIT, often reminds us that the key to solving a problem is seeing the problem. You can’t solve what you cannot see. A big part of reliability engineering and systems dynamics is understanding how we gain visibility into problems and surface them so they can be addressed. Ideally, we find those weaknesses before they cause real business impact. That is often the attraction of chaos engineering, poking at fault domains to expose fractures that could become outages. But sometimes the issue is so complex that we just need a clear line of sight into the problem. In the story above, connectivity and those dependent links were not clearly visible. If there had been some way to measure the foundational connectivity between the dependent locations, our operational heroes could have quickly seen it, fixed it, and gone back to sleep. Getting that visibility in advance is the right thing to do for our business, our customers and our teams.

This past weekend, I found myself itching to code and tinker around with some new tech. The story above is one I have seen repeated multiple times. We often have limited visibility into point-to-point connectivity across our networks and vendors. Yet we have this grid of dependency that is needed to deliver our business-powering technology. I know what you are thinking. There are millions of tools that do that. I found some and they were very elaborate and complicated, way more than what I wanted to experiment with. I finally had my excuse to code. I wanted to build a system to synthetically monitor all these links. Think of an instance in one datacenter or cloud polling an instance in another datacenter or region. I had a few hours this weekend so I blasted out some code. I created a tiny multithreaded Python webservice that polls a list of other nodes and builds a graph database that it displays using the JavaScript visualization library Cytoscape, which was fun to learn by itself. Of course, I packed this all into a container and gave it the catchy name, “GridBug”. Yes, I know, I’m a nerd.
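To make the polling idea concrete, here is a minimal sketch of one node probing a list of peers in parallel and turning the results into graph edges. This is not GridBug's actual code; the function names (`probe`, `poll_peers`), the edge fields and the use of plain HTTP GET probes are all assumptions made for illustration.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return (url, is_up, latency_seconds) for a single HTTP probe."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            up = 200 <= resp.status < 400
    except Exception:
        # Any failure (timeout, refused connection, HTTP error) counts as down.
        up = False
    return url, up, time.monotonic() - start


def poll_peers(self_name, peers, probe_fn=probe, workers=8):
    """Probe every peer concurrently and return directed edges of the grid."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(probe_fn, peers))
    return [
        {"source": self_name, "target": url, "up": up, "latency": round(latency, 3)}
        for url, up, latency in results
    ]
```

Each edge is a plain dictionary, so a list of them could be served as JSON and handed to a graph library in the browser. The thread pool keeps one slow or dead peer from delaying the probes to the others.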

You can throw a GridBug onto any instance, into any datacenter, and it will go to work monitoring connectivity. I didn’t have time to test any serverless options but they should work as well. I set up 5 nodes in 3 locations for a test, with some forced failures to see how it would detect conditions on the grid. The graph data converges over time so that every node can render the same graph. If you want to see it, here is my test and project code: https://github.com/jasonacox/gridbug
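One simple way a set of nodes could converge on the same graph is to exchange edge observations and keep the newest one for each link. The sketch below illustrates that idea only; it is an assumption and not necessarily how GridBug merges its data. The `merge_graphs` name and the `{"up", "ts"}` record shape are hypothetical.

```python
def merge_graphs(local, remote):
    """Merge two edge maps, keeping the most recent observation per edge.

    Each map has the form {(source, target): {"up": bool, "ts": float}},
    where "ts" is the time the observation was made.
    """
    merged = dict(local)
    for edge, obs in remote.items():
        if edge not in merged or obs["ts"] > merged[edge]["ts"]:
            merged[edge] = obs
    return merged


# Two nodes with partially overlapping views of the grid.
node_a = {("a", "b"): {"up": True, "ts": 100.0}}
node_b = {
    ("a", "b"): {"up": False, "ts": 105.0},  # newer observation wins
    ("b", "c"): {"up": True, "ts": 101.0},
}
combined = merge_graphs(node_a, node_b)
```

As long as timestamps for the same edge differ, merging in either direction gives the same result, so nodes that keep swapping their maps end up rendering the same graph.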

I have no expectations on this project. It is clearly just a work of fun I wanted to share with all of you, but it occurs to me that there is still a lesson here. Pain or necessity is a mighty force in terms of inspiration. What bugs you? Like this outage example, is there some pain point that you would love to see addressed? What’s keeping you from trying to fix it? Come up with a project and go to work on it. You are going to learn something! Look, let’s be real, my project here is elementary and buggy at best (sorry, couldn’t resist the pun), but I got a chance to learn something new and see a fun result. That’s what makes projects like this so rewarding. The journey is the point, and frankly, you might even end up with something that brings some value to the rest of our human family. Go create something new this week!

Have a great week!

Automate, Accelerate, Optimize, but first, Delete

“I think it’s very important to have a feedback loop, where you’re constantly thinking about what you’ve done and how you could be doing it better. I think that’s the single best piece of advice: constantly think about how you could be doing things better and questioning yourself.” – Elon Musk

The Tesla Model 3 production line was too slow. Demand was high but delivery was low.  The entire line was being delayed by one particular step in the battery production line.  Specifically, it was a step where a fiberglass mat was added between the battery pack and the floor pan.  Elon Musk talks about the focus that was suddenly placed on this choke point.  In an interview he gave, he says he was basically living on that production line until they could get it fixed.

Automate, Accelerate, Optimize. To address the constraint in the system that was choking the throughput, Elon goes on to explain how they focused on the automation.  To make the robot better, they adjusted the programming.  They increased the speed from 20% to 100%, optimized the paths it would take, increased the torque, removed unnecessary motion and reduced the amount of product needed.  Instead of spackling glue on the entire mat, they programmed it to deliver dabs of glue that were just enough to hold it in place, sandwiching it between the battery pack and floor.  These all added up to some minor time savings.

After investing a lot of time into the efficiency improvements, it occurred to Elon that he didn’t even know the reason for these mats.  He asked the battery safety team, “Are these mats for fire protection?”  They answered, “No, they are for noise and vibration.”  He then went to the noise and vibration analysis team and asked them, “Are these mats for noise reduction?”  They answered, “No, they are for fire safety.”

“I’m trapped in something like a Kafkaesque or Dilbert cartoon!”  Elon discovered the mats had no reason to be included.  After verifying with testing, they eliminated the unneeded parts that were choking the Model 3 production line.  Production throughput increased.

How many times have you optimized a bit of code, a process or a system only to finally realize that the best optimization was to simply delete it?  Before we take on some new work, a new project or even an improvement effort, ask yourself and others, “Do we even need this?”  We all have limited time and resources.  Some upfront investment in validating the real need can pay material dividends.  Seek to eliminate waste.  Instead of focusing on improving unneeded processes, let’s focus our efforts on things that deliver real value and outcomes.  

Before automating, ask yourself if the time investment will deliver more value than we put in.  Before accelerating, ask yourself if the haste will actually eliminate waste.  And before improving something, ask yourself if we should just delete it instead.  Challenge assumptions so we can ultimately deliver bold results that matter.

DevOps Handbook

DevOps Handbook:
How to Create World-Class Agility, Reliability, & Security in Technology Organizations

These notable DevOps luminaries provide a comprehensive definition, patterns and guidance on implementing business-winning DevOps culture and practices within your organization.  Beyond just looking at successful DevOps principles from “unicorn” companies like Google, Amazon, Facebook, Etsy, and Netflix, the authors provide several practical examples and case studies where these same practices are helping traditional enterprise companies like Target, Nordstrom, Raytheon, Nationwide Insurance, CSG, Capital One, and Disney.

The handbook captures several quotes from industry practitioners and unpacks patterns that help promote increased velocity, feedback, experimentation and learning.

Citations from The DevOps Handbook


DevOps Enterprise Summit – London 2016

I once again had the privilege of attending the DevOps Enterprise Summit.  This time it was in the U.K. at the Hilton Metropole.  I was impressed with the representation and talks from companies and organizations across the UK and the rest of Europe:  SAP, ITV, Hiscox, ING, Barclays, HMRC, Zurich, and many more.

Themes that I picked up from these DevOps leaders:

  • People – It’s all about People – empathy, org change, transformation
  • Speed – Continuous Integration and Delivery
  • Quality – Investment in DevOps practices often results in higher quality output
  • Agility – Microservices and Flexible Infrastructure
  • Security – Everyone’s responsibility
  • Business – Focus on Product vs. Project with integration with business in transformation (BizDevOps?)

I was honored to speak again and talk about our DevOps journey at Disney.

Jason Cox DOES16 London

Even though I wasn’t able to record my presentation, ComputerWorld UK provided a great write-up of my talk, and even gave me a new title! 🙂

https://twitter.com/DOESsummitEU/status/749000157136650241

There was considerable interest in our journey to DevOps, especially our transition from Operations Specialists to embedded Systems Engineers.

Other Quotes

“If technology is done well it looks like magic”

https://twitter.com/tumble_b/status/748460191935500288

References

Systems strategy chief Jason Cox details Disney’s devops journey – ComputerWorld UK

Tips for DevOps Success from DOES 2016 – ComputerWorld UK

DevOps Across the Pond – London Reprise – ITproPortal

Overcoming the scale-up challenge of enterprise DevOps adoption – ComputerWeekly.com

Docker + Mesos + Marathon in AWS EC2

I wanted to see if I could get a Docker + Mesos + Marathon platform up and running quickly in AWS EC2 using t2.micro instances.  I found this great article by Gar (@gargar454) where he has put all the components in docker containers and provides a simple  tutorial which I paste below with some minor edits:

Create a Docker Host on EC2

Create your EC2 instance (Amazon AMI t2.micro will work), set it up in your private VPC and auto-assign a public IP so you can test.  You will need to open TCP ports 5050 and 8080 in your security group to your workstation if you want to see the Mesos and Marathon UIs.  Run these commands to set up a docker host:

sudo yum update -y
sudo yum install -y docker
sudo service docker start
# grant ec2-user ability to run docker commands without sudo
sudo usermod -a -G docker ec2-user
# you must exit and log back in to refresh user rights
exit
  1. Capture the local host’s IP in a shell variable
    HOST_IP=`wget -qO- http://instance-data/latest/meta-data/local-ipv4`
  2. Start Zookeeper
    docker run -d \
    -p 2181:2181 \
    -p 2888:2888 \
    -p 3888:3888 \
    garland/zookeeper
  3. Start Mesos Master
    docker run --net="host" \
    -p 5050:5050 \
    -e "MESOS_HOSTNAME=${HOST_IP}" \
    -e "MESOS_IP=${HOST_IP}" \
    -e "MESOS_ZK=zk://${HOST_IP}:2181/mesos" \
    -e "MESOS_PORT=5050" \
    -e "MESOS_LOG_DIR=/var/log/mesos" \
    -e "MESOS_QUORUM=1" \
    -e "MESOS_REGISTRY=in_memory" \
    -e "MESOS_WORK_DIR=/var/lib/mesos" \
    -d \
    garland/mesosphere-docker-mesos-master
  4. Start Marathon
    docker run \
    -d \
    -p 8080:8080 \
    garland/mesosphere-docker-marathon --master zk://${HOST_IP}:2181/mesos --zk zk://${HOST_IP}:2181/marathon
  5. Start Mesos Slave in a container
    docker run -d \
    --name mesos_slave_1 \
    --entrypoint="mesos-slave" \
    -e "MESOS_MASTER=zk://${HOST_IP}:2181/mesos" \
    -e "MESOS_LOG_DIR=/var/log/mesos" \
    -e "MESOS_LOGGING_LEVEL=INFO" \
    garland/mesosphere-docker-mesos-master:latest
  6. Go to the Mesos & Marathon web pages
    # You can find your EC2 instance public IP address with:
    wget -qO- http://instance-data/latest/meta-data/public-ipv4

    Mesos Web Page
    http://<public-ip>:5050

    Marathon Web Page
    http://<public-ip>:8080

Create an App (+ New App button) on the Marathon page to see the task get assigned to a Mesos slave and executed.

Marathon New App - test1

Log in to the slave container and watch the file grow:

docker exec -it mesos_slave_1 /bin/bash

tail -f /tmp/running.out
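The New App step can also be done without the UI by posting an app definition to Marathon's REST API at `/v2/apps`. The Python sketch below uses only the standard library. The app definition is an assumption: the original form values are not shown here, so the `cmd` that appends to `/tmp/running.out` and the small `cpus` and `mem` values are illustrative, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_app(app_id="test1",
              cmd="while true; do date >> /tmp/running.out; sleep 5; done"):
    """Return a minimal Marathon app definition (illustrative values)."""
    return {"id": app_id, "cmd": cmd, "cpus": 0.1, "mem": 32, "instances": 1}


def build_request(marathon_url, app):
    """Build the POST request for Marathon's /v2/apps endpoint."""
    return urllib.request.Request(
        url=f"{marathon_url.rstrip('/')}/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_app(marathon_url, app):
    """Send the app definition to Marathon and return its JSON reply."""
    with urllib.request.urlopen(build_request(marathon_url, app), timeout=10) as resp:
        return json.load(resp)
```

With the stack above running, calling `create_app("http://<public-ip>:8080", build_app())` should make the task appear in the Marathon UI, after which the `tail -f` command shows the file growing inside the slave container.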

DevOps Enterprise Summit 2015

It was great to attend the DevOps Enterprise Summit again this year. The 2015 edition saw more than double the number of attendees of the 2014 conference with presentations from companies all over the world. There is definitely a feeling that DevOps is awakening across the enterprise.

I had the privilege of presenting again on Disney’s Journey,
“Disney DevOps – The Enterprise Awakens.”

Fellow DevOps Avengers from all over the world converged to swap stories, share new insights and technology, and encourage each other to keep moving forward as the Force of positive change in our various industries.

DevOps Enterprise | The Agile, Continuous Delivery and DevOps Transformation Summit http://devopsenterprise.io

Some great reviews and observations:

Impressions from DevOps Enterprise Summit 2015

Gene Kim and Others Share What DevOps is Really “All About”

Impressions from DevOps Enterprise Summit 2015 – Mirco Hering – Accenture

Infoworld – Gene Kim explains the joy of DevOps

Selection of favorite quotes:

  • “If you name your servers and treat them like pets they all develop individual personalities.” – @jasonacox
  • “Times and conditions change so rapidly that we must keep our aim constantly focused on the future.” – Walt Disney
  • “If you don’t know where you’re going, any road will take you there.” – @jasonacox
  • “Believe in what you do, and make a difference” – @jasonacox

DevOps Enterprise Summit 2014

I had the privilege of speaking at the first annual DevOps Enterprise Summit.

Jason Cox - DOES 2014

DevOps Enterprise | The Agile, Continuous Delivery and DevOps Transformation Summit http://devopsenterprise.io

Wired Magazine – Web Insights
DevOps Innovation Takes Place in the Enterprise by Steve Brodie

Red Hat Developer Blog
DevOps Enterprise Conference — Day One

Electric-Cloud
DevOps Enterprise Summit Puts Agile Transformation Front and Center
What we learned at DevOps Enterprise Summit 2014

Software Development Times
DevOps Enterprise Summit bashes silos

PuppetLabs Blog
Disney’s DevOps Journey: A DevOps Enterprise Summit Reprise

Tweets
http://seen.co/event/devops-enterprise-the-agile-continuous-delivery-and-devops-transformation-summit-san-francisco-airport-marriott-waterfront-2014-949/highlight/38769

Selection of quotes:

  • “You’ve got to pick technology that transforms. You need to find tools that change the way you think.” – Jason Cox from Disney
  • “It’s all about the people! We have to empower the edge.” – Jason Cox from Disney at #DOES14
  • #DOES14 Jason Cox of Disney: Remove waste. (Motion != Work)
  • Jason Cox #DevOps guy at @disney promotes positive rebellion. Red tape removers! #DOES14
  • “DevOps professionals must be courageous. They must have the candor to say that there are issues and be renegades enough to constantly challenge the status quo.” – Jason Cox

 

The Phoenix Project

The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

by Gene Kim, Kevin Behr, & George Spafford

Digital transformation has lit a fire that is burning through traditional IT shops across the world, telling us to adapt or be reduced to the ash heap of obsolescence. The Phoenix Project grapples with this disruptive change by telling an engaging story about an IT manager who is thrust into this fiery turmoil.  The main character is ushered unwillingly and unprepared into a new leadership role where he uncovers the complex and unrelenting problems any IT shop knows all too well.  Fair warning: If you have any history working in, leading or managing IT teams, you will likely have a visceral reaction to the narrative and even suspect the authors were spying on your organization.  But there is hope!  The main character in the story finds enlightenment and begins to implement changes based on the “Three Ways” discussed in this book, changes that ultimately transform the teams, balance life for the employees and successfully propel the business forward.

IT leaders will find the narrative and advice relevant, potent and inspiring.  By applying the principles learned in this novel, your technology organization, like the phoenix of old, can rise from the ashes to take on the challenges of our new Digital age.


There is hope!  Technology workers and leaders will find the narrative and advice woven throughout The Phoenix Project relevant, visceral, potent and inspiring.  By applying the principles learned in this novel, business and technology leaders will see how they can transform their own organizations, identify and break down unnecessary silos, improve life for technology workers, and successfully propel their businesses forward in this new Digital age.