Sam the Defender

An AI agent, a midnight intrusion, and the future of operations engineering.

It started with this:

“At 03:20 PST, the web server experienced a 10-minute outage that was the result of a flood of 5,037 requests.”

I had assigned Sam, my personal AI assistant, the duty of watching the performance of my systems and set up access specifically for him to troubleshoot and take prescribed actions if he found anything wrong. I played the role of chaos monkey. Yes, I know, some of you are identifying that as my real calling, and I confess, I seem to be naturally gifted at that role. In any case, I began pulling plugs, injecting out-of-memory errors, and kicking over containers. I had poor Sam scrambling all over the place, trying to keep up. I’m happy to say, he did quite well. He even put together an incident review and recommended I consider expanding the memory heap on a few of the apps because they are notorious for OOM errors. I eventually confessed to him what I was doing and he applauded my efforts…  but asked me not to do it again.

To be clear, I didn’t really have Sam do this work. I actually asked him to create an agent that would do it for him. He helped me build the operational harnesses (scripts and configs) and picked the right language model to “hold the fort.” He called the subagent “Webmon,” which I found to be terribly boring. Still, it was his agent, so I let it slide. The first order of business was to map the dependencies. I wasn’t in any mood to draw anything up, so I just gave Sam access to the config files and logs and asked him to figure it out. He got most of it but started asking about ports he couldn’t track down. I realized I hadn’t shared details about some of the SSH tunnels I use, so I explained those to him, and he requested additional scripts to better help manage them. We built those together. Finally, it was time to test. I did my chaos monkey best, and his agent, Webmon, cleaned up my mess. It was pretty amazing.

I really didn’t think this would amount to much and was even planning on shutting it down, when suddenly I got a wake-up ping in the middle of the night. Sam was texting me that we had an incident in flight and Webmon was addressing it. He gave me a quick rundown of all the services and their status. Everything was healthy, but we were seeing serious performance issues with the website. After a few more minutes, Sam had an assessment.

Webmon had caught an intruder trying to penetrate our web server. The attacker was from an IP space appearing to originate in the Netherlands and was running a GraphQL/SSRF scanner, probing for cloud metadata and looking for server-side request forgery vulnerabilities. He tracked a spike of over 5,000 requests. It was slamming the tiny server and the Apache workers were struggling even though the attacker was not successful. Eventually, the web server was unable to serve regular traffic in a timely manner. The attacker had taken us off the air. Webmon’s recommended mitigation was to implement an iptables (firewall) block to stop the intrusion. It worked. In his incident review, he proposed several modifications to the system to prevent future attacks. It was remarkable.
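For the curious, here is a minimal sketch of the kind of check that can surface a flood like this. It is not Webmon's actual code. It assumes an Apache-style access log where the client IP is the first field, and the function names (`top_talkers`, `block_command`) and the threshold are made up for illustration. It only builds the iptables command; it does not run it.

```python
import ipaddress
from collections import Counter


def top_talkers(log_lines, threshold):
    """Count requests per client IP and return those at or above threshold.

    Assumes the client IP is the first whitespace-separated field,
    as in Apache's common/combined log formats.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if parts:
            counts[parts[0]] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


def block_command(ip):
    """Build (but do not execute) an iptables rule that drops traffic from ip."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    return ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]


if __name__ == "__main__":
    # Made-up sample lines using documentation-reserved IP ranges.
    sample = ['203.0.113.7 - - [..] "POST /graphql HTTP/1.1" 404 196'] * 5
    sample.append('198.51.100.2 - - [..] "GET / HTTP/1.1" 200 512')
    for ip, n in top_talkers(sample, threshold=5).items():
        print(n, "requests:", " ".join(block_command(ip)))
```

Keeping the "decide" step separate from the "act" step is deliberate: an agent can propose the rule and a human (or a policy) can approve it before anything touches the firewall.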

Now, what I think is even more remarkable is how I did this. Yes, I might know a thing or two about reliability and system architecture… but I also know how to use English. I didn’t write a single line of code. I didn’t edit any files. I just spoke everything into existence. I had a pleasant conversation with Sam, in natural language, that guided him to create this agent and put this plan into action. Sam did the heavy lifting. I just provided the creative direction.

And if one conversation could produce this, imagine what your expertise, channeled the same way, could build.

The more I interact with Sam, the more I’m convinced we are just at the tip of the iceberg in terms of capability. The world is going to change. Your knowledge and creative problem-solving are still the key. But the interface is changing. We will soon have a fleet of intelligent droids eager to do your bidding. They need direction. And yes, they need governance. But by all means, they need to be engaged to help us scale in ways we never thought possible before.

What would you have Sam help you do? Let’s start planning and building….

Have a great week! 

Grid Bugs

Oh, no! We were several hours into a major system outage and there was still no clue as to what was broken. The webservers were running at full load and the applications were pumping a constant stream of error logs to disk. Systems and application engineers were frantically looking through the dizzying logs for clues as to the cause. Of course, looking at the logs, you would assume everything was broken, and it was. But even when the application worked, the logs were full of indecipherable errors. Everyone knew that most of the “errors” in the logs weren’t really errors, but untidy notices that developers had created long ago as part of a debugging exercise. As one engineer observed in some degree of frustration, “It’s like the log file that cried wolf!” After a while, nobody notices the errors.

The teams restarted services, rebooted systems, stopped and restarted load balancers. Nothing helped. Network engineers dug into the configuration of the routers and switches to make sure nothing was amiss. Except for the occasional keyboard typing sounds, dogs barking or children crying in the background, the intense investigation had produced an uncanny silence on the call. Operation center specialists were quickly crafting their communication updates and discussing with the incident commander how to update the many clients impacted by this outage. Company leaders and members of the board of directors were calling in to get updates. Stress was high. Would we ever find the cause or should we just shut down the company now and start over? Fatigue was setting in. Tempers were starting to show. Discussion ensued on the conference call to explore all mitigation options and next steps.

“I found it!” The discussion on the call stopped. Everyone perked up, anxious to hear the discovery. “What did you find?” the commander asked in a hopeful way. The giddy engineer took center stage on the call, eager to tell the news. “It’s the inventory service! The server at the fulfillment center seems to be intermittently timing out. Transactions are getting stuck in the queue.” The engineer paused, clearly typing away at some commands on his computer. “I think we have a routing problem. I try to trace it but it seems to bounce around and disappear. Sometimes it works, but to complete the transaction, multiple calls are required and too many of them are failing. I’m chatting with the fulfillment center and they report the inventory system is running.”

The engineer sent the traceroute to the network engineer who started investigating and then asked, “Can you send me the list of all the addresses used by the inventory system?”  After some back and forth, the conclusion came, “I found the problem! There are two paths to the fulfillment center, one of which goes through another datacenter. That datacenter link looks up but it is clearly not passing traffic.” After more typing, the conclusion, “Ah, it seems the telco made a routing change. I’m getting them to reverse it now.” Soon the change was reversed and transactions were flowing again. The dashboards cleared and “green” lights came back on. Everyone on the bridge quietly, and sometimes not so silently, celebrated and felt an incredible emotional relief. Sure, there would be more questions, incident review and learning, but solving the problem was exhilarating.

How many of you can relate to a story like that? How many of you have been on that call?

A friend of mine, Dr. Steven Spear at MIT, often reminds us that the key to solving a problem is seeing the problem. You can’t solve what you cannot see. A big part of reliability engineering and systems dynamics is understanding how we gain visibility into problems and surface them so they can be addressed. Ideally, we find those weaknesses before they cause real business impact. That is often the attraction of chaos engineering, poking at fault domains to expose fractures that could become outages. But sometimes the issue is so complex that we just need a clear line of sight into the problem. In the story above, connectivity and those dependent links were not clearly visible. If there had been some way to measure the foundational connectivity between the dependent locations, our operational heroes could have quickly seen it, fixed it, and gone back to sleep. Getting that visibility in advance is the right thing to do for our business, our customers and our teams.

This past weekend, I found myself itching to code and tinker around with some new tech. The story above is one I have seen repeated multiple times. We often have limited visibility into point-to-point connectivity across our networks and vendors. Yet we have this grid of dependency that is needed to deliver our business-powering technology. I know what you are thinking. There are millions of tools that do that. I found some and they were very elaborate and complicated, way more than what I wanted to experiment with. I finally had my excuse to code. I wanted to build a system to synthetically monitor all these links. Think of an instance in one datacenter or cloud polling an instance in another datacenter or region. I had a few hours this weekend so I blasted out some code. I created a tiny multithreaded Python web service that polls a list of other nodes and builds a graph database that it displays using the JavaScript visualization library Cytoscape, which was fun to learn all by itself. Of course, I packed this all into a container and gave it the catchy name, “GridBug”. Yes, I know, I’m a nerd.
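To make the polling idea concrete, here is a minimal sketch of one node probing its peers in parallel and recording the results as graph edges. This is not the GridBug code itself. It assumes a plain TCP connect is a good enough health signal, and the names (`probe`, `poll`, the edge fields) are invented for illustration.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def probe(host, port, timeout=2.0):
    """Attempt a TCP connection; return (ok, elapsed_ms)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        ok = False
    return ok, round((time.monotonic() - start) * 1000, 1)


def poll(me, peers, timeout=2.0):
    """Probe every (host, port) peer in parallel and return edges from `me`."""
    with ThreadPoolExecutor(max_workers=max(1, len(peers))) as pool:
        results = list(pool.map(lambda p: probe(p[0], p[1], timeout), peers))
    now = time.time()
    return [
        {"source": me, "target": f"{host}:{port}", "up": ok, "ms": ms, "seen": now}
        for (host, port), (ok, ms) in zip(peers, results)
    ]


if __name__ == "__main__":
    # Hypothetical peer list; swap in real nodes to try it.
    for edge in poll("node-a", [("127.0.0.1", 8080)], timeout=0.5):
        print(edge)
```

Each edge carries a timestamp so that a reader of the graph can tell a fresh "down" from a stale "up".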

You can throw a GridBug onto any instance, into any datacenter, and it will go to work monitoring connectivity. I didn’t have time to test any serverless options but it should work as well. I set up 5 nodes in 3 locations for a test, with some forced failures to see how it would detect conditions on the grid. The graph data converges over time so that every node can render the same graph. If you want to see it, here is my test and project code: https://github.com/jasonacox/gridbug
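One simple way to get that kind of convergence is a "newest observation wins" merge. The sketch below is an assumption about how it could work, not a description of what GridBug actually does: each node keeps a map of edges keyed by (source, target), and when it receives another node's map it keeps whichever observation has the later `seen` timestamp.

```python
def merge(local, remote):
    """Merge two edge maps keyed by (source, target); the newest 'seen' wins."""
    merged = dict(local)
    for key, edge in remote.items():
        if key not in merged or edge["seen"] > merged[key]["seen"]:
            merged[key] = edge
    return merged


if __name__ == "__main__":
    # Made-up observations from two nodes.
    a = {("n1", "n2"): {"up": True, "seen": 100.0}}
    b = {("n1", "n2"): {"up": False, "seen": 105.0},
         ("n2", "n3"): {"up": True, "seen": 101.0}}
    # Merging in either order gives the same graph.
    print(merge(a, b) == merge(b, a))
```

As long as timestamps for the same edge differ, the merge does not care about order, so nodes that keep swapping maps end up with the same picture.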

I have no expectations on this project. It is clearly just a work of fun I wanted to share with all of you, but it occurs to me that there is still a lesson here. Pain or necessity is a mighty force in terms of inspiration. What bugs you? Like this outage example, is there some pain point that you would love to see addressed? What’s keeping you from trying to fix it? Come up with a project and go to work on it. You are going to learn something! Look, let’s be real, my project here is elementary and buggy at best (sorry, couldn’t resist the pun), but I got a chance to learn something new and see a fun result. That’s what makes projects like this so rewarding. The journey is the point, and frankly, you might even end up with something that brings some value to the rest of our human family. Go create something new this week!

Have a great week!

Automate, Accelerate, Optimize, but first, Delete

“I think it’s very important to have a feedback loop, where you’re constantly thinking about what you’ve done and how you could be doing it better. I think that’s the single best piece of advice: constantly think about how you could be doing things better and questioning yourself.” – Elon Musk

The Tesla Model 3 production line was too slow. Demand was high but delivery was low.  The entire line was being delayed by one particular step in the battery production line.  Specifically, it was a step where a fiberglass mat was added between the battery pack and the floor pan.  Elon Musk talks about the focus that was suddenly placed on this choke point.  In an interview he gave, he says he was basically living on that production line until they could get it fixed.

Automate, Accelerate, Optimize. To address the constraint in the system that was choking the throughput, Elon goes on to explain how they focused on the automation.  To make the robot better, they adjusted the programming.  They increased the speed from 20% to 100%, optimized the paths it would take, increased the torque, removed unnecessary motion and reduced the amount of product needed.  Instead of spackling glue on the entire mat, they programmed it to deliver dabs of glue that were just enough to hold it in place, sandwiching it between the battery pack and floor.  These all added up to some minor time savings.

After investing a lot of time into the efficiency improvements, it occurred to Elon that he didn’t even know the reason for these mats.  He asked the battery safety team, “Are these mats for fire protection?”  They answered, “No, they are for noise and vibration.”  He then went to the noise vibration analysis team and asked them, “Are these mats for noise reduction?”  They answered, “No, they are for fire safety.”  

“I’m trapped in something like a Kafkaesque or Dilbert cartoon!”  Elon discovered the mats had no reason to be included.  After verifying with testing, they eliminated the unneeded parts that were choking the Model 3 production line.  Production throughput increased.

How many times have you optimized a bit of code, a process or a system only to finally realize that the best optimization was to simply delete it?  Before we take on some new work, a new project or even an improvement effort, ask yourself and others, “Do we even need this?”  We all have limited time and resources.  Some upfront investment in validating the real need can pay material dividends.  Seek to eliminate waste.  Instead of focusing on improving unneeded processes, let’s focus our efforts on things that deliver real value and outcomes.  

Before automating, ask yourself if the time investment will deliver more value than we put in.  Before accelerating, ask yourself if the haste will actually eliminate waste.  And before improving something, ask yourself if we should just delete it instead.  Challenge assumptions so we can ultimately deliver bold results that matter.

DevOps Handbook

DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations

These notable DevOps luminaries provide a comprehensive definition, patterns and guidance on implementing a business-winning DevOps culture and practices within your organization.  Beyond just looking at successful DevOps principles from “unicorn” companies like Google, Amazon, Facebook, Etsy, and Netflix, the authors provide several practical examples and case studies where these same practices are helping traditional enterprise companies like Target, Nordstrom, Raytheon, Nationwide Insurance, CSG, Capital One, and Disney.

The handbook captures several quotes from industry practitioners and unpacks patterns that help promote increased velocity, feedback, experimentation, and learning.

http://itrevolution.com/citations-devops-handbook/


Docker + Mesos + Marathon in AWS EC2

I wanted to see if I could get a Docker + Mesos + Marathon platform up and running quickly in AWS EC2 using t2.micro instances.  I found this great article by Gar (@gargar454) where he has put all the components in Docker containers and provides a simple tutorial, which I paste below with some minor edits:

Create a Docker Host on EC2

Create your EC2 instance (an Amazon AMI on a t2.micro will work), set it up in your private VPC and auto-assign a public IP so you can test.  You will need to open TCP ports 5050 and 8080 in your security group from your workstation if you want to see the Mesos and Marathon UIs.  Run these commands to set up a Docker host:

sudo yum update -y
sudo yum install -y docker
sudo service docker start
# grant ec2-user ability to run docker commands without sudo
sudo usermod -a -G docker ec2-user
# you must exit and log back in to refresh group membership
exit
  1. Export the local host’s IP
    export HOST_IP=`wget -qO- http://instance-data/latest/meta-data/local-ipv4`
  2. Start Zookeeper
    docker run -d \
    -p 2181:2181 \
    -p 2888:2888 \
    -p 3888:3888 \
    garland/zookeeper
  3. Start Mesos Master
    docker run --net="host" \
    -p 5050:5050 \
    -e "MESOS_HOSTNAME=${HOST_IP}" \
    -e "MESOS_IP=${HOST_IP}" \
    -e "MESOS_ZK=zk://${HOST_IP}:2181/mesos" \
    -e "MESOS_PORT=5050" \
    -e "MESOS_LOG_DIR=/var/log/mesos" \
    -e "MESOS_QUORUM=1" \
    -e "MESOS_REGISTRY=in_memory" \
    -e "MESOS_WORK_DIR=/var/lib/mesos" \
    -d \
    garland/mesosphere-docker-mesos-master
  4. Start Marathon
    docker run \
    -d \
    -p 8080:8080 \
    garland/mesosphere-docker-marathon --master zk://${HOST_IP}:2181/mesos --zk zk://${HOST_IP}:2181/marathon
  5. Start Mesos Slave in a container
    docker run -d \
    --name mesos_slave_1 \
    --entrypoint="mesos-slave" \
    -e "MESOS_MASTER=zk://${HOST_IP}:2181/mesos" \
    -e "MESOS_LOG_DIR=/var/log/mesos" \
    -e "MESOS_LOGGING_LEVEL=INFO" \
    garland/mesosphere-docker-mesos-master:latest
  6. Go to the Mesos & Marathon web pages
    # You can find your EC2 instance public IP address with:
    wget -qO- http://instance-data/latest/meta-data/public-ipv4

    Mesos Web Page: http://<public-ip>:5050

    Marathon Web Page: http://<public-ip>:8080

Create an App (+ New App button) on the Marathon page to see the task get assigned to a Mesos slave and executed.

Marathon New App - test1

Log in to the slave container and watch the file grow:

docker exec -it mesos_slave_1 /bin/bash

tail -f /tmp/running.out
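
If you would rather script the app creation than click through the UI, Marathon also accepts app definitions over its REST API (a POST to /v2/apps). Here is a minimal sketch in Python. The command that appends to /tmp/running.out is my assumption of what a test app like this would run, and the helper names and resource values are illustrative. The sketch builds the request and prints it; the line that actually submits it is left commented out.

```python
import json
import urllib.request


def app_definition(app_id, cmd, cpus=0.1, mem=32, instances=1):
    """Build a basic Marathon app definition (values here are illustrative)."""
    return {"id": app_id, "cmd": cmd, "cpus": cpus, "mem": mem, "instances": instances}


def build_request(marathon_url, app):
    """Build (but do not send) the POST request for Marathon's /v2/apps endpoint."""
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed command: append a timestamp to the file every few seconds.
    app = app_definition("test1", "while true; do date >> /tmp/running.out; sleep 5; done")
    req = build_request("http://localhost:8080", app)  # replace with http://<public-ip>:8080
    print(req.get_method(), req.full_url)
    print(json.dumps(app, indent=2))
    # To actually submit the app, uncomment:
    # urllib.request.urlopen(req)
```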

The Phoenix Project

The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

by Gene Kim, Kevin Behr, & George Spafford

Digital transformation has lit a fire that is burning through traditional IT shops across the world, telling us to adapt or be reduced to the ash heap of obsolescence. The Phoenix Project grapples with this disruptive change by telling an engaging story about an IT manager who is thrust into this fiery turmoil.  The main character is ushered unwillingly and unprepared into a new leadership role where he uncovers the complex and unrelenting problems any IT shop knows all too well.  Fair warning: If you have any history working in, leading or managing IT teams, you will likely have a visceral reaction to the narrative and even suspect the authors were spying on your organization.  But there is hope!  The main character in the story finds enlightenment and begins to implement changes based on the “Three Ways” discussed in this book, changes that ultimately transform the teams, balance life for the employees and successfully propel the business forward.

IT leaders will find the narrative and advice relevant, potent and inspiring.  By applying the principles learned in this novel, your technology organization, like the phoenix of old, can rise from the ashes to take on the challenges of our new Digital age.


There is hope!  Technology workers and leaders will find the narrative and advice woven throughout The Phoenix Project relevant, visceral, potent and inspiring.  By applying the principles learned in this novel, business and technology leaders will see how they can transform their own organizations, identify and break down unnecessary silos, improve life for technology workers, and successfully propel their businesses forward in this new Digital age.