Devops Conference Writeups - June 2011
Intro
When I worked at Transmeta, there was a cultural tradition among the technical staff that, after going to a trade conference on the company dime, you owed everyone a writeup of what you’d seen and learned. The purpose of the writeup was twofold: first, to prove that you’d actually gone to sessions and not spent the previous week drinking beer in Boca Raton; second, to share any cool new tidbits with your co-workers who weren’t able to go. There were interesting follow-on effects too: knowing you’d have to do a writeup encouraged taking good notes instead of reading your email when your attention started to wander, and broadcasting info about new technologies or approaches that you found interesting tended to increase the chance they’d actually get used.
So it’s in that spirit I’ve written up my experiences at Velocity 2011, which I attended Wednesday June 15, and Day 1 of Devopsdays, which I attended Friday June 17. Throughout I’ve flagged actionable stuff to work on or investigate further with ‘XXX’ tags.
Velocity
Keynote
First up on Wednesday morning was a keynote by Mark Burgess, the author of cfengine and an important figure in the field of systems automation and configuration management. I’ve heard Mark speak at conferences before and participated heavily in cfengine development and bug-fixing for several years, so I was excited to hear his current thinking on the state of the field. The central metaphor for his talk was Change = Mass x Velocity. Mass, he explained, could be both helpful and harmful — as muscle (CPU power or money) it could help get work done, but as dead weight (bureaucracy, policy) it could slow change down. Velocity in this formula is a measure of how quickly an organization can move to effect change. Mark contrasted a hang glider, which needs small, slight movements in order to remain stable and whose enemy is a sudden downdraft, against a fighter jet, which remains stable by having enough engine power to fly upside down and turn on a penny [sic]. The fighter jet was like a nimble web startup with multiple code deployments per day and the hang glider was more like a bank with less tolerance for risk (this seemed to me to strain the metaphor pretty hard). Things took an odd turn toward the end of the talk when he introduced his concept of “promise theory”. Systems that implement promise theory (of which cfengine is the only one in existence) provide predictability through autonomically converging on desired system state. Or something. This part was a bit hard to follow. My notes say “knowledge-based systems — what the hell is he on about?” This is a problem I’ve had before with Burgess’ work: it seems very airy and theoretical, disconnected from real-world problems. The conclusion, I think, was that the future is going to be complex but wonderful.
Lightning Talks
Lightning talks are brief (5-10 minute) demos of works-in-progress that are interesting and cool but not fully developed enough for their own time-slot.
Up first was the next version of yslow, Yahoo’s web performance analysis tool. I hadn’t seen a demo of the tool itself before so that was neat, and the next version includes a set of sliders you can use on each metric to determine what the best bang for your optimization buck would be. But the author seemed proudest of your ability to share the yslow results for a particular page/site on Facebook, which seemed of dubious utility to me.
Next was another web page performance test/analysis demo for webpagetest.com, then a live demo showing how you could do remote target debugging of JavaScript running on an iPhone/Android device with Weinre (pronounced ‘Winery’). The last talk was from HTTP Archive, which is now a co-project with the Internet Archive “Wayback Machine” but which tracks performance, metadata and content types rather than actual website content. The presenter showed some interesting trends over time (like diminishing use of Flash in the last six months) and differences between the Top 100 websites and Top 1000 (higher-ranked sites tended to be smaller and faster, with greater use of CDNs).
Exhibit Break
The next two talks were more front-end webpage / JavaScript performance so I stepped out to visit the vendor exhibition area. There were lots of companies with ‘cloud’ in their names, quite a few external end-to-end monitoring sites like Gomez and Keynote, and (unsurprisingly) a large O’Reilly booth with a ton of their back-catalogue for perusal on iPads. I complained a bit about the old Safari website and the guys insisted that there was an all-new, fast, much more usable revamp of it now. Also of interest: all new ORA purchases come with a DRM-free digital edition in your choice of format if you pay 110% of the purchase price. This offer is also retroactive if you call customer support, so if you own ORA titles and want to transfer them to your iPad, give them a call and they will work it out.
Tim O’Reilly gave a brief talk during this time but it was largely content-free so I won’t summarize here. “You guys are awesome but I am much smarter than most of you” was the take-away.
MCollective for Monitoring
Jeff McCune from Puppet Labs did a live demo of using MCollective for monitoring. He stepped through the basic client-server model of MCollective and how it uses the message queue for each command to do discovery to find nodes matching any filter criteria, distribute RPC calls to target hosts, and collect/collate/display the responses. The specific monitoring case was a plugin that checks whether puppet has run recently on all nodes and raises a single alert with the hostnames of any nodes with stale state files, no catalog, etc. Then to take it a step further (since MCollective only works on hosts which are connected to the bus at the point in time when the message is sent), he showed how registration plugins send periodic messages with metadata about each node, so a receiver can write those messages out to files. Another nagios plugin simply checks the contents of the registration directory for stale files to find nodes that are no longer connected to the bus. This part of the demo had a little bug because the plugin reported the name of the node it ran on, not the stale node names, but the point was still valid — you need checks, and then checks to check those. The only thing new to me was the nagios checks, which (XXX) we should implement as we adopt MCollective.
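So I don’t forget the shape of it, here’s a rough sketch of how such a check could look. This is my own illustration, not Jeff’s actual plugin; the registration directory path and the staleness threshold are assumptions.

    #!/usr/bin/env python
    # Rough nagios-style check for stale MCollective registration files.
    # REGISTRATION_DIR and MAX_AGE_SECONDS are assumptions for illustration.
    import os
    import sys
    import time

    REGISTRATION_DIR = "/var/lib/mcollective/registration"   # hypothetical path
    MAX_AGE_SECONDS = 30 * 60                                 # consider a node stale after 30 minutes

    now = time.time()
    node_files = os.listdir(REGISTRATION_DIR)
    stale = [f for f in node_files
             if now - os.path.getmtime(os.path.join(REGISTRATION_DIR, f)) > MAX_AGE_SECONDS]

    if stale:
        # Report the names of the nodes that have dropped off the bus (the bug in the
        # demo was reporting the checking node's own name instead of these).
        print("CRITICAL: stale registrations: " + ", ".join(sorted(stale)))
        sys.exit(2)

    print("OK: all %d registered nodes are fresh" % len(node_files))
    sys.exit(0)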
Yahoo Frontpage Downtime
Jake Loomis, an operations VP at Yahoo, gave a very interesting talk entitled “Why the Yahoo frontpage went down, and why it hadn’t gone down for a decade before that”. He described a “perfect storm” of a week earlier this year when natural disasters, the British royal wedding, and Bin Laden’s assassination all happened within a few days of each other, driving Yahoo frontpage traffic to unprecedented levels. He had five broad categories of tips, based on both their normal practice for uptime and the lessons learned from this experience.
Design for failure. Yahoo has a system for progressively simplifying the content it delivers: from highly personalized, localized page rendering, down to a non-signed-in experience, and ultimately down to a static page generated from cron that looks “good enough” to get by but takes almost no effort to serve.
Build a safe environment for change. Techniques they use are continuous integration, running exactly the same monitoring/alerting in QA as in production (XXX !!), periodically forking production traffic and replaying it against new builds to find behavioural changes, and a “Dark Launch” methodology where they can deploy code but not activate the features it contains, in order to make sure there aren’t regressions or unexpected side-effects.
Recover quickly by performing instant rollback. They fail out a datacenter each time they do a large roll, so it’s common, well-established practice, not a scary one-time thing. This also lets them drop a datacenter that has new, bad code and isolate it quickly. Global load balancing is key to doing this effectively.
Monitor everything, and be able to shed features when a hotspot is detected. XXX The question I have for our service is, how can we degrade gracefully? What concurrency knobs exist and how can we manipulate them in real time if a flood hits? What are the critical services that should be protected and saved first? (A rough sketch of one such knob follows these tips.)
Have fallback plans and be able to execute them. XXX Jake’s related point was about developing a method for implementing and then following up on changes that result from post-mortem corrective actions. All too often a post-mortem provides good root cause analysis and good corrective actions, but without a common practice of testing whether the follow-through completed and actually worked, the same problems can rear up again.
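For my own reference, here’s a minimal sketch of the kind of knob I mean: a flag to shed an expensive feature entirely, plus a cap on how many requests may use it concurrently. The names and numbers are invented for illustration and have nothing to do with Yahoo’s actual system.

    import threading

    FEATURE_FLAGS = {"personalized_modules": True}            # flip to False to shed the feature entirely
    personalization_slots = threading.BoundedSemaphore(50)    # at most 50 concurrent expensive renders

    def render_frontpage(user):
        # Serve the full experience only if the feature is on and a slot is free;
        # otherwise fall back to the cheap, "good enough" page.
        if FEATURE_FLAGS["personalized_modules"] and personalization_slots.acquire(blocking=False):
            try:
                return render_personalized(user)
            finally:
                personalization_slots.release()
        return render_static_fallback()

    def render_personalized(user):
        # Placeholder for the expensive, highly personalized rendering path.
        return "<html>hello, %s</html>" % user

    def render_static_fallback():
        # Placeholder for the cron-generated static page.
        return "<html>hello, world</html>"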
I thought this was a great talk; Jake presented well, took tough questions from the audience, and gave me a lot of food for thought.
Databases in Devops
Robert Treat from Omniti has worked as both DBA and sysadmin and currently consults on integrating databases more closely with agile operations workflow. This presentation was also of keen interest to me, as we’ve had a proliferation of databases in our environment and he very quickly articulated a few of the pain points we’ve been experiencing. Beginning with a definition of devops as a focus on three areas — config management, monitoring, and software deployment — Robert expressed a familiar view that DBAs tend to think of systems automation as a hassle because databases don’t generally scale horizontally, so there’s less apparent need for it. There are powerful counterarguments to this view though, which express the value proposition in DBA terms:
- even if you’re not scaling horizontally, there’s a lot to configure on that one database host.
- change management and auditability are good things generally speaking, even on a single host.
- and anyway, it rarely actually is one host! There are standbys, slaves, and reporting hosts, all of which feed into DBA responsibilities.
So the question becomes “how confident are you that everything on these hosts has been configured correctly?” And it can be a lot easier at that point to gain DBA buy-in for config management automation.
With respect to monitoring, most of the time we think about system-level monitoring like load average, which again can be a hard sell to DBAs. The pitch improves if we make it easy to add graphs and trending for useful database-level metrics like full table scans (queries per second is not a particularly useful metric). One other sticking point Robert had seen was that “DBAs can be finicky about data quality”, and since trending graphs are pretty crude metrics they might not want to use them at all; however, trending and correlation are more important than perfection, and an imprecise but consistent metric can still provide useful data. Baselining, comparison over time, capacity planning and forensics are all enabled by having some data to go from. One thing (XXX) in Robert’s graphs that I thought was helpful was the vertical lines overlaid to indicate the timing of code pushes and other change events - making it really obvious if there is some correlation. “Password protected metrics are ridiculous. Expose them to everybody.”
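Here’s a tiny sketch of that overlay idea, just to capture the technique; the metric, values, and deploy times are all made up.

    import random
    import matplotlib.pyplot as plt

    minutes = list(range(240))
    full_table_scans = [random.gauss(20, 3) for _ in minutes]   # fake metric samples
    deploys = [60, 180]                                         # minutes at which code was pushed (made up)

    plt.plot(minutes, full_table_scans, label="full table scans / min")
    for i, d in enumerate(deploys):
        # One dashed vertical line per change event makes correlation obvious at a glance.
        plt.axvline(d, color="red", linestyle="--", label="code push" if i == 0 else None)
    plt.xlabel("minutes")
    plt.legend()
    plt.savefig("scans_with_deploys.png")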
Relating to the third tenet of devops (code deployment), Robert raised some interesting points about the difference between ops, development and databases. A database doesn’t work on a push schedule because the data persists across code pushes, and if it goes away it’s hard to get back — you can write new code but you can’t tell your users to re-input their information. Code is static (the same code runs the same everywhere) but data is dynamic… and control of the data is in the hands of the end users.
In closing, Robert gave some tips from his experience for success with implementing devops techniques around databases:
- devs should design schema because they best understand the requirements and application flow
- schema design should have a style guide just like code (table/column naming conventions)
- schema changes should always be migrations so they can be re-applied (a rough sketch of the bookkeeping follows this list)
- pick reasonable, controlled change windows to do schema changes and decouple them from code pushes; there’s no need to have them synchronized since an unused column can just sit there.
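Here’s a minimal sketch of what that migration bookkeeping can look like. It uses sqlite3 only so it runs anywhere, and the example schema changes are invented; the point is that each change is a numbered, recorded step that any copy of the database can replay.

    import sqlite3

    MIGRATIONS = [
        (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
        (2, "ALTER TABLE users ADD COLUMN email TEXT"),   # the unused column can just sit there until code needs it
    ]

    def migrate(db):
        # Record which steps have already been applied, then apply only the missing ones.
        db.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
        applied = set(row[0] for row in db.execute("SELECT version FROM schema_migrations"))
        for version, statement in MIGRATIONS:
            if version not in applied:
                db.execute(statement)
                db.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        db.commit()

    migrate(sqlite3.connect("app.db"))   # safe to run repeatedly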
Database Resilience
Justin Sheehy, CTO of the company which develops the open-source decentralized key/value store Riak and a former architect at Akamai, gave an interesting talk on resilience in database systems. Resilience is the ability to degrade gracefully, as opposed to “failing gracefully”, which Justin thinks is not a desirable goal. The talk was, he said, generally applicable to resilient systems, but it specifically covered Riak’s implementation of the principles under discussion.
As a preface, Justin presented a framework for thinking about failure called the “harvest / yield” model. “Harvest” is the percentage of the total data set which is currently available; in a sharded database model, the failure of a shard reduces harvest. Correspondingly, “yield” is the percentage chance of getting a useful reply to any given request; the failure of a replica holding the full data set would reduce this number.
(Justin mentioned two things that flew by me here, but I wanted to mention as a reminder to look them up: one was CAP Theorem and the other was the Dynamo System from Amazon)
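To make the distinction concrete for myself, here’s a back-of-the-envelope example. The scenario (ten equal shards, uniform access, and the 5% load-shedding figure) is my own invention, not from the talk.

    total_shards = 10
    failed_shards = 1

    # Sharded, no replication: the failed shard's slice of data is unreachable, so both
    # the visible data set and the odds of answering a random request shrink together.
    harvest = float(total_shards - failed_shards) / total_shards
    yield_fraction = float(total_shards - failed_shards) / total_shards
    print("sharded, 1 of 10 down:    harvest=%.0f%%  yield=%.0f%%" % (harvest * 100, yield_fraction * 100))

    # Fully replicated pair: losing one copy leaves the whole data set available
    # (harvest stays 100%), but the survivor may shed some requests under the
    # doubled load, so yield dips instead (5% is purely illustrative).
    print("replica pair, 1 of 2 down: harvest=%.0f%%  yield=%.0f%%" % (100.0, 95.0))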
Justin walked through several failure scenarios and described Riak’s mechanism for handling them. In general the approach they take is “zooming out”. That is, take some element that is considered an entire system and zoom out a level to turn that system into just a component of a larger system, with the desirable properties of a component such as loose coupling between its peers and the ability for the system as a whole to continue functioning in event of its failure. Riak does this at each level of its architecture: block storage, storage node, cluster, and site.
I was super impressed by this and want to explore Riak more to find out if it walks the walk as well as its developer talked it up.
Puppet Meetup
Puppet Labs hosted a talk and get-together after the conclusion of the program on Wednesday. I attended and had good conversations with James Turnbull, Jeff McCune, and Luke Kanies about various aspects of Puppet that we’re using or would like to use. Specifically I raised the issue of AAA (Authentication, Authorization, and Accounting) in MCollective, and how integrating role-based authentication against LDAP is the gating factor for rolling MCollective out to production. Jeff agreed to do some research on the matter and get back to us so we can work together when he’s out for training at the end of the month.
Thus endeth Wednesday.
Devopsdays
After a day back in the office Thursday, I headed up to LinkedIn headquarters on Friday for Devopsdays. This is an ‘unconference’ that’s only two years old, piggybacking on the presence of a lot of folks in town for Velocity to bring together many of the primary implementors of the emerging “devops” movement. Devops, as a brief backgrounder, is a philosophical approach to operations which attempts to bridge the traditional IT gap between development and operations by mixing dev into ops (for instance the concept of treating “infrastructure as code”, with the unit/regression/behaviour tests and revision control methodology more commonly associated with software development) and ops into dev (by bringing an operational obsession with monitoring, metrics and automation to developers’ workflow).
Most of the day’s discussions were in a relaxed panel format, where between 4 and 7 key people in a particular area of specialty would take turns responding to questions from a moderator and from the audience. I mostly liked the format, though I think the moderators were too lax about letting questions wander afield and could have clamped down on the more loquacious responders to ensure that everyone got equal time. But overall it was a welcome change from the canned presentation format because, concomitant with the open-source nature of the movement itself, it encouraged audience participation and in fact removed the traditional barrier between presenter and audience (a barrier hugely in evidence at Velocity, where the stage lighting made it difficult for presenters to see the people asking questions).
State of the Devops Union
John Willis from DTO Solutions gave the opening talk to sum up the recent, brief history of devops as an identifiable trend in IT. He traced its origins from three threads: lean startup, an offshoot of Velocity Conf itself, and Agile Infrastructure. The “lean startup” concept comes from a book by Steven Gary Blank called “The Four Steps to the Epiphany” which, filtered through startup culture and codified by Eric Ries, provides a framework for moving quickly, deploying frequently (rather than shrink-wrapping software for slow release cycles) and iterating often. From Velocity came the widespread awareness that, as John put it, “being good at ops is REALLY FREAKIN’ IMPORTANT to your business!” Finally, there’s a strong element of Agile development practice translated into the infrastructure problem space that led to common adoption of technologies like Ruby, continuous integration, and Test- and Behaviour-Driven Development.
John characterized devops as having four main pillars that support it: Culture, Automation, Measurement and Sharing. He emphasised the importance of sharing (as in, sharing ideas online and in the conference itself and the open source codebase of its main tools) as being critical to continued success.
Packaging Panel
The first panel discussion was on packaging and its proper role in agile infrastructure, with positions amongst the panelists diverse enough to strike some sparks during the conversation. Jordan Sissel wrote FPM to address the problem of vendor-provided packages doing more stuff than just delivering files. FPM will easily build a no-frills RPM, Deb, or Solaris package out of a tarball or compilation root, so you can have native package dependencies but not, say, manage services and add users, which Jordan feels are actions in the domain of config management. Josh Timberman from Opscode took this a step further and prefers to use Chef to deliver ‘./configure; make; make install’ type instructions. Phil Hollenback from Yahoo has a package-everything type of system but is “over” it; Noah Campbell is writing a book on using RPM ubiquitously and took the (rather extreme) position that everything, even quick-moving config files, can be versioned and delivered by yum.
Phil had a good closing point, derived from his experience running tens-of-thousands of servers at Y!, which I’ll just quote here as I wrote it down: “Just do the stupidest, simplest thing possible. Any time somebody tried to do something clever, it screwed up.”
XXX To investigate: A NetBSD guy gave a quick pop-up shout to their simple cross-platform packaging format (whose name I failed to write down), and someone mentioned Murder for replicating package repositories.
Orchestration Panel
One of the emerging problems in large-scale distributed systems is that of orchestration: coordinating complicated activities across a group of systems. This panel included James Turnbull from Puppet Labs talking about MCollective, Yan Pujante (one of the founders of LinkedIn) talking about his Glu framework, John Vincent describing Noah, Alex Honor from DTO Solutions talking about Rundeck, and Michael Hale from Heroku.
Each tool has its own approach but in general the idea of orchestration involves dynamic topology changes rather than changes to an individual host (the term ‘dynamic topology’ is from Yan). You might need orchestration if the state you care about relies on information outside any given host; another way to put it is that orchestration manipulates live state of the service rather than static config.
Some notable quotes from this discussion:
“Scaling up is easy, scaling down is hard.”
“Nodes are a dead concept. We need to get over the idea of individual hosts. What matters is the service.” — James Turnbull
“The idempotent model is good practice for orchestration just as it is for configuration management. Check your preconditions, execute, check results” — Alex Honor (I really liked this concise model of idempotence)
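I liked that enough to sketch the shape of it. The ensure_symlink example is my own, purely illustrative of the three-phase pattern rather than any panelist’s tool or API.

    import os

    def ensure_symlink(target, link):
        """Return True if a change was made, False if already in the desired state."""
        # 1. Check preconditions: is the desired state already true?
        if os.path.islink(link) and os.readlink(link) == target:
            return False                      # nothing to do; safe to run again and again
        # 2. Execute the change.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)
        # 3. Check results: verify we actually reached the desired state.
        if os.readlink(link) != target:
            raise RuntimeError("postcondition failed for %s -> %s" % (link, target))
        return True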
Metrics Panel
This panel was primarily concerned with describing the advancing state of the art in the areas of graphing, trending, and monitoring. My notes are somewhat spotty here because it was nearing lunchtime, but I pulled a couple of interesting points out.
Patrick Debois described a very familiar scenario in which duplicate, parallel monitoring systems arise, split between ops and dev, not through any maliciousness or ill will but just because that’s “the way it is”. We have this situation now and I don’t know how to resolve it.
Etsy open-sourced their ‘statsd’ package, which they use everywhere to monitor “everything, everywhere, all the time”. John Allspaw and Laurie Denness described the centrality of the tool to their environment: devs instrument their code with it, ops builds dashboards and collates the data, support watches metrics like the frequency of forum posts to look out for trouble. As soon as a new metric shows up via statsd, the backend begins graphing it, without human intervention.
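The wire protocol is simple enough that instrumenting anything takes a couple of lines. Here’s a rough sketch: the “name:value|type” strings are statsd’s plaintext protocol, while the metric names and the collector address are invented for illustration.

    import socket
    import time

    STATSD_ADDR = ("127.0.0.1", 8125)   # assumed local statsd collector
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(metric, n=1):
        # Counters are sent as "name:value|c"
        sock.sendto(("%s:%d|c" % (metric, n)).encode(), STATSD_ADDR)

    def timing(metric, ms):
        # Timers are sent as "name:value|ms"
        sock.sendto(("%s:%d|ms" % (metric, ms)).encode(), STATSD_ADDR)

    # Example: count an event and record how long the surrounding work took.
    start = time.time()
    incr("forum.posts.created")
    timing("forum.post_render_time", (time.time() - start) * 1000)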
Great audience question: “What are the key metrics for YOUR service? What’s the first thing you look at in the morning to tell you at a glance whether things are OK or not?” XXX — I’m not sure I know what that is for my service!
Lunch Break
Just wanted to relate an interesting conversation I had over lunch. I didn’t write down his name, but someone was describing using Arduino controls to bring haptic/physical interfaces back to their cloud-based service in order to provide a tangible face for it. For instance he has a large dial hooked up to a microcontroller which you can turn in order to bring more capacity online; correspondingly there’s a 7-segment LED array which displays the current percentage of total capacity they’re running. I thought that was very cool :-)
Devops at LinkedIn
LinkedIn’s VP of Operations, perhaps extracting a fee for hosting, got a 15-minute slot after lunch to talk about their journey from a closely integrated startup mode, to an intentional split between development and ops organisations, to a recent reunification at scale. The talk was well done and their experience was close to what we’re going through at work; my only note says “Make The Managers Watch This Video!!!”
Patrick Debois on Vagrant
Patrick (who incidentally came up with the name ‘devops’) demo’ed Vagrant, a Ruby tool for quickly spinning up virtual machines. It’s built on top of VirtualBox and uses the concept of ‘base boxes’, which provide common, functional OS images that you can then customise to meet your local needs. It takes only four commands to get a VM running under Vagrant, so the barrier to entry is low; that means it can pervade the whole org and toolchain, making it easy for (say) devs to build a true multi-tier web/app/db stack on different hosts. The idea is to eliminate the “But it works on my workstation!” problem, where assumptions about the underlying infrastructure change dramatically from a developer’s sandbox to the production environment. Very cool stuff. XXX — We should provide Vagrant images captured from our kickstarted baseline using Patrick’s “veewee” tool to devs, rather than the current error-prone instructions for kickstarting VMs in the corp desktop LAN off the pre-production kickstart server.
Agile Professional Services
Josh Timberman from Opscode gave a short talk on how the team of Chef professional services consultants he manages for Opscode implements agile practice in their work. He gave a brief introduction to Agile as it has evolved for software development first, which I found pretty interesting but won’t recount here. If you’re interested, follow along on the Agile software development Wikipedia page, as that covers all the bases (his slides even included the graphics from that page).
The role of QA in Devops
Another panel discussion, for which my notes say “Lengthy and not highly interesting talk about whether and how much QA should exist”. I did transcribe a money quote from Stephen Nelson-Smith:
Most people’s conception of development is fundamentally flawed. They have this idea that developers inherently write bad code and need to be found out. Further, people think that operations can’t be trusted to do anything right. This needs to change.
One other note from this talk: one of the panelists was a founder of ‘blitz.io’ which he described as ‘a site for gamification of capacity planning’. This was baffling at the time but, looking at it now, it does indeed turn capacity planning and load testing into a sort of web-based video game. Bizarre.
Ignite talks
The day shifted gears and people queued up to give Ignite talks, which are driven by their slides auto-advancing every 15 seconds. Jordan Sissel demo’ed a very cool project called “logstash” which provides scalable centralized logging. Dominica DeGrandis talked about applying statistical analysis to downtime. She uses and teaches Kanban to study the variability around the events that make your uptime unpredictable and to analyze them using statistical methods. This seemed very helpful and (XXX) worth further investigation - http://t.co/LakijeZ. The best talk however was from Michael Nygard, who presented a case study on a project he’d worked on which went bad. Hopefully this video will be up online soon, but let me share the main take-away from it, which was a metric they used while trying to rescue an abortive launch of their site: MTBHD, or the Mean Time Between Horrifying Discoveries.
Wrap-Up
The day’s final panel discussion was on “Escaping the devops echo chamber” which, as someone adroitly pointed out, actually ended up reinforcing the echo chamber effect; it was introspective to the point of navel-gazing so I didn’t transcribe any of it. Finally the planning began for Saturday’s Open Spaces sessions, where the conference attendees sort of self-organize into working groups about topics that enough people found interesting to have a break-out session about. Puppet Camp worked like this too and they can be very good, kind of like an official “hallway track”. I couldn’t be there Saturday so I didn’t participate here but the session write-ups should be up soon.
Conclusion
The conferences, especially devopsdays, presented tons of material compressed into a very short amount of time. There’s a lot going on in the infrastructure automation space right now, and on some level it’s quite encouraging to see all this attention focused on provisioning automation and configuration management. There are a number of specific changes I’d like to start looking at, based on things I learned about:
- Implement nagios monitoring for mcollective
- Tie monitoring/alerting in QA and production closer together
- Explore how our service can degrade gracefully under overwhelming load
- Build process to follow up on corrective actions that come out of post-mortems
- Overlay the timing of code rolls and other events on trending graphs so it’s easy to answer the question “What changed right before this huge spike in service time?”
- Track down and investigate the NetBSD dude’s OS-agnostic package manager
- Figure out the answer to the question “What’s the key metric for your overall service?” If we built a high level red/green dashboard, what two or three graphs would it contain?
- Capture a freshly kickstarted Linux host into virtualbox format for use by Vagrant
- Look at Kanban for statistical analysis of downtime causes
I hope to continue my involvement with this community, as it’s helped me a lot with specific technical problems I’ve had in my job; contributing back (with code patches, mailing list participation and posting this writeup) is a way to repay the time I’ve saved in my own work and the money I’ve saved the company by implementing better practice, reducing errors and increasing predictability.
UPDATES from around the Net
- The NetBSD package system is called pkgsrc (thx Jordan)
- Joshua Timberman writes: “To be clear, I compile from source as the simplest stupidest thing that could work w/o building pkg repos for everything. / I’m not interested in duplicating the functionality [of?] distros for things that might be packaged.”