Shameful Self Promotion

Post-DevOps: ClusterOps

In the new post-DevOps land, we’ve moved past the question of how to codify infrastructure as code; system administrators are now squarely part of the development teams they enable. In the ClusterOps world, we are building tools to enable companies to treat their systems as a single entity, orchestrating various container technologies across clusters in hybrid clouds and federating them together on a grand scale.

Many small startups are seeing a great deal of benefit, since this lets them hire fewer people to manage complicated microservices platforms. On the enterprise scale, I’m seeing large companies benefit from federated clusters across hybrid clouds, which allow them to build complicated charge-back models for their internal teams and save costs by maximizing utilization of their cloud-based infrastructure.

The new ClusterOps world is defined by orchestration tools we’ve come to know very well: DC/OS, Kubernetes, and Docker Swarm. These tools simplify the orchestration of services across various containerization technologies and abstract the complexity of managing many hosts into a single pool of resources. There are many words that folks use today to describe these tools, but nobody will deny that they all look like one thing: an operating system.

In a former world, we developed tools to enable system administrators to treat their infrastructure as code. This movement redefined the system administrator as a developer. We automated large, complicated systems by defining their parts in CloudFormation templates and automating up to layer 7 with tools like Puppet, Chef and Ansible. When cloud options were not available, we built our own with systems such as OpenStack.

Entire companies were built around enabling companies to succeed in the DevOps space. Companies such as Red Hat and countless other consultancies made bets on the tools available, and leveraged them to build offerings in their vision of “the right way”.

Today, vendors in this space offer tools with overlapping features, either for free or under license. Some companies purchased the intellectual property to build it into an existing licensing model. What this conveys is that these tools were successful, so much so that companies are able to turn a tidy profit from them by making them more accessible.

In this way, those companies brought DevOps to the masses. No longer did you have to be that trick Sys Admin with code experience. Anyone could get a streamlined setup with bare metal through layer 7 automation, a fly continuous integration pipeline, and top-notch monitoring and infrastructure visibility, all for a price.

This conveys that not only were these tools successful, but the movement that brought them to the doorsteps of companies large and small was also a winning strategy. The DevOps movement won out over monolithic, siloed engineering.

ClusterOps is the next logical progression in this movement: DevOps addressed managing singular resources with code; ClusterOps addresses the complicated relationships these resources have with each other.

Tools such as DC/OS and Kubernetes allow us to treat many hosts as a singular entity. They take care of scheduling where, when and how many instances of a given service are running. They abstract away the notion of singular hosts, and replace them with a single point of contact for operators and developers. This single point of contact allows those professionals to treat these systems as a cluster; the interaction, development and operation of this pool of resources is ClusterOps.

ClusterOps vendors are already popping up everywhere, with many consultancies and developer tool companies being built every day to enable their clients with this new way. Just as the success of DevOps was signified by an explosion of support and tools for infrastructure as code, the proliferation of companies and tools for cluster operations conveys a similar movement and success story. I’m very excited about this future for the infrastructure engineering space and am looking forward to seeing how it develops.

DC/OS Day 2 Operations: Part 1 - Logging API

This blog post is the first in a three-part series on day 2 operations for DC/OS. The first part is an introduction to what we mean by “day 2 operations” and to the first piece of this product road map, our logging API. Part 2 is on metrics gathering, shipping and integrations with popular metrics analytics solutions. The final part is on debugging and how we intend to build our debugging API for executing interactive sessions from the DC/OS CLI with a running task in the cluster.

DC/OS System Health Monitoring


The Data Center Operating System (DCOS) runs on tens of thousands of hosts. Monitoring the health of all the pieces that make up this amazing application is vital to data center operators and to troubleshooting hard-to-diagnose bugs. In DCOS v1.7 we’re releasing a new system health API to monitor the core DCOS components (more on the terminology of ‘components’ later). In the future we’re hoping to expand the usage of the system health API to other metrics, as well as to expose a plugin-style architecture that allows operators to customize system health monitoring.

Building an Installer for the Data Center Operating System


The Mesosphere Data Center Operating System (DCOS) is a distributed, highly available task scheduler. It uses a number of open and closed source projects to make running and administering Apache Mesos as seamless and simple as possible. DCOS runs at scale (we have customers running production deployments of 50,000 nodes) across thousands of machines. This post covers challenges, design, and an overview of the final GUI installer we built to install an operating system for the data center.

How to Deploy Highly Scalable Systems Over SSH


The Secure Shell (SSH) is a well known utility for securely logging into remote hosts. It’s also been widely used as a secure remote execution framework. Today, well known tools in the area of configuration management leverage SSH to safely manage state across thousands of hosts. Many systems administrators reach for SSH when they have to automate execution of scripts across distributed hosts, and others use it every day to log into remote systems on cloud platforms.

Mantle: Encrypted JSON for the Marathon API

Mantle is a Go utility that wraps the POST process to Mesosphere’s Marathon API. Before, users had to store JSON with cleartext environment variables for their Docker container configuration. With Mantle, users can encrypt the values for the “env” parameters passed to Marathon using asymmetric public/private key pairs. Mantle is designed to allow operations or deployment teams to build user-level key pairs, and give the public keys to the users with the most knowledge of the application’s configuration. Those users can then encrypt the JSON values with Mantle via their public keys, and the deployment team can review the JSON and use the private key to decrypt and deploy to Marathon(s) via Mantle.
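
As a rough illustration of the JSON-handling half of that workflow (written in Python for brevity, and not taken from Mantle's own Go code), the sketch below rewrites only the values under “env” in a Marathon app definition and leaves everything else readable for review. The function name `map_env` and the `transform` callable are hypothetical; the actual public-key encryption is deliberately left to the caller and is not implemented here.

```python
import copy


def map_env(app, transform):
    """Return a copy of a Marathon app definition with `transform` applied
    to every value under "env"; all other fields are left untouched.

    `transform` is a caller-supplied function (str -> str). On the user's
    side it is assumed to encrypt with a public key and return encoded
    ciphertext; on the deployment side it would decrypt with the private key.
    """
    out = copy.deepcopy(app)  # never mutate the caller's definition
    if "env" in out:
        out["env"] = {name: transform(value) for name, value in out["env"].items()}
    return out
```

Because the variable names stay in cleartext, a reviewer can still see which settings an application expects without being able to read their values.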

Logasaurus: A CLI Utility for Elasticsearch / Logstash

Like most operations teams, at SRC:CLR we’re offloading our logs to an aggregated log solution. We use the popular ELK stack (Elasticsearch, Logstash, Kibana). I love this solution, but when it comes to simply copying and pasting log data from Kibana things get messy. When our developers need to get data quickly, it would be easier to have a CLI utility that can run the same queries than to open a browser and screen grab from Kibana.

Version Management in SOA

Service oriented architectures make deployment pipelines significantly more agile. They allow what were traditionally large, monolithic code bases to be broken down into smaller, more manageable pieces. Instead of diagnosing a single issue that affects the application as a whole, SOA allows developers to troubleshoot smaller, atomic pieces.

A Restful Haproxy Service Abstraction

A major hurdle of microservices is visibility into the versions of your deployed infrastructure. At SRC:CLR we have 7 different microservices, plus our platform, that drive our product. These services are deployed as immutable infrastructure; their IPs and configuration are fluid and changing all the time. During a deployment, we might do a canary update of our services, but having to manually query the /info endpoint across ‘n’ number of nodes, IP addresses, and dynamic management port assignments is error prone and difficult. In order to gain visibility into the currently running services, we wrote a tool that finds available services by querying our frontend and internal load balancers, then queries those running services to get their versions and displays them in a lightweight frontend.
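
The version-gathering step can be sketched roughly as follows. This is not the tool's actual code: it assumes the load balancer lookup has already produced a mapping of service name to `host:port` backends, that each backend answers `/info` over plain HTTP with a JSON object, and that the object has a `version` field. The names `fetch_info` and `collect_versions` are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_info(address, timeout=2.0):
    """GET http://<address>/info and decode the JSON body (assumed shape)."""
    with urlopen(f"http://{address}/info", timeout=timeout) as resp:
        return json.load(resp)


def collect_versions(backends, fetch=fetch_info):
    """Build {service: {address: version}} for every discovered backend.

    `backends` is the assumed discovery result: {service: [host:port, ...]}.
    `fetch` is injectable so the probing logic can be exercised offline.
    """
    jobs = [(svc, addr) for svc, addrs in backends.items() for addr in addrs]

    def probe(job):
        svc, addr = job
        try:
            version = fetch(addr).get("version", "unknown")
        except Exception as exc:
            # One unreachable node should not abort the whole report.
            version = f"error: {exc}"
        return svc, addr, version

    report = {}
    # Probe nodes concurrently; pool.map keeps results in job order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for svc, addr, version in pool.map(probe, jobs):
            report.setdefault(svc, {})[addr] = version
    return report
```

In a report like this, a canary update shows up as a service whose backends list more than one version, which is the kind of view a lightweight frontend could render directly.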