Videos uploaded by user “USENIX”
USENIX Security '18-Q: Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible?
James Mickens, Harvard University Q: Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible? A: Because Keynote Speakers Make Bad Life Decisions and Are Poor Role Models Some people enter the technology industry to build newer, more exciting kinds of technology as quickly as possible. My keynote will savage these people and will burn important professional bridges, likely forcing me to join a monastery or another penance-focused organization. In my keynote, I will explain why the proliferation of ubiquitous technology is good in the same sense that ubiquitous Venus weather would be good, i.e., not good at all. Using case studies involving machine learning and other hastily-executed figments of Silicon Valley’s imagination, I will explain why computer security (and larger notions of ethical computing) are difficult to achieve if developers insist on literally not questioning anything that they do since even brief introspection would reduce the frequency of git commits. At some point, my microphone will be cut off, possibly by hotel management, but possibly by myself, because microphones are technology and we need to reclaim the stark purity that emerges from amplifying our voices using rams’ horns and sheets of papyrus rolled into cone shapes. I will explain why papyrus cones are not vulnerable to buffer overflow attacks, and then I will conclude by observing that my new start-up papyr.us is looking for talented full-stack developers who are comfortable executing computational tasks on an abacus or several nearby sticks. View the full USENIX Security '18 program at https://www.usenix.org/usenixsecurity18/technical-sessions
Views: 112959 USENIX
Fork Yeah! The Rise and Development of illumos
Fork Yeah! The Rise and Development of illumos Bryan M. Cantrill, Joyent In August 2010, illumos, a new OpenSolaris derivative, was born. While not at the time intended to be a fork, Oracle sealed the fate of illumos when it elected to close OpenSolaris: by choosing to cease its contributions, Oracle promoted illumos from a downstream repository to the open source repository of record for such revolutionary technologies as ZFS, DTrace, and Zones. This move accelerated the diaspora of kernel engineers from the former Sun Microsystems, many of whom have landed in the illumos community, where they continue to innovate. We will discuss the history of illumos but will focus on its promising future.
Views: 159674 USENIX
Keys to SRE
Ben Treynor Presented at SREcon14
Views: 21785 USENIX
Programming Style and Your Brain
Douglas Crockford, PayPal Computer programs are the most complicated things humans make. They must be perfect, which is hard for us because we are not perfect. Programming is thought to be a "head" activity, but there is a lot of "gut" involved. Indeed, it may be the gut that gives us the insight necessary for solving hard problems. But gut messes us up when it comes to matters of style. The systems in our brains that make us vulnerable to advertising and propaganda also influence our programming styles. This talk looks systematically at the development of a programming style that specifically improves the reliability of programs. The examples are given in JavaScript, a language with an uncommonly large number of bad parts, but the principles are applicable to all languages. Douglas Crockford was born in the wilds of Minnesota, but left when he was only six months old because it was just too damn cold. He turned his back on a promising career in television when he discovered computers. He has worked in learning systems, small business systems, office automation, games, interactive music, multimedia, location-based entertainment, social systems, and programming languages. He is the inventor of Tilton, the ugliest programming language that was not specifically designed to be an ugly programming language. He is best known for having discovered that there are good parts in JavaScript. This was an important and unexpected discovery. He also discovered the JSON Data Interchange Format, the world's best loved data format.
Views: 3585 USENIX
A Security Analysis of the APCO Project 25 Two-Way Radio System
Why (Special Agent) Johnny (Still) Can't Encrypt: A Security Analysis of the APCO Project 25 Two-Way Radio System Refereed Paper presented by Matt Blaze (University of Pennsylvania) at the 20th USENIX Security Symposium (USENIX Security '11), held August 8--12, 2011, in San Francisco, CA. Awarded Outstanding Paper Authors: Sandy Clark, Travis Goodspeed, Perry Metzger, Zachary Wasserman, Kevin Xu, and Matt Blaze, University of Pennsylvania Abstract: APCO Project 25 ("P25") is a suite of wireless communications protocols used in the US and elsewhere for public safety two-way (voice) radio systems. The protocols include security options in which voice and data traffic can be cryptographically protected from eavesdropping. This paper analyzes the security of P25 systems against both passive and active adversaries. We found a number of protocol, implementation, and user interface weaknesses that routinely leak information to a passive eavesdropper or that permit highly efficient and difficult to detect active attacks. We introduce new selective subframe jamming attacks against P25, in which an active attacker with very modest resources can prevent specific kinds of traffic (such as encrypted messages) from being received, while emitting only a small fraction of the aggregate power of the legitimate transmitter. We also found that even the passive attacks represent a serious practical threat. In a study we conducted over a two year period in several US metropolitan areas, we found that a significant fraction of the "encrypted" P25 tactical radio traffic sent by federal law enforcement surveillance operatives is actually sent in the clear, in spite of their users' belief that they are encrypted, and often reveals such sensitive data as the names of informants in criminal investigations.
Views: 24144 USENIX
Comprehensive Experimental Analyses of Automotive Attack Surfaces
Refereed Paper presented by Stephen Checkoway (University of California, San Diego) at the 20th USENIX Security Symposium (USENIX Security '11), held August 8--12, 2011, in San Francisco, CA. Authors: Stephen Checkoway, Damon McCoy, Brian Kantor, Danny Anderson, Hovav Shacham, and Stefan Savage, University of California, San Diego; Karl Koscher, Alexei Czeskis, Franziska Roesner, and Tadayoshi Kohno, University of Washington Abstract: Modern automobiles are pervasively computerized, and hence potentially vulnerable to attack. However, while previous research has shown that the internal networks within some modern cars are insecure, the associated threat model — requiring prior physical access — has justifiably been viewed as unrealistic. Thus, it remains an open question if automobiles can also be susceptible to remote compromise. Our work seeks to put this question to rest by systematically analyzing the external attack surface of a modern automobile. We discover that remote exploitation is feasible via a broad range of attack vectors (including mechanics tools, CD players, Bluetooth and cellular radio), and further, that wireless communications channels allow long distance vehicle control, location tracking, in-cabin audio exfiltration and theft. Finally, we discuss the structural characteristics of the automotive ecosystem that give rise to such problems and highlight the practical challenges in mitigating them.
Views: 24601 USENIX
USENIX Security ’17 - Understanding the Mirai Botnet
Manos Antonakakis, Georgia Institute of Technology; Tim April, Akamai; Michael Bailey, University of Illinois, Urbana-Champaign; Matt Bernhard, University of Michigan, Ann Arbor; Elie Bursztein, Google; Jaime Cochran, Cloudflare; Zakir Durumeric and J. Alex Halderman, University of Michigan, Ann Arbor; Luca Invernizzi, Google; Michalis Kallitsis, Merit Network, Inc.; Deepak Kumar, University of Illinois, Urbana-Champaign; Chaz Lever, Georgia Institute of Technology; Zane Ma and Joshua Mason, University of Illinois, Urbana-Champaign; Damian Menscher, Google; Chad Seaman, Akamai; Nick Sullivan, Cloudflare; Kurt Thomas, Google; Yi Zhou, University of Illinois, Urbana-Champaign The Mirai botnet, composed primarily of embedded and IoT devices, took the Internet by storm in late 2016 when it overwhelmed several high-profile targets with massive distributed denial-of-service (DDoS) attacks. In this paper, we provide a seven-month retrospective analysis of Mirai’s growth to a peak of 600k infections and a history of its DDoS victims. By combining a variety of measurement perspectives, we analyze how the botnet emerged, what classes of devices were affected, and how Mirai variants evolved and competed for vulnerable hosts. Our measurements serve as a lens into the fragile ecosystem of IoT devices. We argue that Mirai may represent a sea change in the evolutionary development of botnets—the simplicity through which devices were infected and its precipitous growth, demonstrate that novice malicious techniques can compromise enough low-end devices to threaten even some of the best-defended targets. To address this risk, we recommend technical and nontechnical interventions, as well as propose future research directions. View the full program: https://www.usenix.org/sec17/program
Views: 6057 USENIX
GPFS Native RAID for 100,000-Disk Petascale Systems
"GPFS Native RAID for 100,000-Disk Petascale Systems", by Veera Deenadhayalan, IBM Almaden Research Center **Disclaimer: The views and opinions expressed in this video are those of the speaker(s) and do not necessarily reflect the views of the USENIX Association.** GPFS (General Parallel File System) is widely used in HPC systems, and GPFS Native RAID (GNR) is a newly added, robust RAID layer tightly integrated into GPFS. GNR effectively utilizes the multiple CPU cores of modern IO servers to eliminate the hardware cost, firmware hassles, and maintenance associated with standalone RAID controllers. To effectively deal with a 100,000-disk petascale system that is expected to experience disk failures on a daily basis, GNR uses declustered RAID to lower the impact of RAID rebuild operations, 3-fault-tolerant redundancy codes, comprehensive asynchronous disk diagnostics, and end-to-end checksum protection to meet the cost, reliability, and integrity goals of the system.
Views: 11172 USENIX
SREcon18 Americas - Real World SLOs and SLIs: A Deep Dive
Matthew Flaming and Elisa Binette, New Relic If you've read almost anything about SRE best practices, you've probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity. But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we'll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 1217 USENIX
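As a rough illustration of the SLI/SLO relationship described in the abstract above, the following Python sketch computes a request-based availability SLI and compares it with an SLO target. The "error budget" framing, the function name `availability_report`, and the example numbers are assumptions made for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good events
    error_budget: float     # allowed fraction of bad events (1 - target)
    budget_consumed: float  # share of the error budget already spent


def availability_report(good_events: int, total_events: int,
                        slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target.

    Hypothetical helper: the SLI here is simply good / total events.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_events / total_events
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 500 of them failed.
    print(availability_report(999_500, 1_000_000, slo_target=0.999))
```

A `budget_consumed` value above 1.0 would mean the SLO was missed for the measured window. Other SLI definitions (latency thresholds, per-shard availability) would need a different notion of a "good" event.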
LISA17 - Managing SSH Access without Managing SSH Keys
Niall Sheridan, Intercom Everyone uses SSH to manage their production infrastructure, but it's really difficult to do a good job of managing SSH keys. Many organisations don't know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you. With some tooling and configuration, SSH keys can be replaced with limited-use ephemeral certificates, issued centrally and with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys. This talk will cover: managing SSH keys (the bad parts); replacing SSH keys with ephemeral certificates (how and why); discussion of an implementation of a CA for SSH certificates; and a call for participation, showing the GitHub source. View the full LISA17 program: https://www.usenix.org/lisa17/program
Views: 1826 USENIX
Continuous Deployment with Ansible
Tim Gerla, AnsibleWorks Presented at the 2013 USENIX Configuration Management Summit (UCMS '13) Continuous Deployment is the natural extension of Continuous Integration: immediately deploying tested and validated code to a production environment. To achieve this goal, you'll have to use best-of-breed tools and practices. In this talk, we'll show how to use Ansible to achieve continuous deployment of software infrastructure with zero downtime (on a multi-tier application stack), integrating with tools like Jenkins, monitoring systems, and load balancers to accomplish seamless rolling updates. Ansible is an open source configuration management, software deployment, and IT orchestration framework. It is used to eliminate manual IT processes of all kinds. Ansible uses SSH by default to manage remote machines, requiring no agent installation, bootstrapping, or root level network daemons. Ansible uses a data driven automation language called playbooks, which are intended to be easy to audit and write for users of all technical levels.
Views: 78331 USENIX
SREcon19 Americas - What Breaks Our Systems: A Taxonomy of Black Swans
Laura Nolan, Slack Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time. By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 1180 USENIX
USENIX ATC '17: Visualizing Performance with Flame Graphs
Brendan Gregg, Senior Performance Architect, Netflix Flame graphs are a simple stack trace visualization that helps answer an everyday problem: how is software consuming resources, especially CPUs, and how did this change since the last software version? Flame graphs have been adopted by many languages, products, and companies, including Netflix, and have become a standard tool for performance analysis. They were published in "The Flame Graph" article in the June 2016 issue of Communications of the ACM, by their creator, Brendan Gregg. This talk describes the background for this work, and the challenges encountered when profiling stack traces and resolving symbols for different languages, including for just-in-time compiler runtimes. Instructions will be included for generating mixed-mode flame graphs on Linux, and examples from our use at Netflix with Java. Advanced flame graph types will be described, including differential, off-CPU, chain graphs, memory, and TCP events. Finally, future work and unsolved problems in this area will be discussed. View the entire USENIX ATC '17 program at https://www.usenix.org/conference/atc17/program
Views: 8659 USENIX
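To make the "stack trace visualization" idea in the abstract above concrete, here is a minimal Python sketch of the aggregation step that usually precedes drawing a flame graph: identical sampled stacks are counted ("folded"), and the width of each frame is the number of samples whose call path passes through it. The function names and the sample data are hypothetical, and the semicolon-joined representation is used here only as a convenient key; this is not the tooling shown in the talk.

```python
from collections import Counter
from typing import Iterable, List


def fold_stacks(samples: Iterable[List[str]]) -> Counter:
    """Count identical stacks; each sample lists frames root first."""
    folded = Counter()
    for frames in samples:
        folded[";".join(frames)] += 1
    return folded


def frame_widths(folded: Counter) -> Counter:
    """Samples passing through each call-path prefix (its drawn width)."""
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return widths


if __name__ == "__main__":
    # Hypothetical profiler output: three sampled stacks.
    samples = [
        ["main", "parse"],
        ["main", "parse"],
        ["main", "render", "draw"],
    ]
    for path, width in sorted(frame_widths(fold_stacks(samples)).items()):
        print(f"{path} {width}")
```

The sketch assumes frame names never contain a semicolon. A real profiler pipeline would also need symbol resolution, which the abstract names as one of the harder problems.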
SREcon16 - Performance Checklists for SREs
Brendan Gregg, Netflix There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.
Views: 9725 USENIX
The DevOps Transformation
Keynote Address at the 25th Large Installation System Administration Conference (LISA '11), by Ben Rockwood, Joyent. **Disclaimer: The views and opinions expressed in this video are those of the speaker(s) and do not necessarily reflect the views of the USENIX Association.** DevOps may be a new term, but it's not a new idea. In this session we'll deconstruct it into its three transformation phases, look back at the often referenced but rarely explained history that influences it, and see how it is a catalyst that is changing the craft of system administration.
Views: 29863 USENIX
SREcon19 Americas - How Did Things Go Right? Learning More from Incidents
Ryan Kitchens, Netflix Solely learning from failure isn't a fundamental; it's a limitation. A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure but rather the presence of adaptive capacity. Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention. • What's going on when it seems like nothing is happening? • When failure does occur, what's going to keep it from being worse? • How do teams adapt successfully when preventative techniques fail? • How should we prioritize the effort to develop systems that help us safely manage the consequences of failure? These questions cannot be answered by trying to explain the causes of failure and fixing remediation items. We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from, "Why did things go wrong?" to "How did things go right?" Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 1877 USENIX
A Study of Practical Deduplication
This refereed paper was presented by Dutch T. Meyer of Microsoft Research and the University of British Columbia and William J. Bolosky of Microsoft Research at the 9th USENIX Conference on File and Storage Technologies (FAST '11). Recipient of the Best Paper Award. Abstract: We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.
Views: 3322 USENIX
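The whole-file versus block-level comparison in the abstract above can be sketched in a few lines of Python. This is a toy model under stated assumptions: files are held in memory, blocks are fixed-size (4096 bytes by default), and SHA-256 digests stand in for content identity. The function name `dedup_sizes` and the sample data are hypothetical, and the sketch does not reproduce the chunking methods or measurements of the paper.

```python
import hashlib
from typing import Dict, Tuple


def dedup_sizes(files: Dict[str, bytes],
                block_size: int = 4096) -> Tuple[int, int, int]:
    """Return (raw, whole_file, block_level) stored bytes after dedup."""
    raw = sum(len(data) for data in files.values())

    # Whole-file dedup: keep one copy of each distinct file content.
    seen_files = {}
    for data in files.values():
        seen_files[hashlib.sha256(data).digest()] = len(data)
    whole_file = sum(seen_files.values())

    # Block-level dedup: keep one copy of each distinct fixed-size block.
    seen_blocks = {}
    for data in files.values():
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            seen_blocks[hashlib.sha256(block).digest()] = len(block)
    block_level = sum(seen_blocks.values())

    return raw, whole_file, block_level


if __name__ == "__main__":
    # Made-up contents: two identical files and one that shares a block.
    files = {
        "a": b"x" * 8192,
        "b": b"x" * 8192,
        "c": b"x" * 4096 + b"y" * 4096,
    }
    print(dedup_sizes(files))
```

In this made-up data set, whole-file deduplication removes the duplicate of `a`, while block-level deduplication additionally collapses the repeated all-`x` blocks, which is the kind of gap the study quantifies on real file systems.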
Phone Phreaks: What We Can Learn From the First Network Hackers?
Phil Lapsley, Hacker, Consultant, Entrepreneur, and Author of Exploding The Phone: The Untold Story of the Teenagers and Outlaws Who Hacked Ma Bell Presented at USENIX Security '14
Views: 3929 USENIX
I Am SysAdmin (And So Can You!)
Ben Rockwood, Joyent Presented at LISA14
Views: 11353 USENIX
LISA17 - Scalability Is Quantifiable: The Universal Scalability Law
Baron Schwartz, VividCortex @xaprb Do you know what scalability really is? It's a mathematical function that's simple, precise, and useful. REALLY useful. It describes the relationship between system performance and load. In this talk you'll learn the function (the Universal Scalability Law), how it describes and predicts system behavior you see every day, and how to use it in practice. I'll show you how to understand the function, how to capture the data you need to measure your own system's behavior (you probably already have that), and how to analyze the data with the USL. You'll leave this talk knowing exactly what scalability is and what causes non-linear scaling. There are two factors, and you'll start seeing those everywhere, too. As a result, when systems don't scale you'll know what kind of problem to look for, and you'll avoid building bottlenecks into your systems in the first place. Final note: this talk requires zero mathematical skill. View the full LISA17 program: https://www.usenix.org/lisa17/program
Views: 1429 USENIX
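The abstract above describes scalability as a function relating performance to load. As a hedged sketch, the Python below uses the commonly published form of the Universal Scalability Law, X(N) = λN / (1 + σ(N − 1) + κN(N − 1)). The parameter names (`lam`, `sigma` for contention, `kappa` for coherency or crosstalk) and the example coefficients are assumptions for illustration, not values from the talk.

```python
import math


def usl_throughput(n: float, lam: float, sigma: float, kappa: float) -> float:
    """Predicted throughput at load or concurrency n under the USL.

    lam   -- throughput of a single worker (n = 1)
    sigma -- contention (serialization) coefficient
    kappa -- coherency (crosstalk) coefficient
    """
    return lam * n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0))


def peak_concurrency(sigma: float, kappa: float) -> float:
    """Load at which predicted throughput peaks (requires kappa > 0)."""
    if kappa <= 0.0:
        raise ValueError("throughput has no finite peak when kappa <= 0")
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    # Illustrative coefficients only; real ones come from fitting measurements.
    lam, sigma, kappa = 1000.0, 0.05, 0.0005
    for n in (1, 8, 32, 64):
        print(n, round(usl_throughput(n, lam, sigma, kappa), 1))
    print("peak near N =", round(peak_concurrency(sigma, kappa), 1))
```

With both coefficients at zero the function reduces to linear scaling. A nonzero `sigma` flattens the curve and a nonzero `kappa` makes it turn downward, which is one way to read the "two factors" the abstract mentions.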
SREcon17 Europe/Middle East/Africa - OK Log: Distributed and Coördination-Free Logging
Peter Bourgon, Fastly This talk explores the motivation, design, prototype, and optimization of OK Log, a distributed and coördination-free log system for big ol' (cloud-native) clusters. We first motivate the need for such a system, setting it apart from existing products like Elasticsearch. Then, we carve out a solution in the distributed systems space, paying due homage to the old gremlins of consistency and coördination. Finally, we review the component and architecture model, and demonstrate how it copes with typical operations and failure modes. This talk is about an open-source product, but it is not a product pitch. Instead, it's meant to be a case study of a learning exercise: approaching a deceptively subtle problem domain from first principles, and using methodological software engineering to derive a solution. I hope it inspires others to reach for something more self-actualizing than the plumbing together of databases and message busses. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 2150 USENIX
NSDI '18 - Prophecy: Accelerating Mobile Page Loads Using Final-state Write Logs
James Mickens, Harvard University Web browsing on mobile devices is expensive in terms of battery drainage and bandwidth consumption. Mobile pages also frequently suffer from long load times due to high-latency cellular connections. In this paper, we introduce Prophecy, a new acceleration technology for mobile pages. Prophecy simultaneously reduces energy costs, bandwidth consumption, and page load times. In Prophecy, web servers precompute the JavaScript heap and the DOM tree for a page; when a mobile browser requests the page, the server returns a write log that contains a single write per JavaScript variable or DOM node. The mobile browser replays the writes to quickly reconstruct the final page state, eliding unnecessary intermediate computations. Prophecy’s server-side component generates write logs by tracking low-level data flows between the JavaScript heap and the DOM. Using knowledge of these flows, Prophecy enables optimizations that are impossible for prior web accelerators; for example, Prophecy can generate write logs that interleave DOM construction and JavaScript heap construction, allowing interactive page elements to become functional immediately after they become visible to the mobile user. Experiments with real pages and real phones show that Prophecy reduces median page load time by 53%, energy expenditure by 36%, and bandwidth costs by 21%. View the full NSDI '18 program: https://www.usenix.org/conference/nsdi18/technical-sessions
Views: 1414 USENIX
Operations at Twitter: Scaling Beyond 100 Million Users
Talk given by John Adams of Twitter at the 24th Large Installation System Administration Conference (LISA '10). John covered many aspects of Twitter's scaling efforts, including: * Finding the weak points in Ruby on Rails and repairing them * In-house peer-to-peer: High-speed deploys across thousands of machines in no time at all * Managing thousands of machines: Why you need a central machine database, now * User management: How do you onboard many new developers and still remain fault-tolerant? * Caching methodologies and Twitter's open source efforts * Asynchronous versus synchronous processing during request lifetime * Life after syslog: What do you do when it won't work anymore?
Views: 23596 USENIX
SREcon16 - Putting Together Great SRE Teams
Kripa Krishnan, Google What kinds of people make up a great SRE team? This talk explores whether SRE just means software/systems engineers, and what value other roles bring to a team. How can you fully utilize specialist roles and diverse skills in your SRE organization?
Views: 5275 USENIX
W32.Duqu: The Precursor to the Next Stuxnet
Eric Chien and Liam OMurchu, Symantec; Nicolas Falliere On October 14, 2011, we were alerted to a sample by the Laboratory of Cryptography and System Security (CrySyS) at Budapest University of Technology and Economics. The threat appeared very similar to the Stuxnet worm from June of 2010 [1]. CrySyS named the threat Duqu [dyü-kyü] because it creates files with the file name prefix "~DQ" [2]. We confirmed Duqu is a threat nearly identical to Stuxnet, but with a completely different purpose of espionage rather than sabotage.
Views: 3305 USENIX
Keynote Address: A Brief History of the BSD Fast Filesystem
Dr. Marshall Kirk McKusick, Author and Consultant Presented at FAST '15
Views: 2106 USENIX
SREcon16 - The Realities of the Job of Delivering Reliability
Rachel Kroll, Facebook
Views: 3863 USENIX
Apache Traffic Server: More Than Just a Proxy
"Apache Traffic Server: More Than Just a Proxy", by Leif Hedstrom, GoDaddy **Disclaimer: The views and opinions expressed in this video are those of the speaker(s) and do not necessarily reflect the views of the USENIX Association.** Apache Traffic Server is an Apache Software Foundation open project, implementing a fast, scalable, and feature-rich HTTP proxy caching server. This presentation will give a solid introduction to the software, its features and capabilities, and how to successfully deploy and use it in your applications. We will discuss several typical use cases, with example setup and configurations.
Views: 10081 USENIX
Lessons of Scale at Facebook
From the 2010 USENIX Annual Technical Conference Keynote Address, Bobby Johnson, Director of Engineering, Facebook, Inc. discusses how in just over six years Facebook has grown from an idea in a dorm room to one of the most visited sites on the Internet. This explosive growth has created enormous technical challenges. He talks about some specific technical challenges Facebook has faced and the general principles they employ when addressing problems of scale. He also discusses how they structure their engineering process and culture to stay on top of unceasing growth while still moving fast to build new products.
Views: 29085 USENIX
Towards Street-Level Client-Independent IP Geolocation
Paper presented by Yong Wang of UESTC and Northwestern University; Daniel Burgener, Marcel Flores, and Aleksandar Kuzmanovic of Northwestern University; and Cheng Huang of Microsoft Research at the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI '11). Abstract: A highly accurate client-independent geolocation service stands to be an important goal for the Internet. Despite an extensive research effort and significant advances in this area, this goal has not yet been met. Motivated by the fact that the best results to date are achieved by utilizing additional 'hints' beyond inherently inaccurate delay-based measurements, we propose a novel geolocation method that fundamentally escalates the use of external information. In particular, many entities (e.g., businesses, universities, institutions) host their Web services locally and provide their actual geographical location on their Websites. We demonstrate that the information provided in this way, when combined with network measurements, represents a precious geolocation resource. Our methodology automatically extracts, verifies, utilizes, and opportunistically inflates such Web-based information to achieve high accuracy. Moreover, it overcomes many of the fundamental inaccuracies encountered in the use of absolute delay measurements. We demonstrate that our system can geolocate IP addresses 50 times more accurately than the best previous system, i.e., it achieves a median error distance of 690 meters on the corresponding data set.
Views: 2023 USENIX
LISA18 - Solving All the Problems with systemd
Alvaro Leiva Geisse, Instagram Abstract: Often system administrators have to choose one of two options: On one end, traditional service management has a service starting with all privileges, and a full view of your system, and on the other end we have containers, with a restrictive, more controlled view of your system. But, with a modern kernel and systemd, it is no longer one or the other, but you can actually take the best of both approaches and decide which components to apply to your service. Do you like the concept of packaging dependencies of containers, but also like the idea of sharing the network with your server from a traditional service manager? Do you want to restrict the access to the files on your system from containers, but also want to be able to manage your service from your server like traditional service management allows you? It turns out that you can have it all. In this presentation I will show all the service techniques to deploy services in Linux that use and abuse systemd, from spinning up a simple service, to actually running your service isolated on a systemd container, and everything in the middle. I'll also show you how to use these features with other traditional techniques, like socket and path activation, service watchdog, scheduling tasks to be executed later on, and what happens when a service goes down. You already have systemd installed on your server... Why not take full advantage of its capacities? I love Python, I grew up in a small town in Chile and one weekend, 16 years ago, I had the flu and could not go out. I decided to learn how to code in Python and that was the beginning of the road that would move us all to Northern California so that I could join the Production Engineering team at Instagram. I also like eating and cooking (in that order). Follow: @aleivag
Views: 1609 USENIX
SREcon18 Europe - The Myth of Cloud Agnosticism
Corey Quinn, Last Week in AWS In theory, the idea of having infrastructure that can seamlessly deploy between different cloud providers is a wonderful concept. Who wouldn't love to migrate workloads seamlessly between providers for a variety of reasons? In theory, a tiger with an anger management problem is just a scaled up house-cat. This talk explores the practical reality of cloud agnosticism, with all of its warts. The financial, technical, and operational complexities introduced by multiple providers can take companies by surprise. Come explore the basic truth of "however much you hate your cloud provider, you will hate the migration process far more." View the full SREcon18 Europe Program at: https://www.usenix.org/conference/srecon18europe/program Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 533 USENIX
SREcon18 Americas - How SREs Found More than $100 Million Using Failed Customer Interactions
Wes Hummel, PayPal This talk will go into PayPal SRE's journey of using data around customer failures (outward-looking) instead of payment data (inward-looking) to become a more customer-focused company. The results not only benefitted our millions of merchants and consumers, but also benefitted our company in a significant way. Topics covered will be: the principles used when starting the initiative; the technical implementation of using Failed Customer Interactions (FCIs); the dashboards and visualizations of the data; the culture change that occurred in the company as a result of the initiative; the tactical efforts needed to get momentum behind this new way of measuring; the bad ideas and mistakes we made along the way; and the results and where we're at today. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 583 USENIX
SREcon18 Europe - SRE Theory vs. Practice: A Song of Ice and TireFire
Corey Quinn, Last Week in AWS, and John Looney, Facebook In many technical talks, you see a speaker from a renowned tech company stand up and describe a perfect utopia of an environment. You look at the perfect environment and dedicated hordes of senior engineers they describe, and you despair of ever getting to that point. Your environment looks nothing like that. Surprise—their environment doesn't really look like that either! In this talk, a speaker from an unnamed tech unicorn describes their amazing environment—and then what they just said gets translated from "thought leader" into plain English for you by an official SREcon translator. Stop feeling sad—everything is secretly terrible! View the full SREcon18 Europe Program at: https://www.usenix.org/conference/srecon18europe/program Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 1486 USENIX
Case Study: Adopting SRE Principles at StackOverflow
Tom Limoncelli, Stack Exchange, Inc. Presented at SREcon15
Views: 848 USENIX
NewSQL vs. NoSQL for New OLTP
"NewSQL vs. NoSQL for New OLTP", by Michael Stonebraker, MIT ** Disclaimer The views and opinions expressed in this video are those of the speaker(s) and do not necessarily reflect the views of the USENIX Association. ** Enterprises once used RDBMSs for online transaction processing (OLTP) applications, which we affectionately call OldSQL. New OLTP applications have greater performance requirements; in many modern applications (multiplayer games, gambling, social networks, etc.), OldSQL is cracking under the volume of interactions. I contrast two alternatives to OldSQL: NoSQL, where SQL and ACID are jettisoned for better performance; and NewSQL, where SQL and ACID are retained, and innovative architectures improve performance.
Views: 15736 USENIX
SRE Hiring
Andrew Fong, Dropbox Presented at SREcon15
Views: 1872 USENIX
SREcon17 Europe/Middle East/Africa - Incident Management and Chatops at Shopify
Daniella Niyonkuru, Shopify SREs are expected to be incident management experts. Yet, incident handling is hard, often messy, and exhausting. We encounter new incidents, look everywhere for possible explanations, sometimes tunnel in on symptoms, and, under pressure, forget some good practices. At Shopify, we care not only about handling incidents quickly and efficiently, but also about SRE well-being. We have a special IMOC (Incident Manager On Call) rotation and an incident chatbot to assist IMOCs. In this talk, I’ll first explain the IMOC role and how training SREs for this duty is essential to handling incidents well. Our chatbot assists the IMOC by reducing manual effort and context switching. We integrated the bot with our conversation tool and several third-party tools (PagerDuty, StatusPage, GitHub) to send timely reminders. It also binds the incident to a discussion channel where all communications happen, allows status page updates directly from the chat room, keeps notes and records event times, and generates service disruption content. To avoid burnout during long-running incidents, the chatbot also reaches out to other IMOCs. Our chatbot supports best practices and "streamlines" incident response. Attendees will leave with strategies for incorporating chatbots into their incident management and considerations for automating precisely and smartly. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 1140 USENIX
NSDI '18 - Azure Accelerated Networking: SmartNICs in the Public Cloud
Daniel Firestone, Microsoft Modern cloud architectures rely on each server running its own networking stack to implement policies such as tunneling for virtual networks, security, and load balancing. However, these networking stacks are becoming increasingly complex as features are added and as network speeds increase. Running these stacks on CPU cores takes away processing power from VMs, increasing the cost of running cloud services, and adding latency and variability to network performance. We present Azure Accelerated Networking (AccelNet), our solution for offloading host networking to hardware, using custom Azure SmartNICs based on FPGAs. We define the goals of AccelNet, including programmability comparable to software, and performance and efficiency comparable to hardware. We show that FPGAs are the best current platform for offloading our networking stack as ASICs do not provide sufficient programmability, and embedded CPU cores do not provide scalable performance, especially on single network flows. Azure SmartNICs implementing AccelNet have been deployed on all new Azure servers since late 2015 in a fleet of 1M hosts. The AccelNet service has been available for Azure customers since 2016, providing consistent 15μs VM-VM TCP latencies and 32Gbps throughput, which we believe represents the fastest network available to customers in the public cloud. We present the design of AccelNet, including our hardware/software co-design model, performance results on key workloads, and experiences and lessons learned from developing and deploying AccelNet on FPGA-based Azure SmartNICs. View the full NSDI '18 program: https://www.usenix.org/conference/nsdi18/technical-sessions
Views: 1894 USENIX
The Turtles Project: Design and Implementation of Nested Virtualization
Presented at the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10) held in Vancouver, BC, Canada, October 4-6, 2010. Paper authors: Muli Ben-Yehuda, IBM Research—Haifa; Michael D. Day, IBM Linux Technology Center; Zvi Dubitzky, Michael Factor, Nadav Har'El, and Abel Gordon, IBM Research—Haifa; Anthony Liguori, IBM Linux Technology Center; Orit Wasserman and Ben-Ami Yassour, IBM Research—Haifa Abstract: In classical machine virtualization, a hypervisor runs multiple operating systems simultaneously, each on its own virtual machine. In nested virtualization, a hypervisor can run multiple other hypervisors with their associated virtual machines. As operating systems gain hypervisor functionality—Microsoft Windows 7 already runs Windows XP in a virtual machine—nested virtualization will become necessary in hypervisors that wish to host them. We present the design, implementation, analysis, and evaluation of high-performance nested virtualization on Intel x86-based systems. The Turtles project, which is part of the Linux/KVM hypervisor, runs multiple unmodified hypervisors (e.g., KVM and VMware) and operating systems (e.g., Linux and Windows). Despite the lack of architectural support for nested virtualization in the x86 architecture, it can achieve performance that is within 6-8% of single-level (non-nested) virtualization for common workloads, through multi-dimensional paging for MMU virtualization and multi-level device assignment for I/O virtualization.
Views: 2875 USENIX
SREcon18 Asia/Australia - Isolation without Containers
Tyler McMullen, CTO, Fastly Software Fault Isolation, or SFI, is a way of preventing errors or unexpected behavior in one program from affecting others. Sandboxes, processes, containers, and VMs are all forms of SFI. SFI is a deeply important part of not only operating systems, but also browsers, and even server software. The ways in which SFI can be implemented vary widely. Operating systems take advantage of hardware capabilities, like the MMU (Memory Management Unit). Others, like processes and containers, use facilities provided by the operating system kernel to provide isolation. Some types of sandboxing even use a combination of the compiler and runtime libraries in order to provide safety. Each of the methods of implementing SFI has advantages and disadvantages, but we don't often think of them as different options toward a similar end goal. When we consider the growing prevalence of things like edge computing and the "Internet of Things", our common patterns start to falter. In this talk, we'll focus on how sandboxing compilers work. There are important benefits, but also major pitfalls and challenges to making it both safe and fast. We'll talk about machine code generation and optimization, trap handling, memory sandboxing, and how it all integrates into an existing system. This is all based on a real compiler and sandbox, currently in development, that is designed to run many thousands of sandboxes concurrently in server applications. View the full SREcon18 Asia/Australia Program at https://www.usenix.org/conference/srecon18asia/program Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 247 USENIX
USENIX Security '17 - CLKSCREW: Exposing the Perils of Security-Oblivious Energy Management
Adrian Tang, Simha Sethumadhavan, and Salvatore Stolfo, Columbia University Distinguished Paper Award Winner! The need for power- and energy-efficient computing has resulted in aggressive cooperative hardware-software energy management mechanisms on modern commodity devices. Most systems today, for example, allow software to control the frequency and voltage of the underlying hardware at a very fine granularity to extend battery life. Despite their benefits, these software-exposed energy management mechanisms pose grave security implications that have not been studied before. In this work, we present the CLKSCREW attack, a new class of fault attacks that exploit the security-obliviousness of energy management mechanisms to break security. A novel benefit for the attackers is that these fault attacks become more accessible since they can now be conducted without the need for physical access to the devices or fault injection equipment. We demonstrate CLKSCREW on commodity ARM/Android devices. We show that a malicious kernel driver (1) can extract secret cryptographic keys from TrustZone, and (2) can escalate its privileges by loading self-signed code into TrustZone. As the first work to show the security ramifications of energy management mechanisms, we urge the community to re-examine these security-oblivious designs. View the full program: https://www.usenix.org/sec17/program
Views: 2501 USENIX
Project Adam: Building an Efficient and Scalable Deep Learning Training System
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, Microsoft Research Presented at OSDI '14
Views: 1427 USENIX
Open Compute Project and the Changing Data Center
Ken Patchett, Facebook Presented at LISA14
Views: 29090 USENIX
GameDay: Creating Resiliency Through Destruction
Jesse Robbins, Opscode, LLC
Views: 2347 USENIX
SREcon18 Americas - Stable and Accurate Health-Checking of Horizontally-Scaled Services
Lorenzo Saino, Fastly This talk explains how Fastly built a distributed health-checking system capable of driving stable traffic allocation, while quickly and accurately identifying failures. The key intuition behind our design is that the common approach of estimating the operational readiness of a service instance based on its state alone leads to inaccurate decisions. Instead, the health of each instance should be evaluated in the context of the whole service: an instance should be classified unhealthy only if its behavior deviates significantly from other instances in a cluster. Our design borrows techniques from machine learning, signal processing and control theory to ensure overall system availability. Attendees will learn: about the challenges involved in health-checking complex services and how to make accurate, timely and stable decisions; how health-checking can be abstracted into a tractable mathematical problem that can be effectively solved by applying known tools and techniques from machine learning, signal processing and control theory; and how to implement such a system practically, what the issues are, and the tradeoffs involved. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 514 USENIX
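The idea in the health-checking abstract above, that an instance is unhealthy only if it deviates significantly from its peers, can be illustrated with a small sketch. This is not Fastly's system: the function name classify_instances, the use of error rate as the signal, the median/MAD outlier score, and the threshold values are all assumptions chosen for illustration.

```python
from statistics import median


def classify_instances(error_rates, threshold=3.5, min_spread=0.01):
    """Map each instance name to True (healthy) or False (unhealthy).

    Hypothetical sketch: an instance is judged against its peers, not
    against a fixed error-rate limit.
    """
    values = list(error_rates.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that a few outliers
    # cannot inflate the way a standard deviation can.
    mad = median(abs(v - center) for v in values)
    # Floor the spread so near-identical peers do not cause a blow-up.
    spread = max(mad, min_spread)
    health = {}
    for name, rate in error_rates.items():
        # Only deviations toward MORE errors than the peers count.
        score = (rate - center) / spread
        health[name] = score <= threshold
    return health


if __name__ == "__main__":
    # One instance misbehaves relative to the rest of the cluster.
    print(classify_instances({"a": 0.01, "b": 0.02, "c": 0.015, "d": 0.40}))
    # Every instance is equally degraded (e.g. a shared dependency is down):
    # nothing deviates from its peers, so nothing is pulled out of rotation.
    print(classify_instances({"a": 0.5, "b": 0.5, "c": 0.5}))
```

The second call shows why the peer-relative view matters: a check based on each instance's state alone would mark the whole cluster unhealthy and leave nowhere to send traffic.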
NSDI '18 - Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
David Schultz, Google, Inc. This paper presents our design and experience with Andromeda, Google Cloud Platform’s network virtualization stack. Our production deployment poses several challenging requirements, including performance isolation among customer virtual networks, scalability, rapid provisioning of large numbers of virtual hosts, bandwidth and latency largely indistinguishable from the underlying hardware, and high feature velocity combined with high availability. Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements. We introduce the Hoverboard programming model, which uses gateways for the long tail of low bandwidth flows, and enables the control plane to program network connectivity for tens of thousands of VMs in seconds. The on-host dataplane is based around a high-performance OS bypass software packet processing path. CPU-intensive per packet operations with higher latency targets are executed on coprocessor threads. This architecture allows Andromeda to decouple feature growth from fast path performance, as many features can be implemented solely on the coprocessor path. We demonstrate that the Andromeda datapath achieves performance that is competitive with hardware while maintaining the flexibility and velocity of a software-based architecture. View the full NSDI '18 program: https://www.usenix.org/conference/nsdi18/technical-sessions
Views: 1586 USENIX
SREcon18 Americas - If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There
Nicole Forsgren and Jez Humble, DevOps Research and Assessment (DORA) The best-performing organizations have the highest quality, throughput, and reliability while also delivering value. They are able to achieve this by focusing on a few key measurement principles, which Nicole and Jez will outline in this talk. These include knowing your outcome and measuring it, capturing metrics in tension, and collecting complementary measures… along with a few others. Nicole and Jez explain the importance of knowing how (and what) to measure, ensuring you catch successes and failures when they first show up, not just when they’re epic, so you can course correct rapidly. Measuring progress lets you focus on what’s important and helps you communicate this progress to peers, leaders, and stakeholders, and arms you for important conversations around targets such as SLOs. Great outcomes don’t realize themselves, after all, and having the right metrics gives us the data we need to be great SREs and move performance in the right direction. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 2527 USENIX
SREcon19 Americas - Case Study: Implementing SLOs for a New Service
Arnaud Lawson, Squarespace Implementing service level objectives (SLOs) effectively is a hard task, especially for a service which not only is new within your engineering and product organizations but also encompasses both a request-driven and a storage subsystem. In this talk, I will discuss our experience defining and measuring service level indicators (SLIs) and objectives for our Ceph Object Storage service. I will describe our approach in specifying service level indicators plus the tradeoffs and implementation decisions we made when it came to measuring various types of SLIs, including availability, latency, and durability. I will also share the lessons learned and benefits gained from our implementation. You will understand why SLOs are crucial for site reliability engineers and service users and will be given some tips on how to implement them for either a request-driven or a storage system. Sign up to find out more about SREcon at https://srecon.usenix.org
Views: 574 USENIX
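For readers new to the SLI and SLO terms in the abstract above, here is a minimal sketch of a request-based availability SLI and the error budget that follows from an SLO target. It is not Squarespace's implementation; the function names and the sample numbers are hypothetical, and the formulas are the common ratio definitions (good requests over total requests).

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully (1.0 if there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Fraction of the error budget left; negative means the SLO was missed.

    slo_target is the objective as a fraction, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed.
    good, total = 999_500, 1_000_000
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Budget left at a 99.9% SLO: {error_budget_remaining(good, total, 0.999):.1%}")
```

A latency SLI fits the same shape if "good" is redefined as requests answered under a chosen threshold; durability for a storage system usually needs a different measurement, such as periodic integrity checks on stored objects.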
A Tale of Two Erasure Codes in HDFS
Mingyuan Xia, McGill University; Mohit Saxena, Mario Blaum, and David A. Pease, IBM Research Almaden Presented at FAST '15
Views: 1976 USENIX