Check out Part 1 of our series on Fleet Management at Spotify and how we manage our software at scale.

At Spotify, we adopted the declarative infrastructure paradigm to evolve our infrastructure platform's configuration management and control plane approach, allowing us to manage hundreds of thousands of cloud resources across tens of thousands of different services at scale.

Our on-premise origins

A few years ago, Spotify underwent a transition from being entirely hosted in our own on-premise data centers to instead running on the Google Cloud Platform (GCP). As part of this transition, we employed a "lift and shift" model, where existing services, their architecture, and their relationships with the underlying infrastructure were mostly preserved.

In the on-premise world, where individual machines were usually precious pets, we relied on host-centric configuration management tools like Puppet and, in some cases, Ansible.

For example, while we were running on premises, it was considered best practice to host your own Apache Cassandra cluster, deliver messages via Apache Kafka, run custom Elasticsearch or PostgreSQL instances, or install Memcached on dedicated VMs. The cloud provides managed equivalents for all of these - Google Cloud Bigtable and Cloud Pub/Sub, or running Memcached on top of Kubernetes, for example. As we learned more about running in a cloud environment, our best practice recommendations evolved with our learning. This meant that the particular infrastructure choices developers made in the cloud varied a lot depending on the use case and team.

While our engineering teams grew at a somewhat constant rate, the amount of software being created and infrastructure being provisioned grew exponentially compared to the number of developers, a problem exacerbated by the continued growth of the company. It became increasingly common for a team of developers to own dozens to hundreds of codebases (in Spotify lingo, "components"), and through acquisitions and reorgs, transfers of ownership also became common, leaving teams owning codebases without the knowledge of the architectural and infrastructural choices that had led to their current design. As a result, the infrastructure created over the years formed a long tail - snapshots in time of whatever was considered the best practice of its day, along with the mistakes made along the way.

Sorely missing was a mechanism for bringing existing infrastructure up to our latest and greatest standards, as well as for putting our fleet into a state where the fragmentation of infrastructure choices would be low enough that our internal platform organization could support the total footprint of our infrastructure.
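At its core, the declarative paradigm replaces per-host imperative scripts with a reconciliation loop: operators declare the desired end state of their resources, and a control plane diffs that declaration against the observed state to decide what to create, update, or delete. The following is a minimal sketch of that idea only; all resource names and fields are illustrative, and this is not Spotify's actual control plane code.

```python
# Illustrative sketch of declarative reconciliation (not Spotify's actual
# implementation): desired state is declared as data, and a diff against the
# observed state yields the actions a control plane would perform.

def reconcile(desired: dict, actual: dict) -> dict:
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map a resource name to its configuration, e.g.
    {"cache-1": {"type": "memcached", "memory_mb": 1024}}.
    """
    to_create = {n: cfg for n, cfg in desired.items() if n not in actual}
    to_update = {n: cfg for n, cfg in desired.items()
                 if n in actual and actual[n] != cfg}
    to_delete = [n for n in actual if n not in desired]
    return {"create": to_create, "update": to_update, "delete": to_delete}

# Hypothetical declared state for two resources.
desired = {
    "cache-1": {"type": "memcached", "memory_mb": 2048},
    "topic-events": {"type": "pubsub", "retention_days": 7},
}
# Hypothetical observed state: one resource has drifted, one is orphaned.
actual = {
    "cache-1": {"type": "memcached", "memory_mb": 1024},
    "old-vm": {"type": "vm"},
}

plan = reconcile(desired, actual)
```

In a real system the loop runs continuously and the resulting actions become cloud provider API calls; the essential shift is that teams declare the end state once, rather than scripting imperative steps host by host.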