At Flexera, Site Reliability Engineering (SRE) is responsible for the reliability of our SaaS offerings. The team works with product development to define our Service Level Objectives (SLOs) and performs the work required to ensure we meet them. SRE teams employ agile and lean principles in a culture of constant learning and improvement.
As a Program Manager, you will be responsible for ensuring that SRE works well with Product Development teams: syncing with Product Owners to help plan for the roadmap and identify risk early, and syncing with Product Development leads to help identify pain points in our existing systems. You will also be responsible for ensuring that our current product services meet SRE standards: that appropriate metrics are in place for each product service, that our deployment pipelines are optimal, and that we adhere to any contractual and regulatory obligations we may have.
Responsibilities
- Own end-to-end availability for a product service
- Work with product service teams to establish SLIs and error budgets, and nurture an environment that appreciates the value they add
- Identify opportunities for increased monitoring capabilities (white-box & black-box)
- Identify long-term trends for product services (How is our traffic growing over time? How big is the database getting? What do our resource usage patterns look like over time?)
- Ensure that short-term hacks are replaced with long-term solutions
- Coordinate incident response as part of an on-call rotation, ensure that SREs aren't overloaded by on-call work, and continually refine the processes and tools that enable us to do incident response successfully
- Ensure that RCAs are carried out effectively and in a blame-free manner
- Attend the portfolio management team meetings to flag reliability considerations for upcoming work, and to reason about any reliability concerns from other stakeholders
- Populate the SRE backlog
- Identify requirements surrounding load testing, security testing, availability and disaster recovery
- Help mature the delivery process for teams: defining Jenkins pipelines, designing canary release deploys, building in automated fallbacks, optimizing the build chain, etc.
- Optimise product service code to ensure that it's secure, scalable and performant
- Optimise release engineering code to ensure that it's stable, repeatable and fast
- Improve the fault detection for our services
- Create dashboards which help communicate the metrics for a given product service
- Work with product owners and product engineering teams to perform capacity planning
- Work with product engineering teams to understand performance and behavior patterns
- Help carry out root cause analysis for incidents, and design solutions (both software and human processes) that will help to ensure the same problem doesn't happen in the same way again
Minimum Qualifications
Computer Science degree, or at least 2 years of related industry experience managing a mission-critical production team
Critical Skills / Competencies
- Comfortable writing code in one or more of the following languages: Python / Go / Java / C# / C / C++
- Experience working with product owners and product development to prioritise work, flag risk and identify potential production engineering issues (e.g. scalability, resiliency, performance)
- A positive attitude and willingness to learn
- Experience with IaaS and Serverless services from a cloud provider
- An understanding of TCP/IP and DNS, and experience designing networks
- Linux system administration experience
- Strong conflict resolution competence
- Excellent written and verbal communication skills
- Experience implementing fault detection, and automating fixes
- An understanding of a range of data storage technologies, including SQL databases
- Experience designing scalable services
- Experience designing distributed, fault-tolerant systems
- Experience managing services in AWS
- Detail-oriented: the ideal candidate naturally digs as deep as needed to understand the why
Bonus Skills
The following items are not prerequisites for the role, but they may give you a better idea of what you can expect to come across in your SRE – Program Manager role at Flexera:
- Python / Golang / Java / C# / C / C++ / Bash experience
- Jenkins pipelines
- MSSQL, Informix, Elasticsearch
- Terraform, Packer & Docker
- Zabbix, New Relic, ELK, Prometheus, Datadog
- Security background
This job comes with several perks and benefits
Get your caffeine fix to get you started and keep you going.
Kids are the future, go spend time with them.
We take care of you, even when you are old and wrinkly.
Social gatherings and games; hang out with your colleagues.
Time is precious. Make it count. Morning person or night owl, this job is for you.
Easy-access, treehugger-friendly workplace.