Manager, Site Reliability Engineering
At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations, including Gartner. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #8 on the Forbes Cloud 100 list, now is the time to become part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.
We champion an inclusive and diverse culture representative of the communities in which we live and serve. We also give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint.
As an SRE Lead, you will set the standard, track the deliverables, and own the outcomes of your team. You will also play a critical role in ensuring the reliability, availability, and performance of Talkdesk infrastructure and mission-critical services. You will collaborate with development teams to design, build, and maintain scalable and resilient systems. Your primary focus will be on improving and automating operations, improving observability, monitoring system health, and responding to incidents to meet telco-grade service reliability.
- Lead a team of Senior SREs in owning our operational environments.
- Organize and drive regular fire drills with engineering teams to practice and validate on-call and operational procedures.
- Own the on-call rotation for your team, including incident response, resolution, post-mortems, and system and process improvements.
- Ensure post-mortem analyses take place, participate in them, and drive mitigation actions to prevent incidents from recurring.
- Collaborate with software engineering teams to promote best practices in terms of reliability, performance, and scalability during the development and operation lifecycle.
- Continuously improve system performance and reliability through proactive system monitoring, capacity planning, and performance tuning.
- Improve and maintain system observability and alerting to identify and respond to potential issues proactively and to reduce Mean Time to Detect (MTTD).
- Guide a team in building and maintaining tools for automation, configuration management, and deployment to improve operational efficiency and reliability and to reduce Mean Time to Repair (MTTR).
- Degree in Computer Science, Information Technology, or related fields.
- 3+ years of experience as a Site Reliability Engineer or DevOps Engineer on cloud platforms (e.g. AWS, Google Cloud, Azure).
- 3+ years of Linux experience.
- 3+ years of experience with cloud platforms and container technologies (e.g. Docker, Kubernetes).
- Expertise in at least one programming language (e.g. Python, Java/Spring, Ruby).
- Deep experience with automation tools (e.g. Ansible, Terraform).
- Solid understanding of networking constructs such as firewalls, application firewalls, and load balancers.
- Solid understanding of distributed systems and microservices architecture at scale.
- Experience with monitoring tools (e.g. New Relic, Prometheus, Grafana, ELK stack) and incident management systems (e.g. Splunk).
- Experience with messaging systems (e.g. Kafka, RabbitMQ, Amazon MQ).