Senior Site Reliability Engineer
At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations, including Gartner. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #8 on the Forbes Cloud 100 list, now is the time to be part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.
We champion an inclusive and diverse culture representative of the communities in which we live and serve. We also give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint.
Our Engineering team employs a microservices architecture.
As a Senior SRE, you will play a critical role in ensuring the reliability, availability, and performance of Talkdesk infrastructure and mission-critical services. You will collaborate with development teams to design, build, and maintain scalable and resilient systems. Your primary focus will be on improving and automating operations, improving observability, monitoring system health, and responding to incidents to meet telco-grade service reliability.
- Collaborate with software engineering teams to promote best practices in terms of reliability, performance, and scalability during the development and operation lifecycle.
- Continuously improve system performance and reliability through proactive system monitoring, capacity planning and performance tuning.
- Improve and maintain system observability and alerting to identify and respond to potential issues proactively and reduce MTTD.
- Build and maintain tools for automation, configuration management, and deployment to improve operational efficiency and reliability and reduce MTTR.
- Participate in post-mortem analyses and drive mitigation actions to prevent recurrence.
- Organize and drive regular fire drills with engineering teams to practice and validate on-call and operational procedures.
- Participate in the on-call rotation and respond to incidents, working towards their mitigation and resolution.
- Degree in Computer Science, Information Technology, or related fields.
- Strong experience as a Site Reliability Engineer or DevOps Engineer.
- Strong Linux experience.
- Expertise in at least one programming language (e.g. Python, Java).
- Experience with automation tools (e.g. Ansible, Terraform).
- Experience with cloud platforms (e.g. AWS, Azure) and container technologies (e.g. Docker, Kubernetes).
- Solid understanding of networking, distributed systems, and microservices architecture.
- Experience with monitoring tools (e.g. New Relic, Prometheus, Grafana, ELK stack) and incident management systems (e.g. Splunk).
- Experience with performance monitoring and tuning of databases (e.g. PostgreSQL, MongoDB, Cassandra).
- Experience with messaging systems (e.g. Kafka, RabbitMQ, Amazon MQ).