Senior Site Reliability Engineer

Talkdesk

Software Engineering
San Francisco, CA, USA
Posted on Tuesday, August 8, 2023

At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations, including Gartner. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #8 on the Forbes Cloud 100 list, now is the time to become part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.

We champion an inclusive and diverse culture representative of the communities in which we live and serve. We also give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint.

Our Engineering team employs a microservices architecture.

As a Senior SRE, you will play a critical role in ensuring the reliability, availability, and performance of Talkdesk infrastructure and mission-critical services. You will collaborate with development teams to design, build, and maintain scalable and resilient systems. Your primary focus will be on improving and automating operations, improving observability, monitoring system health, and responding to incidents to meet telco-grade service reliability.

Responsibilities:

  • Collaborate with software engineering teams to promote best practices in terms of reliability, performance, and scalability during the development and operation lifecycle.
  • Continuously improve system performance and reliability through proactive system monitoring, capacity planning and performance tuning.
  • Improve and maintain system observability and alerting to identify and respond to potential issues proactively and reduce MTTD.
  • Build and maintain tools for automation, configuration management, and deployment to improve operational efficiency and reliability and reduce MTTR.
  • Participate in post-mortem analyses and work on or drive mitigation actions to prevent recurrence.
  • Organize and drive regular fire drills with engineering teams to practice and validate on-call and operational procedures.
  • Participate in the on-call rotation and respond to incidents, working towards their mitigation and resolution.

Requirements:

  • Degree in Computer Science, Information Technology, or related fields.
  • Strong experience as a Site Reliability Engineer or DevOps Engineer.
  • Strong Linux experience.
  • Expertise in at least one programming language (e.g. Python, Java).
  • Experience with automation tools (e.g. Ansible, Terraform).
  • Experience with cloud platforms (e.g. AWS, Azure) and container technologies (e.g. Docker, Kubernetes).
  • Solid understanding of networking, distributed systems, and microservices architecture.
  • Experience with monitoring tools (e.g. New Relic, Prometheus, Grafana, ELK stack) and incident management systems (e.g. Splunk).
  • Experience with performance monitoring and tuning of databases (e.g. PostgreSQL, MongoDB, Cassandra).
  • Experience with messaging systems (e.g. Kafka, RabbitMQ, Amazon MQ).

The Talkdesk story hinges on empathy and acceptance. It is the shared goal among all Talkdeskers to empower a new kind of customer hero through our innovative software solution, and we firmly believe that the best path to success for our mission is inclusivity, diversity, and genuine acceptance. To that end, we will hire, promote, work alongside, cheer for, bond with, and warmly welcome into the Talkdesk family all persons without regard to ethnic and racial identity, indigenous heritage, national origin, religion, gender, gender identity, gender expression, sexual orientation, age, disability, marital status, veteran status, genetic information, or any other legally protected status.