Senior Site Reliability Engineer
Talkdesk
At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations including Gartner. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #8 on the Forbes Cloud 100 list, now is the time to be part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.
We champion an inclusive and diverse culture representative of the communities in which we live and serve. We also give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint.
The SRE team at Talkdesk is responsible for building, running, and maintaining the components that serve as the infrastructure foundation for the rest of Talkdesk, with an automation-first mindset, while ensuring the high availability and reliability of those components. It also partners with product engineering teams to help make their services more performant, scalable, observable, and reliable.
At Talkdesk, we believe in a “you build it, you own it” philosophy, where every engineering team is responsible for the software it builds and deploys. To support this, SREs play a critical role in ensuring that teams have the tools, practices, and expertise to make that happen in a blame-free environment.
As a Talkdesk SRE, you will work with a large, complex, distributed infrastructure that spans multiple regions and cloud providers and uses a number of leading-edge technologies. In this role, you will:
Be responsible for:
- Design, build, harden, and maintain the core infrastructure used by all of Talkdesk’s engineering teams
- Automate every aspect of our infrastructure to remove as much human intervention as possible
- Participate in design reviews of new features, products or infrastructure, in order to guarantee resilience and high availability
- Make sure the infrastructure is running smoothly by using observability tools and proactively identifying issues
- Develop effective tooling, alerts, and processes that allow engineers to maintain and support their production workloads
- Contribute to and promote the use of protocols that foster production readiness and operational excellence
- Participate in the on-call rotation for the supported infrastructure alongside all the other engineering teams
- Partner with product engineering teams to debug production outages, write incident postmortems, and carry out action items that improve service resilience
- Drive and contribute to discussions on the evolution and growth of Talkdesk’s infrastructure
Need to have:
- Experience supporting production systems
- Extensive hands-on experience working with AWS
- Good knowledge of Linux/Unix systems
- Strong programming skills in at least one scripting language (e.g. Bash, Python)
- Experience with CloudFormation, Terraform, or other infrastructure-as-code languages/tools
- Experience supporting messaging systems such as RabbitMQ or Kafka
- Extensive experience supporting data stores such as MongoDB, PostgreSQL, MySQL, Redis, Cassandra, or Elasticsearch
- Experience with Configuration Management software such as Ansible
- Experience with Monitoring Tools like Datadog, New Relic, Grafana or similar
- An understanding of the importance of observability, and a good grasp of the most critical metrics and how to measure them
- Ability to identify time-consuming or error-prone manual tasks for which it makes sense to create tooling and automation
- Ability to understand large-scale complex systems from a reliability & availability perspective
- Ability to debug complex issues and identify root causes of instability in a large-scale distributed system
- A software development mindset and the ability to apply it to infrastructure management
- Critical thinking about problems and a solution-focused approach
Be valued for:
- Experience with technologies such as Docker, Consul, Vault, Jenkins, Concourse, Prometheus, Nexus
- Experience with encryption technologies such as GoPass, ACM, KMS, and hashing
- Experience with other cloud providers such as Google Cloud Platform (GCP) or Microsoft Azure
- Experience with Java or other JVM-based development languages