Senior Site Reliability Engineer I
Talkdesk
At Talkdesk, we are courageous innovators focused on redefining the customer experience, making the impossible possible for companies globally. We champion an inclusive and diverse culture representative of the communities in which we live and serve. We also give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint. Each day, thousands of employees, customers, and partners all over the world trust Talkdesk to deliver a better way to great experiences.
We are recognized as a cloud contact center leader by many of the most influential research organizations, including Gartner and Forrester. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #16 on the Forbes Cloud 100 list, now is the time to be part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.
At Talkdesk, we embrace FAST, our fundamental operating principles that define who we are as an organization. These principles drive us to make the impossible possible. FAST: Focus + Accountability + Speed = Talkdesker.
- Focus: Focus time, energy, and attention on what is most impactful for the business, and be thoughtful about how and when to partner with others.
- Accountability: Hold self and others accountable to meet commitments and drive results. Accept responsibility for successes and failures.
- Speed: Execute with agility and urgency. Act promptly, decisively, and without delay. Make good and timely decisions that keep the organization moving forward.
- Talkdesker: YOU!
Our mission is to improve developers’ experience by giving them the tools to manage the entire software lifecycle and to be self-sufficient.
To help with this, we are building our own internal PaaS using the latest technologies, such as Kubernetes, Prometheus, and Kotlin. This platform is an important pillar of Talkdesk’s engineering effort and helps us deliver better, faster, and more reliable solutions for our customers.
Responsibilities:
- Design, build, harden, and maintain key infrastructure parts of our platform (from the lifecycle of the infrastructure to each one of our Kubernetes clusters)
- Support the processes that enable the safe upgrade and update of each component of our compute infrastructure
- Work with industry-leading GitOps tools such as Spacelift and/or Atlantis
- Help automate safe deployment practices by using industry-leading tools such as GitHub Actions, ArgoCD, Argo Rollouts, Helm Charts, etc.
- Help automate infrastructure provisioning and other engineering processes by working on automations built on top of an engineering platform written in GitHub Actions
- Coach and up-skill other engineering team members
- Solve challenging technical problems and put your skills to the test every day; see an immediate impact of your work and the value you've created for other engineers
- Automate every aspect of our infrastructure to remove as much human intervention as possible
- Develop effective tooling, alerts, and responses to both identify and address reliability risks
- Drive and promote protocols on production readiness and operational excellence
- Partner with product engineering teams to debug production outages and carry out action items to improve the reliability of those systems
- Advocate for automated testing, continuous integration and delivery, feature toggles, and progressive rollouts
- Plan for the growth of Talkdesk’s infrastructure
Skills and Qualifications:
- Understand large-scale complex systems from a reliability perspective
- Passion for producing clean, standards-compliant, secure code
- A developer mindset applied to infrastructure
- Know your way around Linux/Unix systems
- Experience with Kubernetes
- Experience with infrastructure-as-code tools like Terraform and Ansible
- Experience building software with a programming language such as Java, Kotlin, Scala, or any other JVM-based language
- Experience writing scripts to automate tasks in a language like Ruby, Python, Bash, or any other scripting language
- Experience with at least one relational and one non-relational database (e.g., PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch)
- Ability to identify time-consuming and error-prone manual tasks and then build/leverage tooling to automate them
- Ability to identify root causes of instability in a large-scale distributed system across stacks
Nice to haves / Pluses:
- Experience with cloud-based solutions such as Amazon Web Services (AWS), Google Cloud, or Microsoft Azure
- Experience with the Go programming language
Additional Notes:
This position will follow a hybrid work model.
Work Environment and Physical Requirements:
Primarily office-environment, computer-based work with extended periods of sitting or standing. Limited lifting; equipment usage is limited to computer-related equipment (keyboards, mice, etc.).