Site Reliability Engineer (Edge Services), Infrastructure Services
Job Description
About the Role
We are seeking a proactive Site Reliability Engineer to champion the evolution of our production ecosystems. In this role, you will help drive the vision for our observability, moving beyond simple uptime metrics to build a sophisticated, data-driven reliability framework. You will play a pivotal role in ensuring our services are resilient, scalable, and observable, bridging the gap between complex distributed systems and seamless user experiences.
Description
As a key member of the SRE team, your mission is to treat operations as a software problem. You will focus on designing and implementing a next-generation observability and alerting strategy that prioritizes high-cardinality data and meaningful signals over noise. You will spend your time building \\
Minimum Qualifications
- Understanding of Linux internals and deep networking expertise, including HTTP/2, HTTP/3 (QUIC), and HTTPS/TLS. You should be comfortable debugging protocol-level issues and optimizing traffic flow.
- Proven ability to automate repetitive tasks and complex workflows using Python or Go.
- Experience configuring and managing modern monitoring suites (e.g., Prometheus, Grafana, ClickHouse) with a focus on creating actionable, high-signal alerting.
- Grasp of Data Structures and Algorithms (DSA) to write efficient, performant code and troubleshoot complex system bottlenecks.
- Practical knowledge of SLIs, SLOs, Error Budgets, Release Management, and Incident Management to drive engineering priorities.
Preferred Qualifications
- Experience managing cloud environments (AWS, GCP, or Azure) using Terraform, Ansible, or Pulumi.
- Hands-on experience scaling and securing containerized workloads via Kubernetes.
- A track record of leading \\