Senior Site Reliability Engineer
XperiencOps Inc
Pleasanton, California
Job Details
Full-time
Full Job Description
The Senior Site Reliability Engineer (SRE) plays a vital role in ensuring the reliability, scalability, and performance of our enterprise software platform. This is a senior-level position that requires deep technical expertise, strong problem-solving skills, and the ability to collaborate effectively in a fast-paced, demanding environment. Our customers, the largest enterprises in the world, expect 24/7 platform availability and top-tier performance.
The ideal candidate has strong expertise in AWS cloud technologies, a deep understanding of serverless architectures (AWS Lambda), and a passion for building resilient systems to enhance the customer experience.
Platform Reliability:
- Design, implement, and manage highly available and scalable systems to meet customer expectations for 24/7 uptime.
- Monitor, troubleshoot, and resolve platform incidents using tools such as Sentry, New Relic, and custom monitoring frameworks.
- Lead post-incident reviews to ensure root causes are analyzed and preventative measures are put in place.
Automation and Optimization:
- Develop and maintain automation for infrastructure management, monitoring, and incident response.
- Optimize platform performance and scalability, proactively identifying and addressing bottlenecks.
- Contribute to the development of CI/CD pipelines to improve deployment reliability and speed.
Collaboration:
- Partner with L2 engineers to resolve complex customer issues, providing guidance and technical expertise as needed.
- Work closely with product engineering to ensure platform improvements align with customer needs.
- Actively contribute to the documentation and sharing of best practices to improve team performance and customer outcomes.
Leadership:
- Mentor junior engineers and provide technical leadership in reliability engineering.
- Drive cross-functional initiatives to improve platform stability and customer satisfaction.
Requirements
- Bachelor's degree in Computer Science or related discipline.
- 8+ years in a Site Reliability Engineering or DevOps role, with experience supporting enterprise-grade software platforms.
- 3+ years of experience in cloud services, in particular AWS.
- Experience building observability systems on New Relic, CloudWatch, or similar.
- Experience implementing rate-limiting, API gateways, and load balancing for highly available systems.
- Exposure to security best practices and compliance frameworks (e.g., SOC 2, ISO 27001).
- Proficient in infrastructure as code (IaC) using tools such as Terraform or CloudFormation.
- Hands-on experience with scripting and programming languages like Python, Go, or Bash.
- Strong troubleshooting and debugging skills.
- Excellent communication and collaboration skills.
- Experience with incident management and post-mortem practices.
Soft Skills:
- Exceptional problem-solving and critical thinking abilities.
- Strong verbal and written communication skills, with the ability to navigate ambiguity and provide clarity.
- Ability to work collaboratively in cross-functional teams under pressure.
Key Attributes:
- Reliability-Driven: Strong commitment to platform reliability and performance.
- Leadership and Mentorship: Willingness to guide and mentor less experienced team members.
- Customer-Focused: Dedication to meeting and exceeding customer expectations in a high-pressure environment.
Expectations:
- Availability to participate in a 24/7 on-call rotation.
- Ability to work in a fast-paced, ambiguous environment with rapidly changing priorities.
- Proactive approach to identifying and mitigating risks before they impact customers.
- Strong sense of accountability and ownership for platform stability and customer satisfaction.
Benefits
- Opportunity to work on cutting-edge products and make a real impact.
- Collaborative and fast-paced work environment.
- Chance to be part of a rapidly growing startup.
- Competitive salary and benefits package (health, dental, and vision insurance, paid time off, etc.).