Senior Site Reliability Engineer

XperiencOps Inc

Pleasanton, California


Job Details

Full-time


Full Job Description

The Senior Site Reliability Engineer (SRE) plays a vital role in ensuring the reliability, scalability, and performance of our enterprise software platform. This is a senior-level position that requires deep technical expertise, strong problem-solving skills, and the ability to collaborate effectively in a fast-paced, demanding environment. Our customers, the largest enterprises in the world, expect 24/7 platform availability and top-tier performance.

The ideal candidate has strong expertise in AWS cloud technologies, a deep understanding of serverless architectures (AWS Lambda), and a passion for building resilient systems to enhance the customer experience.

Platform Reliability:

  • Design, implement, and manage highly available and scalable systems to meet customer expectations for 24/7 uptime.
  • Monitor, troubleshoot, and resolve platform incidents using tools such as Sentry, New Relic, and custom monitoring frameworks.
  • Lead post-incident reviews to ensure root cause analysis and preventative measures are in place.

Automation and Optimization:

  • Develop and maintain automation for infrastructure management, monitoring, and incident response.
  • Optimize platform performance and scalability, proactively identifying and addressing bottlenecks.
  • Contribute to the development of CI/CD pipelines to improve deployment reliability and speed.

Collaboration:

  • Partner with L2 engineers to resolve complex customer issues, providing guidance and technical expertise as needed.
  • Work closely with product engineering to ensure platform improvements align with customer needs.
  • Actively contribute to the documentation and sharing of best practices to improve team performance and customer outcomes.

Leadership:

  • Mentor junior engineers and provide technical leadership in reliability engineering.
  • Drive cross-functional initiatives to improve platform stability and customer satisfaction.

Requirements

  • Bachelor's degree in Computer Science or related discipline.
  • 8+ years in a Site Reliability Engineering or DevOps role, with experience supporting enterprise-grade software platforms.
  • 3+ years of experience with cloud services, particularly AWS.
  • Experience building observability systems on New Relic, CloudWatch, or similar.
  • Experience implementing rate-limiting, API gateways, and load balancing for highly available systems.
  • Exposure to security best practices and compliance frameworks (e.g., SOC 2, ISO 27001).
  • Proficient in infrastructure as code (IaC) using tools such as Terraform or CloudFormation.
  • Hands-on experience with scripting and programming languages like Python, Go, or Bash.
  • Strong troubleshooting and debugging skills.
  • Excellent communication and collaboration skills.
  • Experience with incident management and post-mortem practices.
  • Soft Skills:
    • Exceptional problem-solving and critical thinking abilities.
    • Strong verbal and written communication skills, with the ability to navigate ambiguity and provide clarity.
    • Ability to work collaboratively in cross-functional teams under pressure.

Key Attributes:

  • Reliability-Driven: Strong commitment to platform reliability and performance.
  • Leadership and Mentorship: Willingness to guide and mentor less experienced team members.
  • Customer-Focused: Dedication to meeting and exceeding customer expectations in a high-pressure environment.

Expectations:

  • Availability to participate in a 24/7 on-call rotation.
  • Ability to work in a fast-paced, ambiguous environment with rapidly changing priorities.
  • Proactive approach to identifying and mitigating risks before they impact customers.
  • Strong sense of accountability and ownership for platform stability and customer satisfaction.

Benefits

  • Opportunity to work on cutting-edge products and make a real impact.
  • Collaborative and fast-paced work environment.
  • Chance to be part of a rapidly growing startup.
  • Competitive salary and benefits package (health insurance, dental insurance, vision insurance, paid time off, etc.)
