Senior Site Reliability Engineer - Big Data Compute Platform
Hireio, Inc.
Los Angeles, California
Responsibilities:
- Lead a global SRE team for Data Platform, distributed across the US and Singapore. Responsible for the reliability of all major data warehouse products, services, and query engines, including ClickHouse, Spark, Presto, and Doris.
- Uphold Service Level Agreements (SLAs): Ensure that all service level objectives and agreements for ByteDance's Data Platform services are met. Lead team members to respond promptly to any system outages or issues.
- Continuous Performance Optimization: Lead the team in analyzing service performance and reliability patterns in depth to identify potential bottlenecks. Implement proactive measures to prevent service disruptions. Work with development teams to optimize application performance, ensuring that services run efficiently and that resources are utilized effectively.
- Incident Management: Build a robust incident management mechanism. Lead efforts to troubleshoot and resolve...