Senior Systems Engineer – Server Provisioning & Deployment, DCS
Singapore
Regular
Operations
Job Description
The Data Center Service team supports the company's fast growth by building and operating hyperscale data centers. The team manages the end-to-end lifecycle of the server fleet, providing cloud solutions and various infrastructure services and ensuring that they are scalable and reliable.
Responsibilities:
- Large-Scale Server OS Deployment
  - Responsible for operating system deployment and delivery across large-scale IDC environments.
  - Perform OS image installation, system initialization, and customized OS provisioning for servers.
- Provisioning Platform Architecture Evolution
  - Design, develop, and continuously enhance the core architecture of hyperscale automated server provisioning platforms.
  - Drive platform scalability, reliability, and operational efficiency improvements.
- Low-Level Services & Hardware Enablement
  - Develop and maintain core backend components of the provisioning system, including PXE services, OS image management, and related infrastructure.
  - Support hardware enablement and compatibility for new server platforms and components.
- Complex Troubleshooting & AIOps Innovation
  - Investigate and resolve complex issues across the end-to-end server delivery lifecycle.
  - Explore and implement Large Language Models (LLMs) and AI Agent technologies for intelligent log analysis, root cause identification, automated troubleshooting, and self-healing systems.
- Engineering Efficiency & Security
  - Build and optimize CI/CD pipelines for infrastructure changes.
  - Strengthen lifecycle security compliance, risk mitigation, and disaster recovery capabilities.
- Hardware Validation & Delivery Assurance
  - Coordinate end-to-end server hardware validation activities to ensure delivery quality and compliance requirements are met.
- Performance Testing & Optimization
  - Lead validation and testing of critical server components, including CPUs, memory, storage devices, and GPUs.
  - Conduct single-node and cluster-level GPU performance benchmarking, stress testing, and performance tuning.
- Test Automation
  - Develop automated benchmarking and stress-testing frameworks using scripting languages to improve testing efficiency and coverage.
- Quality Analytics & Continuous Improvement
  - Perform quality analysis on large-scale server shipments.
  - Drive quality control initiatives and manage closed-loop resolution of hardware and delivery issues.
Qualifications
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, with 3+ years of experience in IT infrastructure, server operations, system engineering, or hardware validation.
- Strong understanding of data center infrastructure and operational models.
- Deep knowledge of Linux operating systems and server hardware architecture, including CPU, memory, storage, RAID, and network interface controllers.
- Solid understanding of PXE-based automated provisioning workflows and related network protocols such as DHCP, TFTP, and HTTP.
- Hands-on experience with out-of-band management technologies such as IPMI and Redfish, as well as boot architectures including BIOS and UEFI.
- Strong programming and automation skills in at least one of the following: Golang, Python, or Shell scripting.
- Familiarity with Git-based software development workflows and collaborative engineering practices, and proficiency in Linux system administration and operational troubleshooting.
Preferred Qualifications
- Experience in server hardware validation, benchmarking, and quality assurance programs.
- Deep understanding of performance characteristics and benchmarking methodologies for CPUs, memory, storage systems, and GPUs.
- Experience designing and implementing automated performance and stress-testing frameworks.
- Proven ability to conduct large-scale quality analytics and operational excellence initiatives.
- Strong documentation skills, including technical specifications, test procedures, project reports, and operational runbooks.
- Strong analytical and problem-solving skills, with the ability to independently drive complex technical investigations and solutions.
- Experience leveraging AI-assisted development and troubleshooting tools to improve engineering productivity.