Hardware/Server Engineer
Job Description
Key Responsibilities:
· Hardware Integration & Standardized Delivery: Perform rack installation and physical connection of GPU servers, liquid-cooled cabinets, and network equipment. Strictly adhere to hardware installation standards to ensure operational safety and compliance.
· Hardware Diagnostics & Maintenance: Troubleshoot and maintain core server components (motherboard, CPU, GPU, memory, HDD, PSU). Monitor hardware health status to promptly identify and mitigate potential risks.
· Liquid Cooling System Support: Collaborate with Facility Management engineers on liquid cooling system inspections and fault handling. Respond rapidly to emergencies such as leaks or blockages.
· System-Level Troubleshooting: Conduct in-depth fault diagnosis on Linux/Unix systems, produce professional hardware failure analysis reports, and provide improvement recommendations.
Qualifications:
· Skills: Proficient in Linux/Unix systems and Shell/Python scripting, with the ability to independently troubleshoot system-level issues.
· Experience: 3+ years of experience in data center hardware installation, testing, and maintenance. Experience with high-performance GPU servers (e.g., NVIDIA A100/H100/B300) or HPC cluster maintenance is preferred.
· Certifications: NVIDIA Certified Engineer (NCE) hardware certification is preferred.
· Liquid Cooling: Familiarity with liquid cooling principles; hands-on experience in disassembly, leak detection, and fluid replenishment for liquid-cooled servers is a plus.
· Project Background: Prior participation in the construction and delivery of large-scale AI Computing Centers (AICC) or Supercomputing Centers is preferred.