Act as the single point of contact (SPOC) for customers on technical matters, providing expert support across high-performance computing (HPC) and AI platforms while building strong long-term customer relationships.
Monitor and maintain critical data center infrastructure, including servers, storage, networking, operating systems, cluster management software, power, cooling, and environmental systems, to ensure reliability, availability, and security.
Troubleshoot and resolve hardware, software, operating system, firmware, network, and infrastructure issues, coordinating with vendors and internal teams to minimize downtime and service disruptions.
Manage daily system administration activities, including user access management, software deployment, OS upgrades, firmware updates, security patching, defect resolution, and ongoing platform maintenance.
Take ownership of incident and problem management by responding to system alerts, monitoring performance and system health, identifying root causes, opening vendor support tickets, and tracking issues through resolution.
Support the installation, configuration, testing, upgrade, and optimization of HPC and AI environments, helping customers continuously improve their research computing capabilities and platform performance.
Work closely with customer technology, infrastructure, security, governance, and vendor teams to deliver projects, implement controlled changes, maintain documentation, ensure compliance, and support audits and service reviews.
Provide technical guidance, knowledge transfer, and training to customers and researchers; contribute to documentation and best practices; participate in projects and customer meetings; and help customers maximize the value of their HPC resources through outstanding customer support and analytical problem-solving.