Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.
We’re ambitious, curious, and gutsy doers. We keep hierarchy low across the company and morale high in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just a job: a career-defining opportunity to be part of building something big!
Join Verda while it’s still being built - not once it’s finished.
Cash + equity compensation along with various fringe benefits (e.g., healthcare, lunch, wellbeing).
Profitable operations with rapid, sustained growth.
31 nationalities, 6 of them represented on the management team.
An opportunity to make a clear impact and work alongside world-class engineers, researchers, and partners across the global AI ecosystem.
Verda's customers run AI workloads that cannot afford to go down. Behind every SLA we sign is a data center that has to deliver it around the clock, every day of the year. We are looking for a Data Center Operations & Reliability Manager to own that promise.
You will be accountable for the operational reliability of our data center sites: committing to and following up on our SLAs, tracking and mitigating equipment downtime, running the 24/7 shift coverage of our support engineers, enforcing safety and security guidelines, and owning the incident reporting loop from first alert to closed follow-up.
Own SLA commitments and performance. Define, monitor, and report on service levels, and drive corrective action when targets are at risk.
Track equipment downtime across sites, analyze failure patterns, and lead mitigation: root cause analysis, preventive measures, and escalation with vendors where needed.
Plan and manage 24/7 shift schedules for support engineers, ensuring continuous coverage, fair rotation, and adequate staffing for planned maintenance and peak periods.
Enforce and continuously improve safety and security guidelines, ensuring all on-site work follows established protocols and compliance requirements.
Oversee incident reports end-to-end: ensure incidents are documented, communicated, followed up, and closed with root cause and prevention actions.
Report regularly to management on reliability metrics, incident trends, and operational risks.
5+ years of experience in data center operations, critical facilities, or mission-critical infrastructure environments.
Proven experience managing or scheduling teams in a 24/7 shift-based operation.
Hands-on understanding of data center infrastructure: power, cooling, and networking, and their common failure modes.
Experience with SLA management and operational reporting in a customer-facing infrastructure business.
Strong incident management skills: structured response, root cause analysis, and disciplined follow-up.
Familiarity with safety and security protocols in critical environments.
Strong written and verbal English.
Strong Plus
Experience in GPU, HPC, or hyperscale cloud environments, including high-density racks and liquid cooling.
Experience with monitoring, ticketing, and maintenance management systems (e.g., DCIM, CMMS).
Data center certifications such as CDCP or equivalent.
Experience building reliability processes from scratch in a fast-growing company.
We're building fast and this role needs the right person behind it. There's no artificial deadline, but when we find who we're looking for, we move. If this sounds like your next move, apply now.
Please submit your application through our Careers page. We don’t accept applications sent by email.