Key responsibilities
As a Site Reliability Engineer within Advanced Analytics (DA3) in the Chief Data AI Office at Allianz Partners, you will join the platform engineering team to own the reliability and operational health of the central engineering platform.
You will define and maintain service level objectives, drive incident response at the infrastructure layer, and systematically eliminate operational toil through automation.
You will work closely with Platform Engineers, Security Engineers, and incident-response leads to ensure the platform meets its reliability commitments across production workloads spanning AI services, Java APIs, and frontend applications.
In this role, your main responsibilities will be to:
- Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for senior leadership.
- Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews.
- Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across product team namespaces.
- Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
- Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
- Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
- Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
- Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
- Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with platform engineering on release gates and rollback mechanisms.
- Collaborate with incident-response leads on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.
What you bring
- 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
- Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments.
- Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
- Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
- Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
- Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
- Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship.
- Scripting and automation proficiency in Python or bash for toil elimination and operational tooling.
- Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.
Ways of Working
- Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
- Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences.
- Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.
Nice to Have
- Kubernetes certifications: CKA or CKAD.
- Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
- Exposure to chaos engineering practices and fault injection testing.
- FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload.
- Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
- Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice.
How We Hire
Allianz Partners does not accept unsolicited CVs or approaches from agencies. We only work with partners on our approved supplier lists, under contract. Any unsolicited submission will not be considered.
What we offer
Our employees play an integral part in our success as a business. We appreciate that each of our employees is unique, with unique needs and ambitions, and we enjoy being a part of their journey. We empower and encourage you in your personal and professional development, offering a large variety of courses and targeted development programs so that you can take control of it.
All that in a global environment where international mobility and career progression are encouraged. Caring for your health and wellbeing is a key priority for us. This is why we build Work Well programs to provide you with peace of mind and give you flexibility in planning and arranging a better work-life balance.
90377 | Data AI | Professional | Allianz Partners | Full-Time | Permanent
Allianz Group is one of the most trusted insurance and asset management companies in the world. Caring for our employees, their ambitions, dreams and challenges, is what makes us a unique employer. Together we can build an environment where everyone feels empowered and has the confidence to explore, to grow and to shape a better future for our customers and the world around us. At Allianz, we stand for unity: we believe that a united world is a more prosperous world, and we are dedicated to consistently advocating for equal opportunities for all. The foundation for this is our inclusive workplace, where people and performance both matter, and which nurtures a culture grounded in integrity, fairness, inclusion and trust. We therefore welcome applications regardless of ethnicity or cultural background, age, gender, nationality, religion, social class, disability or sexual orientation, or any other characteristics protected under applicable local laws and regulations. Join us. Let's care for tomorrow.