Site Reliability Engineer (SRE)
We are looking for a Middle Site Reliability Engineer (SRE) to join our team and help maintain, scale, and improve the reliability of modern cloud-native platforms. This role is ideal for an engineer with solid production experience in Kubernetes and distributed systems who enjoys solving complex reliability, performance, and operational challenges.
As an SRE, your primary focus will be ensuring platform stability, availability, and operational excellence across customer environments. You will work closely with a Tech Lead and cross-functional teams to improve system resilience, strengthen incident response processes, and continuously enhance infrastructure reliability, security, and performance.
A key part of this role is active participation in incident management, including on-call duty, investigation of P1/P2 incidents, root cause analysis, and implementation of long-term preventive measures.
Responsibilities:
* Be responsible for platform stability and ad hoc implementations on the projects you are assigned to
* Continuously evaluate infrastructure and propose improvements in security, performance, and cost optimization (with support from a Tech Lead)
* Participate in on-call for critical P1/P2 incidents
* Prepare runbooks and a knowledge base for L2 engineers
* Handle day-to-day operational and infrastructure tasks raised by customers
Experience & Qualifications:
Cloud & Infrastructure:
* Strong hands-on experience with AWS (minimum 3 years)
* Solid production experience with Kubernetes (minimum 2 years)
* Experience with Amazon EKS (preferred) or other managed Kubernetes platforms:
  * Creating and managing clusters
  * Performing cluster upgrades
  * Troubleshooting production workloads
* Infrastructure as Code using Terraform
* Strong Linux knowledge
Kubernetes & Container Ecosystem:
* Deep understanding of Kubernetes architecture (API Server, Scheduler, Controller Manager, etcd, CNI, CSI)
* Strong hands-on experience troubleshooting Kubernetes workloads (pods, nodes, networking, DNS, storage, scheduling)
* Proven ability to debug production issues (CrashLoopBackOff, OOMKilled, Pending pods, failing probes, resource pressure)
* Troubleshooting service-to-service communication issues (ClusterIP, NodePort, Ingress, LoadBalancer)
* Experience operating workloads on managed Kubernetes platforms (e.g., EKS)
* Experience managing and deploying workloads using Helm (chart development, templating, dependency management, values structuring)
* Hands-on experience with HPA (Horizontal Pod Autoscaler) and Karpenter; KEDA would be a plus
* Understanding of scaling strategies, limits, and cost implications
* Experience working with GitOps workflows
* Familiarity with ArgoCD (nice to have)
* Understanding of deployment strategies (rolling, blue/green, canary)
* Experience operating stateful workloads inside Kubernetes
* Understanding of Persistent Volumes, StorageClasses, and CSI drivers
* Experience working with databases deployed in Kubernetes (PostgreSQL, MySQL, Redis, or other stateful services)
Databases & Data Layer:
* Experience managing AWS databases: RDS (MySQL, PostgreSQL, etc.), DynamoDB
* Understanding of:
  * Backup and restore strategies
  * Scaling (read replicas, autoscaling, provisioned vs on-demand capacity)
  * Monitoring database health and performance
* Solid knowledge of SQL queries, indexing, query optimization, and execution plans
* Database configuration and tuning (connections, memory, storage engines, replication settings, parameter groups, etc.)
* Experience managing databases running inside Kubernetes (stateful workloads)
* Understanding of networking and security between applications and databases
Observability & Monitoring:
* Setting up and maintaining an observability stack:
  * Prometheus
  * Grafana
  * Alloy (or similar collectors)
  * VictoriaMetrics would be a plus
* Configuring alerting
* Working with CloudWatch metrics and logs
* Using SNS + Lambda for alert automation and routing
* Monitoring Kubernetes clusters and databases
CI/CD:
* GitHub Actions
* GitLab CI is a plus
* Implementing security scanning (e.g., Trivy) in CI pipelines
What we offer:
* Paid vacation: 16 days per year;
* Documented (16 days) and undocumented (5 days) sick leave;
* Leave for significant life events;
* Compensation for medical insurance (for our employees based in Ukraine);
* Flexible working hours (start between 8:00 and 11:00);
* Weekends according to the Ukrainian/local calendar;
* Engaging projects with opportunities for career growth;
* Certifications, Udemy courses;
* Corporate events, team-building activities;
* Gifts and recognition from the company;
* A friendly, open, and supportive team culture without excessive bureaucracy.