As a member of the Monitoring and Governance (MLOps) team, you’ll be directly responsible for the systems that enable customers to safely operate and monitor predictive and generative AI models across many different use cases. That includes detecting drift or accuracy issues in predictive models, as well as providing mechanisms for monitoring and reacting to hallucinations in generative models. The overall goal is to reduce risk for our customers and help make their AI use cases successful.

Key Responsibilities
* Adapt our MLOps offering to new technical challenges in the predictive and, particularly, generative AI landscape
* Seek, give, and receive critical feedback constructively, both formally and informally, through code reviews, pair programming, and other ad hoc collaboration
* Maintain and improve our existing tools, APIs, integrations, and tests
* Operate and troubleshoot the services we maintain in Kubernetes clusters
* Work directly with customers in workshops or escalations
* Lead your own projects with high autonomy, from technical design through implementation and release to customers
* Collaborate closely with product managers to distill requirements into technical tasks and drive technical feedback on potential solutions
* Take part in a 24/7 on-call rotation in which team members take turns handling operational incidents and key customer escalations

Knowledge, Skills & Abilities
* 4+ years of production experience writing Python applications
* CKAD-level understanding of Kubernetes
* Basic understanding of Linux system administration
* Experience with database systems such as PostgreSQL, MongoDB, or Elasticsearch
* Basic understanding of OpenTelemetry tracing, logs, and metrics
* Master’s in Computer Science (or a related field) or equivalent professional experience
* Ability to communicate clearly in English, verbally and in writing, on technical issues

Nice to Have
* Publicly reviewable contributions to relevant open source projects
* CKA or CKAD certification
* Prior experience operating or monitoring predictive or generative models
* Production experience writing Go applications
* Prior experience developing for the AWS, Azure, and/or Google Cloud platforms