Our customer is a U.S.-based data center operations company that runs a monitoring platform for tracking energy efficiency, uptime compliance, and anomaly detection across its infrastructure. The team is building an agentic AI layer on top of their existing data pipeline, and we're looking for an ML/AI Engineer to own this stream end to end.
Requirements:
● 5+ years of commercial experience as an ML or AI Engineer
● strong Python 3.11 skills: production ML serving code, async handlers
● experience with scikit-learn: Isolation Forest, model serialization, scoring threshold selection
● hands-on experience with time-series analysis: stationarity, rolling statistics, anomaly baseline establishment
● experience building agentic workflows with LangGraph: stateful agent graph design, tool node definition, memory integration
● experience with the Anthropic API (Claude) or similar LLM providers: tool_use message format, system prompt engineering, retry handling
● experience with Azure Functions: Python handler authorship, cold-start optimization
● familiarity with the Snowflake SQL API: parameterized query execution from Python
● experience with MLflow or an equivalent: experiment tracking, model registry, artifact versioning
● prompt engineering skills: system prompt design, tool description optimization, human-in-the-loop flow design
● English level: Upper-Intermediate or higher
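To give candidates a concrete sense of the anomaly detection work, here is a minimal sketch of Isolation Forest training and scoring-threshold selection with scikit-learn. The data is synthetic stand-in data (the real UPS/CRAC feature pipeline is not part of this posting); the feature names and quantile choice are illustrative assumptions, not project specifics.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for two engineered features, e.g. rolling mean and
# rolling std of UPS load (hypothetical; the real features come from the
# customer's pipeline).
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.65, scale=0.05, size=(500, 2))

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
model.fit(normal)

# score_samples returns higher values for more "normal" points.
# Threshold selection: flag the lowest 1% of training scores as anomalous
# (the 1% quantile is an assumed operating point, tuned in practice).
scores = model.score_samples(normal)
threshold = np.quantile(scores, 0.01)

spike = np.array([[1.2, 0.4]])  # reading far outside the trained baseline
is_anomaly = model.score_samples(spike)[0] < threshold
```

In production the fitted model would be serialized and the chosen threshold versioned alongside it, so that inference and rollback use a consistent pair.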
Would be a plus:
● experience with predictive maintenance in industrial or data center environments
● prior production deployment of LLM-based agents with tool use and human oversight controls
● familiarity with alternative anomaly detection approaches: LSTM autoencoders, statistical process control
Responsibilities:
● train and deploy an Isolation Forest anomaly detection model on UPS load and CRAC delta-T time-series data
● deploy model inference as an Azure Function (Python 3.11, scikit-learn) with latency SLA compliance
● version all model artifacts in Azure Blob Storage with rollback support
● design and implement a LangGraph multi-tool agent with tools for querying Snowflake, sending ops alerts, generating reports, and retrieving baselines
● implement short-term (in-process) and long-term (Snowflake memory table) memory for the agent
● build a human-in-the-loop approval step for all agent write actions, with audit logging
● develop scheduled agents (cron): daily PUE optimization report, weekly Uptime Institute compliance check, monthly energy summary
● set up event-triggered agents: anomaly webhook, alarm escalation chain, root cause analysis report
● version all agent prompts, tool definitions, and LangGraph workflow graphs in Azure DevOps
● communicate directly with the client to clarify requirements and present progress
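As an illustration of the human-in-the-loop approval requirement above, here is a framework-agnostic sketch in plain Python: every agent write action is queued as a pending request, a human decision gates execution, and both the request and the decision are audit-logged. All class and tool names here are hypothetical; the real implementation would live inside the LangGraph workflow.

```python
import time
from dataclasses import dataclass

@dataclass
class PendingAction:
    """A write action proposed by the agent, awaiting human review."""
    tool: str
    args: dict
    status: str = "pending"

class ApprovalGate:
    """Queues agent write actions until a human approves, logging every step."""

    def __init__(self):
        self.audit_log = []  # append-only record of requests and decisions

    def request(self, tool, args):
        action = PendingAction(tool, args)
        self._log("requested", action)
        return action

    def decide(self, action, approved, reviewer):
        action.status = "approved" if approved else "rejected"
        self._log(action.status, action, reviewer=reviewer)
        return action.status == "approved"

    def _log(self, event, action, **extra):
        self.audit_log.append({"event": event, "tool": action.tool,
                               "args": action.args, "ts": time.time(), **extra})

# Usage: the agent proposes an alert, a human signs off, then the tool runs.
gate = ApprovalGate()
act = gate.request("send_ops_alert",
                   {"severity": "high", "message": "CRAC delta-T drift"})
if gate.decide(act, approved=True, reviewer="ops-lead"):
    pass  # execute the actual tool call here
```

The same pattern extends naturally to rejection paths and to persisting the audit log in a durable store.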
Why Rolique?
● we believe in fairness, transparency, and helpfulness in everyday work
● your personal development is important to us: we promote internal knowledge transfer and strengthen your "zone of genius"
● 20 days of paid vacation and 5 days of sick leave
● personal budget for courses, training, and certifications
● health support and sports compensation
● accounting support