Our client is one of the leading healthcare solution providers in the United States, collaborating with over 10,000 organizations and serving more than 80 million patients.
We are building an analytical system that enables users to generate reports on clinical measures used in regulatory reporting programs. This is a high-impact product that directly contributes to improving healthcare outcomes nationwide.
We are looking for a Mid-level Data Engineer who is ready to join an Agile team, work with modern AWS-based data processing tools, and contribute to a system with real-world impact.

Responsibilities
* Deliver data transformations in PySpark (on EMR Serverless): design new transformers and enhance existing ones; confidently use the DataFrame API, joins, window functions, broadcast joins, and caching.
* Work within an event-driven architecture: understand producers/consumers, queues, idempotency, retries, and failure handling.
* Apply the testing pyramid correctly: understand the purpose and differences of unit, component, integration, and system tests.
* Add and maintain automated data-driven tests (e.g., using pytest) to validate existing and new functionality.
* Proactively improve pipelines, testing practices, and team processes without waiting for external prompts.
* Present technical results clearly to both internal and client stakeholders.
* Communicate directly with the client when needed.
* Collaborate actively with the team: participate in code reviews, knowledge sharing, and mutual support.
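To give a flavor of the event-driven responsibilities above (idempotency, retries, failure handling), here is a minimal sketch of an idempotent queue consumer. All names (`consume`, `MAX_RETRIES`, the event shape) are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of an idempotent consumer for at-least-once delivery.
# Duplicate deliveries are skipped; transient handler failures are retried.

MAX_RETRIES = 3

def consume(events, handler, processed_ids=None):
    """Process events safely under at-least-once delivery semantics."""
    processed_ids = processed_ids if processed_ids is not None else set()
    results = []
    for event in events:
        # Idempotency: a duplicate delivery of the same event is a no-op.
        if event["id"] in processed_ids:
            continue
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                results.append(handler(event))
                processed_ids.add(event["id"])
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    # In a real system the event would go to a dead-letter queue.
                    raise
    return results
```

In the actual system the deduplication set would live in durable storage (e.g., DynamoDB or an Iceberg table) rather than in memory, and SQS visibility timeouts would drive the redelivery behavior.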
Required Skills & Tech
* Strong experience with PySpark (Python 3.9+)
* AWS: Lambda, S3, SQS, EMR (including EMR Serverless), Step Functions
* Docker
* pytest (including data-driven test patterns)
* Solid Agile teamwork experience, an ownership mindset, and proactive contribution
* Clear spoken and written English for client communication and demos
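As an illustration of the data-driven test pattern mentioned above, here is a minimal sketch. The `measure_rate` function and its cases are hypothetical, invented only to show the shape of table-driven validation; in pytest the same table would typically feed `@pytest.mark.parametrize`.

```python
def measure_rate(numerator, denominator):
    """Return a clinical measure rate as a fraction rounded to 4 places (hypothetical)."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(numerator / denominator, 4)

# Data-driven cases: (numerator, denominator, expected rate).
# With pytest, each tuple would become one parametrized test case.
CASES = [
    (50, 100, 0.5),
    (1, 3, 0.3333),
    (0, 10, 0.0),
]

for num, den, expected in CASES:
    assert measure_rate(num, den) == expected
```

Keeping cases as plain data makes it cheap to add new scenarios as the clinical measures evolve, which is the point of the data-driven approach.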
Nice to Have
* Spark performance tuning, Structured Streaming
* Advanced Apache Iceberg features: time travel, partition evolution
* AWS Athena; observability with CloudWatch
* SQL optimization; data modeling in a Lakehouse architecture
* Ability to build and improve CI/CD pipelines (Jenkins) for code, data, and infrastructure
* Experience defining and evolving infrastructure with Terraform (modules, remote state, dependency ordering)
Why Join the Project?
* A meaningful product that impacts healthcare quality across the US
* Modern tech stack: PySpark, AWS Serverless, Iceberg, Terraform
* A mature Agile team with a strong engineering culture
* Direct communication with the client and the ability to influence the product
* A continuous-improvement mindset across the team