C++ Engineer — AI Runtime (Stealth Startup)

About Us
We are a stealth-mode startup building next-generation infrastructure for the AI industry. Our team has decades of experience in software, systems, and deep tech. We are working on a new kind of AI runtime that pushes the boundaries of performance and flexibility, making advanced models portable, efficient, and customizable for real-world deployment.
If you want to be part of a small, fast-moving team shaping the future of applied AI systems, this is your opportunity.

Role
We are looking for a C++ Engineer with a strong systems and GPU programming background to help extend and optimize an open-source AI inference runtime. You will work on the low-level internals of large language model serving, focusing on:

* Dynamic adapter integration (e.g., LoRA/QLoRA)
* Incremental model update mechanisms
* Multi-session inference caching and scheduling
* GPU performance improvements (Tensor Cores, CUDA/ROCm)
This is a hands-on role: you will be designing, coding, profiling, and iterating on high-performance inference code that runs directly on CPUs and GPUs.

Responsibilities
* Implement support for runtime adapter loading (LoRA), enabling models to be customized on the fly without retraining or model merges.
* Design and implement mechanisms for incremental model deltas, allowing models to be extended and updated efficiently.
* Extend the runtime to handle multi-session execution, with isolation and caching strategies for concurrent users.
* Optimize core math kernels and memory layouts to improve inference performance on CPU and GPU backends.
* Collaborate with backend and infrastructure engineers to integrate your work into APIs and orchestration layers.
* Write benchmarks, unit tests, and profiling tools to ensure correctness and measure performance gains.
* Contribute to system architecture discussions and help define the roadmap for future runtime features.
Requirements
* Strong proficiency in modern C++ (C++14/17/20) and systems programming.
* Solid understanding of low-level performance optimization: memory management, multithreading, SIMD, cache efficiency.
* Experience with CUDA and/or ROCm/HIP GPU programming.
* Familiarity with linear algebra kernels (matrix multiply, attention) and how they map to hardware acceleration (Tensor Cores, BLAS libraries, etc.).
* Exposure to machine learning inference frameworks (e.g., llama.cpp, TensorRT, ONNX Runtime, TVM, PyTorch internals) is a plus.
* Comfortable working in a Unix/Linux environment; experience with build systems (CMake, Bazel) and CI pipelines.
* Strong problem-solving and debugging skills; ability to dive deep into both code and performance traces.
* Self-motivated and able to thrive in a fast-moving startup environment.
Nice to Have
* Experience implementing LoRA or adapter-based fine-tuning in inference runtimes.
* Knowledge of quantization methods and deploying quantized models efficiently.
* Background in distributed systems or multi-GPU orchestration.
* Contributions to open-source ML/AI systems.
Why Join
* Build core IP at the intersection of AI and systems engineering.
* Work with a highly technical founding team on problems that are both intellectually challenging and commercially impactful.
* Opportunity to shape the direction of a new AI platform from the ground up.
* Competitive compensation (contract or full-time), equity potential, and flexible remote work.