For the truly ambitious data engineer. Dive into research papers, multimodal systems, production deployment patterns, evaluation frameworks, and reasoning models like DeepSeek-R1.

Month 1: Deep technical learning

Seminal papers: the ones to actually read

  • Scaling LLM Test-Time Compute (OpenAI's o1 approach) — introduced the reasoning model paradigm that defined 2025
  • The Prompt Report — the most comprehensive survey of prompting techniques; still the reference
  • BitNet b1.58 — seminal for the 1-bit quantization and on-device efficiency direction

For everything published since, Sebastian Raschka maintains the best curated reading list: 2025 H1 and 2025 H2, organized by reasoning, inference-time scaling, architectures, and training efficiency. The Latent.Space 2025 AI Engineering Reading List is a good practitioner-oriented companion.

Don't just read. Implement. GraphRAG from Microsoft is a great starting point. BitNet 1.58-bit quantization is worth exploring (1-bit models that actually work).

Multimodal systems

NVLM and Gemini 2.0 architectures are where things are heading. Build an image + text RAG system. Add voice if you're feeling adventurous.

Fast.ai's Part 2 course has you implementing Stable Diffusion from scratch. Challenging but educational. Also check out Karpathy's "Neural Networks: Zero to Hero" course.

Month 2: Production excellence

Production deployment patterns

NirDiamant's agents-towards-production repo covers everything: Docker, FastAPI, GPU scaling, security. This is the difference between a demo and something that actually ships.

Evaluation and testing frameworks

OpenAI's Evals framework is the gold standard. DeepEval and LM-Evaluation-Harness for comprehensive testing. For adversarial and security testing, PromptFoo has become the standard: YAML-configured prompt A/B testing and red-teaming in one tool. If you need rigorous, reproducible offline evaluation where auditability matters, Inspect from the UK AI Safety Institute is worth knowing. Set up red teaming before someone else does it for you.

Designing agents that know when not to act

By Month 2 you've wired up tool-calling, retrieval, and code generation. The gap between that and something you'd run against a production dbt project is not more tooling. It's a decision layer that doesn't exist in most frameworks out of the box.

The architecture that closes that gap is a three-step decision loop: Understand → Evaluate → Execute.

Understand means the agent retrieves structured context before executing. Specifically: decision history (why was this model built this way?), lineage (what does it touch downstream?), and cost attribution (what did the last run cost, and what will this change cost?). An agent without this step reasons from the SQL alone. An agent with it reasons from your data estate's actual history.

Evaluate means the agent runs two checks before acting: a confidence score against its retrieved context, and a blast radius calculation: how many downstream models, dashboards, and pipelines does this change touch? Those two numbers, combined, determine whether the agent proceeds, scopes down, or routes decisions to human review. The threshold is a design decision you make explicitly. Most teams discover during Month 2 that human-in-the-loop isn't a degraded mode. It's what makes the system trustworthy enough to automate more over time.

Execute means the agent acts within the scope the Evaluate step established, then writes the outcome back to the context layer as a structured decision trace. That trace becomes input to the next Understand step. The compound effect is real: an agent that has seen your team approve or reject 200 similar changes will calibrate its confidence scores differently than one seeing your data estate for the first time.

Most frameworks give you the tools to call functions and run queries. They don't give you the decision logic: what to check before acting, what counts as too risky to proceed autonomously, and how to capture the outcome so the next run starts smarter. That's the layer you have to build. The Understand → Evaluate → Execute loop is one concrete implementation of it, and the reason it's baked into every agent on the Altimate platform.

Month 3: Pushing boundaries

Reasoning models deep dive

In 2024, everyone was trying to replicate o1. In 2026, reasoning models are a mature category with distinct tradeoffs: o3, Gemini 2.5 Pro, DeepSeek-R1 and its distilled variants, Qwen3's hybrid thinking/non-thinking modes. The question has shifted from "can we build this?" to "when does the compute cost of chain-of-thought justify itself, and when does a smaller direct model win?" The DeepSeek-R1 paper remains the clearest technical account of how reinforcement learning produces emergent reasoning. Sebastian Raschka's Understanding Reasoning LLMs is the best single primer on the category: what they are, how they're trained, and where they break down.

Capstone projects that matter

  1. AI-First Data Platform — Redesign your entire data architecture with AI at the center
  2. Domain Expert System — Fine-tune multiple models for your industry, build evaluation frameworks
  3. Open Source Contribution — Pick a major repo and contribute. Instant credibility.