How do I set up Claude Code for a PyTorch machine learning project?

Create a CLAUDE.md at the project root specifying your Python version, CUDA version, PyTorch version, environment manager (uv, conda, venv), and ML stack (PyTorch Lightning vs vanilla, wandb vs MLflow for tracking). Set the test command to 'pytest tests/ -x -q' and the lint command to 'ruff check . && mypy src/'. Critical: specify your GPU setup — 'CUDA 12.4, RTX 4090, batch_size=32' — so Claude generates training code with correct device placement and doesn't write CPU-only code.

How do I use Claude Code for HuggingFace model fine-tuning?

Add a fine-tuning section to CLAUDE.md: 'Fine-tuning: PEFT/LoRA via the peft library. Base model: specify (e.g. mistralai/Mistral-7B-Instruct-v0.3). Training: SFTTrainer from trl. Dataset format: instruction/input/output JSON. Precision: bf16 on A100+; fp16 on V100. LoRA config: r=16, alpha=32, dropout=0.05, target_modules=["q_proj","v_proj"]. Push to HuggingFace Hub after training.' Claude will generate complete fine-tuning scripts including data preprocessing, trainer config, and evaluation.

Does Claude Code work well with Jupyter notebooks?

Claude Code operates on .py files more naturally than .ipynb files, but you can still work effectively with notebooks. One option is nbconvert: add 'jupyter nbconvert --to script notebook.ipynb' to your CLAUDE.md commands so Claude can read and edit the exported .py file. nbconvert does not convert a script back into a notebook, so for round-trip editing use Jupytext to keep a .py mirror paired with the notebook. Alternatively, tell Claude to write all reusable code in src/ Python modules and import them in the notebook; this keeps Claude Code in its comfort zone (pure Python files) while keeping your analysis in Jupyter.

How do I track ML experiments with Claude Code using MLflow or W&B?

Add experiment tracking conventions to CLAUDE.md: 'Experiment tracking: MLflow (local: mlflow ui --port 5000) or Weights & Biases (wandb.init project=your-project). Log: hyperparameters at run start, metrics every epoch, final model artifact. Tag runs with: model_type, dataset_version, git_commit_hash. Never hardcode experiment names — read from config file or CLI args.' With these rules, Claude generates training scripts that automatically log to your tracker with consistent naming conventions.
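The tagging and naming rules above can be sketched without committing to either tracker. This is a minimal illustration, not part of MLflow or W&B: the function names (`git_commit_hash`, `parse_args`, `build_run_tags`) and the default values are hypothetical, and the resulting dictionary would be handed to whichever tracker the project uses.

```python
import argparse
import subprocess


def git_commit_hash() -> str:
    """Return the short commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "unknown"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    # Required on the command line, so the name is never hardcoded in the script.
    parser.add_argument("--experiment-name", required=True)
    parser.add_argument("--model-type", default="bert-base-uncased")
    parser.add_argument("--dataset-version", default="v1")
    return parser.parse_args(argv)


def build_run_tags(args: argparse.Namespace) -> dict[str, str]:
    """Tags attached to every run: model_type, dataset_version, git_commit_hash."""
    return {
        "model_type": args.model_type,
        "dataset_version": args.dataset_version,
        "git_commit_hash": git_commit_hash(),
    }
```

Calling `parse_args(["--experiment-name", "sentiment-baseline"])` followed by `build_run_tags(args)` yields the three tags; the commit hash falls back to "unknown" when git is unavailable.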

Claude Code for AI/ML — PyTorch, LangChain & HuggingFace

Can Claude Code help build LangChain RAG pipelines?

Yes. In CLAUDE.md, specify your RAG stack: 'Vector DB: Chroma (local) or Pinecone (prod). Embeddings: OpenAI text-embedding-3-small or HuggingFace BAAI/bge-small-en-v1.5. LLM: Claude claude-sonnet-4-6 via Anthropic API. Chunking: RecursiveCharacterTextSplitter, chunk_size=1000, overlap=200. Retrieval: MMR with k=6. Chain: create_history_aware_retriever + create_retrieval_chain with chat history (ConversationalRetrievalChain is deprecated in current LangChain releases).' With these specs, Claude generates consistent RAG pipelines that match your stack rather than toy examples.
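To make the chunking numbers concrete, here is a simplified fixed-window sketch of what chunk_size=1000 with overlap=200 means. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers natural separators such as paragraphs and sentences); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 2600-character document yields three chunks starting at offsets 0, 800, and 1600.
sample = "".join(str(i % 10) for i in range(2600))
sample_chunks = chunk_text(sample)
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, which helps retrieval when an answer straddles a chunk boundary.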

AI and machine learning projects place specific demands on Claude Code: GPU-aware code generation, large dependency environments, experiment reproducibility, and frameworks like PyTorch, LangChain, and HuggingFace Transformers that evolve rapidly. Without detailed CLAUDE.md guidance, Claude may fall back on outdated API patterns. This guide covers the CLAUDE.md template for ML projects, PyTorch training workflows, LangChain RAG pipeline generation, HuggingFace fine-tuning with PEFT/LoRA, Jupyter integration, and experiment tracking patterns.

ML Project CLAUDE.md Template

# Project: [Your ML Project Name]

## Environment
- Python 3.12
- CUDA 12.4 / cuDNN 9.x (or CPU-only — specify)
- GPU: NVIDIA A100 80GB (or RTX 4090, or CPU — specify)
- Package manager: uv (uv sync to install; uv run python script.py)
- Virtual env: .venv/ (managed by uv)

## Key commands
```bash
uv run pytest tests/ -x -q          # unit tests (fast)
uv run ruff check . --fix            # lint + autofix
uv run mypy src/ --ignore-missing-imports  # type checking
uv run python -m src.train           # training entry point
uv run jupyter lab                   # Jupyter Lab server
```

## ML Stack
- Deep learning: PyTorch 2.5 (torch, torchvision, torchaudio)
- Training framework: PyTorch Lightning 2.x OR vanilla PyTorch (specify)
- LLM orchestration: LangChain 0.3 / LangGraph (if applicable)
- Transformers: HuggingFace transformers 4.x, peft, trl, datasets, tokenizers
- Vector DB: ChromaDB (local) / Pinecone (production)
- Experiment tracking: Weights & Biases (wandb) OR MLflow (specify)
- Data: pandas 2.x, polars, numpy 2.x, scikit-learn
- Visualization: matplotlib, seaborn, plotly

## Conventions
- Device placement: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- Reproducibility: set seed via torch.manual_seed + numpy.random.seed + random.seed
- Data loading: torch.utils.data.DataLoader with num_workers=4, pin_memory=True (GPU)
- Mixed precision: torch.autocast("cuda", dtype=torch.bfloat16) on A100+
- Gradient accumulation: specify accumulation_steps if batch doesn't fit in VRAM
- Model checkpointing: save best val_loss checkpoint; load with strict=False for fine-tuning
- No global state: pass config dataclass/dict to functions; never rely on globals

## Project structure
- src/
  - data/        — dataset classes, preprocessing, augmentation
  - models/      — model architectures
  - training/    — training loops, evaluation, loss functions
  - inference/   — inference pipelines, serving
  - utils/       — metrics, logging, visualization helpers
- configs/       — YAML experiment configs (Hydra or plain YAML)
- notebooks/     — exploratory analysis (Jupytext-synced to scripts/)
- tests/         — unit tests for data processing and model components
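As one possible reading of the "No global state" and "Gradient accumulation" conventions in the template, a frozen config dataclass can carry every setting and derive the accumulation steps. The class and field names below (`TrainConfig`, `target_batch_size`, `micro_batch_size`) are hypothetical and the defaults are placeholders.

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    seed: int = 42
    lr: float = 2e-5
    target_batch_size: int = 32   # batch size the optimizer should effectively see
    micro_batch_size: int = 8     # largest batch that fits in VRAM
    precision: str = "bf16"

    @property
    def accumulation_steps(self) -> int:
        """Micro-batches to accumulate before each optimizer step."""
        return math.ceil(self.target_batch_size / self.micro_batch_size)


def describe_run(cfg: TrainConfig) -> dict:
    """Receives the config explicitly instead of reading module-level globals."""
    summary = asdict(cfg)
    summary["accumulation_steps"] = cfg.accumulation_steps
    return summary


cfg = TrainConfig()
run_summary = describe_run(cfg)  # includes accumulation_steps == 4 for 32 / 8
```

Because the dataclass is frozen, a function cannot silently mutate shared settings; a variant run is created with `dataclasses.replace(cfg, lr=1e-5)` instead.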

Automated Test & Lint Hooks

File: .claude/settings.json
{
  "permissions": {
    "allow": ["Edit", "Write", "Bash", "Read", "Glob", "Grep"]
  },
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "cd \"$CLAUDE_PROJECT_DIR\" && uv run ruff check . 2>&1 | tail -15 && uv run pytest tests/ -x -q --tb=short 2>&1 | tail -20"
          }
        ]
      }
    ]
  }
}

For GPU code, test the forward pass and loss computation in unit tests with a tiny synthetic batch (batch_size=2, seq_len=16) so tests run on CPU in CI without needing a GPU.

PyTorch Training Workflows

Pattern	CLAUDE.md instruction
Training loop	"Training loop in src/training/trainer.py; separate train_epoch() and eval_epoch() functions; return metrics dict"
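Mixed precision	"Wrap only the forward pass and loss computation in torch.autocast('cuda', dtype=torch.bfloat16); add torch.amp.GradScaler only when training in fp16 (bf16 does not need it); keep backward() and optimizer.step() outside the autocast block"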
Gradient clipping	"torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before optimizer.step()"
Checkpointing	"Save checkpoint dict: {'model': state_dict, 'optimizer': state_dict, 'epoch': n, 'val_loss': loss, 'config': cfg}"
Multi-GPU	"Wrap model in torch.nn.parallel.DistributedDataParallel; use torchrun for launch; DistributedSampler for DataLoader"
Dataset	"Extend torch.utils.data.Dataset with __len__ and __getitem__; apply augmentations only in training split"

New model training script

claude "add a training script for a text classification model.
Model: BERT-base-uncased fine-tuned for 3-class sentiment (positive/neutral/negative).
Dataset: CSV with 'text' and 'label' columns; split 80/10/10 train/val/test.
Training: AdamW optimizer, lr=2e-5, linear warmup 10%, cosine decay.
Mixed precision: bfloat16 on CUDA.
Log to W&B: epoch, train_loss, val_loss, val_accuracy, val_f1.
Save best checkpoint by val_f1.
Entry point: uv run python -m src.train --config configs/sentiment.yaml"
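The "linear warmup 10%, cosine decay" schedule in this prompt can be written as a plain function of the step index. This is a sketch under the assumption that the schedule decays to zero by the final step; the name `lr_factor` is hypothetical. The returned multiplier is applied to the base learning rate (2e-5 in the prompt), for example through a lambda-based scheduler such as torch.optim.lr_scheduler.LambdaLR.

```python
import math


def lr_factor(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """Multiplier on the base learning rate: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


base_lr = 2e-5
# Learning rate halfway through the decay phase of a 1000-step run.
mid_decay_lr = base_lr * lr_factor(550, total_steps=1000)
```

Keeping the schedule as a pure function makes it easy to unit test on CPU, independent of any optimizer.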

LangChain RAG Pipeline

# CLAUDE.md addition for LangChain/RAG

## LangChain / RAG stack
- LangChain version: 0.3 (use langchain-core, langchain-community packages)
- LLM: Anthropic Claude (langchain-anthropic) — model: claude-sonnet-4-6
- Embeddings: OpenAI text-embedding-3-small OR HuggingFace BAAI/bge-small-en-v1.5
- Vector DB: Chroma (local dev) / Pinecone (production)
- Chunking: RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
- Retrieval: MMR (search_type="mmr") with k=6, fetch_k=20
- Chain type: create_history_aware_retriever + create_retrieval_chain with chat history (replaces the deprecated ConversationalRetrievalChain)
- Async: use ainvoke() for all chain calls in production (FastAPI context)
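For readers unfamiliar with the MMR retrieval setting in this stack, the selection rule can be sketched in plain Python. This is an illustration of maximal marginal relevance in general, not LangChain's implementation: the vector store first returns fetch_k candidates, and MMR then picks k of them, trading relevance to the query against similarity to documents already chosen. The helper names and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=6, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(deg: float) -> list[float]:
    """2-D unit vector at the given angle, used here as a toy embedding."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(0)
candidates = [unit(10), unit(12), unit(-40)]  # the first two are near-duplicates
diverse = mmr_select(query, candidates, k=2)                           # [0, 2]
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)   # [0, 1]
```

With the balanced setting, the near-duplicate at index 1 is skipped in favour of the less similar document at index 2; with lambda_mult=1.0 the rule degenerates to plain similarity ranking.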

claude "add a RAG pipeline for a PDF Q&A system.
Ingestion: load PDFs from data/documents/ using PyPDFLoader; chunk with RecursiveCharacterTextSplitter.
Embedding: HuggingFace BAAI/bge-small-en-v1.5 (local, no API cost).
Vector store: Chroma persisted to .chroma_db/.
Retrieval: MMR with k=6.
Chain: create_history_aware_retriever + create_retrieval_chain with Claude claude-sonnet-4-6, system prompt in prompts/qa_system.txt.
Expose as FastAPI endpoint POST /chat with {question: str, chat_history: list}.
Include ingestion script: uv run python -m src.ingest --docs-dir data/documents/"

HuggingFace Fine-tuning with PEFT/LoRA

# CLAUDE.md addition for HuggingFace fine-tuning

## Fine-tuning conventions
- Framework: trl.SFTTrainer (supervised fine-tuning)
- PEFT method: LoRA via peft library
- Base model: specify (e.g. mistralai/Mistral-7B-Instruct-v0.3)
- Dataset format: HuggingFace Dataset with 'text' column (formatted instruction + response)
- LoRA config: r=16, lora_alpha=32, lora_dropout=0.05, bias="none"
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"] for most models
- Training: bf16=True (A100), per_device_train_batch_size=4, gradient_accumulation_steps=4
- Hub: push adapter to HuggingFace Hub after training (push_to_hub=True)
- Merge: optionally merge adapter into base model for deployment
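The numbers in these conventions are easier to reason about with a little arithmetic. The sketch below computes how many trainable parameters a LoRA adapter adds and the effective batch size. The function names are hypothetical, and the layer dimensions in the example (4096-wide q/o projections, 4096-to-1024 k/v projections, 32 layers) are illustrative assumptions rather than values read from any model config; check your model's config before relying on them.

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int = 32, r: int = 16) -> float:
    """Factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return lora_alpha / r


def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus


# Assumed shapes for illustration only: q_proj and o_proj are 4096 -> 4096,
# k_proj and v_proj are 4096 -> 1024, repeated over 32 transformer layers.
per_layer = 2 * lora_params(4096, 4096) + 2 * lora_params(4096, 1024)
total_adapter_params = per_layer * 32           # 13,631,488 under these assumptions
batch = effective_batch_size(per_device=4, accumulation_steps=4)  # 16
```

Under these assumed shapes the adapter trains roughly 13.6M parameters, a small fraction of a 7B-parameter base model, which is why LoRA fits on a single GPU.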

claude "add a fine-tuning script for instruction following.
Base model: mistralai/Mistral-7B-Instruct-v0.3 (load in 4-bit with bitsandbytes).
Dataset: JSONL in data/train.jsonl with 'instruction' and 'response' fields.
Format as Alpaca-style prompt template; tokenize to max_length=2048.
LoRA: r=16, alpha=32, target q_proj + v_proj.
Train 3 epochs, save checkpoint every 500 steps.
Evaluate: ROUGE-L score on data/eval.jsonl every epoch.
Push final adapter to HuggingFace Hub as {username}/{project}-lora."
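The data-formatting step in this prompt might look like the following sketch, which turns JSONL records with 'instruction' and 'response' fields into rows with a single 'text' column. The template is the no-input variant of the Alpaca prompt; the function names are hypothetical, and tokenization to max_length=2048 is left to the trainer.

```python
import json

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def format_record(record: dict) -> dict:
    """Map one {'instruction', 'response'} record to a {'text'} row."""
    return {
        "text": ALPACA_TEMPLATE.format(
            instruction=record["instruction"].strip(),
            response=record["response"].strip(),
        )
    }


def format_jsonl_lines(lines) -> list[dict]:
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [format_record(json.loads(line)) for line in lines if line.strip()]


example_lines = [
    '{"instruction": "Name the capital of France.", "response": "Paris"}',
    "",
]
rows = format_jsonl_lines(example_lines)
```

In a real script the lines would come from `open("data/train.jsonl")`, and the resulting list could be wrapped in a HuggingFace Dataset before being passed to the trainer.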

Jupyter & Notebook Workflows

Workflow	Best practice with Claude Code
Exploratory analysis	"Write exploration code in notebooks/explore.ipynb; extract reusable functions to src/utils/ — tell Claude to write to the .py file, not the notebook"
Jupytext sync	"Use Jupytext to sync notebook.ipynb ↔ scripts/notebook.py; Claude edits the .py file; Jupyter auto-syncs the notebook"
Papermill execution	"Parametrize notebooks with papermill; run with: papermill template.ipynb output.ipynb -p param value"
nbconvert for CI	"Convert notebook to script for CI: jupyter nbconvert --to script --execute notebook.ipynb (executing the notebook first catches runtime errors)"
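To see why the .py mirror is the easier surface for Claude to edit, it helps to know that an .ipynb file is JSON with a list of cells. The sketch below pulls the code cells out of an nbformat-4 notebook using only the standard library. It is a rough illustration, not a replacement for nbconvert or Jupytext: it ignores magics, outputs, and metadata, and the function name is hypothetical.

```python
import json


def notebook_to_script(nb_json: str) -> str:
    """Concatenate the code cells of an nbformat-4 notebook into one script."""
    nb = json.loads(nb_json)
    parts = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        parts.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(parts) + "\n"


example_nb = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import math\n", "x = math.sqrt(16)"]},
        {"cell_type": "code", "source": "print(x)"},
    ],
})
script = notebook_to_script(example_nb)
```

The resulting script contains only the two code cells, separated by a blank line, which is the kind of plain Python file the workflows in the table hand to Claude.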

Experiment Tracking

claude "add W&B experiment tracking to the training script.
Init run: wandb.init(project=cfg.project, config=cfg, tags=[cfg.model_name, cfg.dataset]).
Log each epoch: wandb.log({'train/loss': ..., 'val/loss': ..., 'val/accuracy': ..., 'epoch': n}).
Log best model as W&B artifact: artifact.add_file(checkpoint_path).
Add a --no-wandb CLI flag for offline runs.
Log gradient norms and learning rate every 100 steps for debugging."
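One way the --no-wandb flag from this prompt could be wired up is a small tracker interface with an in-memory fallback. The class and function names here are hypothetical; the W&B wrapper uses only wandb.init and wandb.log, and imports the package lazily so offline runs do not need it installed.

```python
import argparse


class NullTracker:
    """Used with --no-wandb: keeps metrics in memory instead of sending them."""

    def __init__(self) -> None:
        self.history: list[dict] = []

    def log(self, metrics: dict, step: int) -> None:
        self.history.append({"step": step, **metrics})


class WandbTracker:
    """Thin wrapper around W&B logging."""

    def __init__(self, project: str, config: dict) -> None:
        import wandb  # imported lazily so --no-wandb runs work without the package
        self._wandb = wandb
        wandb.init(project=project, config=config)

    def log(self, metrics: dict, step: int) -> None:
        self._wandb.log(metrics, step=step)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-wandb", action="store_true", help="disable W&B logging")
    return parser


def make_tracker(args: argparse.Namespace, config: dict):
    if args.no_wandb:
        return NullTracker()
    return WandbTracker(project=config["project"], config=config)
```

The training loop then calls `tracker.log({"train/loss": loss, "val/loss": val_loss}, step=epoch)` without knowing which tracker is active, which also makes the loop testable on CPU.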

claude "add MLflow tracking to the training script (alternative to W&B).
Set experiment: mlflow.set_experiment(cfg.experiment_name).
Log params at start: mlflow.log_params(cfg.__dict__).
Log metrics each epoch with step number.
Log model artifact using mlflow.pytorch.log_model.
Add MLFLOW_TRACKING_URI env var support (default: http://localhost:5000).
Include mlflow ui --port 5000 in CLAUDE.md commands."

5 Tips for AI/ML + Claude Code

Always specify your GPU model and VRAM in CLAUDE.md: "GPU: RTX 4090 24GB". Claude will generate appropriate batch sizes, gradient accumulation settings, and precision flags (bf16 on Ampere or newer GPUs such as the A100 and RTX 4090, fp16 on older GPUs such as the V100) instead of using defaults that may cause OOM errors.
Paste your pyproject.toml or requirements.txt into CLAUDE.md. ML libraries have frequent breaking changes between versions (transformers 4.x vs 3.x, LangChain 0.1 vs 0.3). Specifying exact versions prevents Claude from using deprecated APIs.
For LLM API integrations in your ML project, use the Anthropic SDK directly instead of LangChain when the chain complexity is low — simpler code, better error messages, and lower latency. Specify in CLAUDE.md: "Use anthropic SDK directly for single-turn calls; LangChain only for multi-step chains."
Add a tests/test_shapes.py that verifies model forward pass output shapes with a tiny synthetic batch. Shape mismatches are among the most common ML bugs, and catching them in a unit test that runs in milliseconds beats discovering them after 2 hours of GPU training.
Seed everything in a single set_seed(42) utility function and call it at the top of every training script. Tell Claude in CLAUDE.md: "Call utils.set_seed(cfg.seed) at the start of every training and evaluation script." Reproducibility failures waste days of debugging.
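A minimal sketch of such a utility follows. The optional-import checks are an assumption made so the example also runs where numpy or torch is not installed; in a real PyTorch project you would likely import both unconditionally.

```python
import importlib.util
import random


def set_seed(seed: int = 42) -> None:
    """Seed Python, and numpy / torch when they are installed."""
    random.seed(seed)
    if importlib.util.find_spec("numpy") is not None:
        import numpy as np
        np.random.seed(seed)
    if importlib.util.find_spec("torch") is not None:
        import torch
        torch.manual_seed(seed)
        # Safe to call on CPU-only machines; it is ignored when CUDA is unavailable.
        torch.cuda.manual_seed_all(seed)


set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()  # identical to first_draw because the seed was reset
```

Seeding makes random draws repeatable, but it does not by itself guarantee bit-identical GPU results, since some CUDA kernels are nondeterministic.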