Practical AI: What Actually Works in Production

Everyone has seen impressive AI demos. A chatbot that sounds human. A model that writes code. A system that summarises documents in seconds. The demos are real — but building AI products that work reliably at scale is an entirely different challenge.

Over the past two years, we've shipped AI-powered features across fintech, e-commerce, and SaaS products. Here's what we've actually learned — not the polished version, but the real lessons from production.

The Demo-to-Production Gap

The hardest part of building AI products isn't the AI. It's everything around it. Demos work under ideal conditions — curated inputs, forgiving evaluation, hand-picked examples. Production is the opposite: messy data, edge cases, adversarial users, and latency budgets measured in milliseconds.

The first thing we tell every new client: your AI feature is not done when it works on your test set. It's done when it handles inputs you didn't anticipate, fails gracefully when it's wrong, and stays fast enough that users don't notice it's there.

Retrieval-Augmented Generation (RAG) is Not Magic

RAG has become the default architecture for giving LLMs access to private data. Connect a vector database, embed your documents, retrieve relevant chunks, inject them into the prompt. Simple in theory.
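That pipeline can be sketched in a few lines. This is a minimal, hypothetical illustration using plain cosine similarity over pre-computed embedding vectors; a real system would use a vector database and an embedding API rather than in-memory lists:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_emb: list[float], doc_embs: list[list[float]],
             docs: list[str], k: int = 3) -> list[str]:
    """Return the k document chunks most similar to the query embedding."""
    ranked = sorted(zip(doc_embs, docs),
                    key=lambda pair: cosine(query_emb, pair[0]),
                    reverse=True)
    return [doc for _, doc in ranked[:k]]

def inject_context(question: str, chunks: list[str]) -> str:
    """Build the final prompt with retrieved chunks injected."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The function names and prompt wording here are illustrative, not a prescribed API — the point is only how few moving parts the happy path has.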

In practice, we've seen RAG implementations fail for these common reasons:

  • Chunking strategy matters enormously. Splitting on fixed character counts loses context. Semantic chunking and overlap are worth the extra complexity.
  • Retrieval quality ≠ generation quality. Retrieving the right documents is step one. Prompting the model to use them well is step two.
  • Embedding model drift. If you update your embedding model, re-embed everything. Mixing embeddings from different models silently breaks retrieval.
  • Reranking improves accuracy. A cross-encoder reranker on top of vector search meaningfully improves results for most use cases.
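To make the chunking point concrete, here is a toy fixed-size chunker with overlap. It is a sketch, not production code: real semantic chunking would split on sentence or section boundaries rather than raw character offsets, but even this overlap keeps context from being lost exactly at a chunk boundary:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so context spans chunk boundaries.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so the last `overlap` characters of one chunk repeat at the start
    of the next.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```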

Latency is a Feature

Users will tolerate a lot from AI — occasional errors, imperfect formatting, short answers. They will not tolerate slow responses. In our experience, anything over 3 seconds kills engagement on chat-style interfaces.

Practical strategies we use to keep latency down:

  • Stream responses. First-token latency matters more than total time. Start showing output immediately.
  • Use smaller models where possible. A well-prompted GPT-4o-mini or Gemini Flash is fast and cheap. Reach for larger models only when quality demands it.
  • Cache aggressively. Deterministic queries (same input → same output) are cacheable. Semantic caching with embeddings can extend this further.
  • Async where you can. If the AI result isn't needed in real-time, generate it in the background.

Evaluation is Non-Negotiable

AI systems that aren't evaluated regress silently. A prompt tweak that improves one case often breaks another. Model updates change behaviour. You need an eval pipeline before you ship, not after.

We build evals for every AI feature we ship. They don't need to be complex — a hundred golden examples with expected outputs and a scoring script is enough to catch regressions. Run them on every deploy.
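A scoring script really can be that small. This sketch assumes a hypothetical golden file format — a JSON list of `{"input": ..., "expected": ...}` records — and takes whatever function calls your model. Exact-match scoring is crude; swap in fuzzy matching or LLM-graded checks where your outputs are free-form:

```python
import json

def run_evals(golden_path: str, generate) -> float:
    """Score a model function against a file of golden examples.

    `generate` takes an input string and returns the model's output.
    Returns the fraction of examples whose output exactly matches the
    expected string (after stripping whitespace).
    """
    with open(golden_path) as f:
        examples = json.load(f)
    passed = 0
    for ex in examples:
        output = generate(ex["input"])
        if output.strip() == ex["expected"].strip():
            passed += 1
    score = passed / len(examples)
    print(f"{passed}/{len(examples)} passed ({score:.0%})")
    return score
```

Wiring this into CI so a deploy fails below a score threshold is the cheapest regression guard we know of.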

"If you're not measuring it, you're not shipping AI — you're shipping hope."

Prompting is Engineering

Prompt engineering gets a bad reputation because it sounds like magic incantations. It's not. It's software engineering applied to natural language interfaces. Good prompts are structured, versioned, tested, and reviewed like any other code.

A few patterns that consistently work:

  • Be explicit about output format. JSON schema, markdown structure, or plain text — specify it.
  • Give examples (few-shot). Two or three examples in the prompt are often worth more than a paragraph of instructions.
  • Separate the persona from the task. "You are X. Your task is Y." is clearer than mixing them.
  • Tell the model what to do when it doesn't know. "If you don't have enough information, say so" prevents confident hallucination.
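The four patterns above compose naturally into a single template. The persona, task, format spec, and fallback wording below are hypothetical examples, not a recommended prompt — the structure is the point:

```python
def build_prompt(question: str, context: str) -> str:
    """Assemble a prompt applying the patterns above: separate persona
    and task, explicit output format, and a fallback instruction for
    when the model lacks information."""
    return (
        "You are a support assistant for an internal knowledge base.\n"
        "Your task is to answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Respond in markdown: a short answer, then a bullet list of the "
        "context passages you relied on.\n"
        "If the context does not contain enough information, reply exactly: "
        "\"I don't have enough information to answer that.\""
    )
```

Because the template is plain code, it can be versioned, diffed, and covered by the eval pipeline like any other function.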

The Bottom Line

AI in production is a systems problem as much as a model problem. The teams that ship reliable AI products are the ones who invest in evaluation, treat latency as a product requirement, and build the scaffolding around the model as carefully as the model integration itself.

If you're building an AI product and want to talk through your architecture, get in touch. We'd love to help.