LangChain & LlamaIndex: Build with LLMs · AI Engineer

Why frameworks exist

You want to build a chatbot that answers questions about your company's documentation. Sounds simple enough. But then you start listing what you actually need: loading PDFs, Markdown, and HTML files. Splitting them into chunks. Generating embeddings. Storing them in a vector database. Accepting user questions. Searching for relevant chunks. Combining chunks with the question into a prompt. Sending it to an LLM. Formatting the response. Keeping conversation history. Handling errors.

That's a lot of plumbing. And you haven't written a single line of business logic yet.

This is why LangChain and LlamaIndex exist. They give you building blocks for all that infrastructure, so you can focus on what your app actually does.

LangChain: the orchestration framework

LangChain is the Swiss Army knife of LLM development. It started in late 2022 and quickly became one of the most widely used frameworks for building LLM applications.

The core idea: break LLM workflows into composable pieces and snap them together.

Chains

A chain is a sequence of steps. The simplest one takes a prompt template, fills in variables, sends it to an LLM, and returns the result.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
 
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates {input_language} to {output_language}."),
    ("human", "{text}")
])
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
chain = prompt | llm  # The pipe operator chains them together
 
result = chain.invoke({
    "input_language": "English",
    "output_language": "French",
    "text": "I love programming"
})

That | operator is LangChain's way of connecting steps. Output of one becomes input of the next. You can chain as many steps as you want — prompt, LLM, parser, another prompt, another LLM.
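
To make the piping idea concrete, here is a minimal sketch of how an operator like | can be implemented in plain Python. This is not LangChain's actual code. The Step class and the three stand-in steps (fill_prompt, fake_llm, strip_brackets) are hypothetical names made up for illustration, and no real model is called.

```python
class Step:
    """Wraps a plain function so steps can be joined with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # self | other -> a new Step that runs self first, then other
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for a prompt template, a model, and a parser
fill_prompt = Step(lambda variables: f"Translate to French: {variables['text']}")
fake_llm = Step(lambda prompt: f"[model reply to: {prompt}]")
strip_brackets = Step(lambda reply: reply.strip("[]"))

toy_chain = fill_prompt | fake_llm | strip_brackets
result = toy_chain.invoke({"text": "I love programming"})
print(result)
```

Each | builds a new step out of two existing ones, which is why a chain of any length still exposes a single invoke method.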

Tools

Tools let your LLM interact with the outside world. A "tool" is just a function the model can decide to call.

from langchain_core.tools import tool
 
@tool
def search_database(query: str) -> str:
    """Search the product database for items matching the query."""
    # Your database logic here
    return f"Found 3 results for: {query}"
 
@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
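    # WARNING: eval() executes arbitrary Python. Fine for a local demo, unsafe
    # for model- or user-supplied input; use a restricted math parser instead.
    return str(eval(expression))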

The model reads the tool descriptions and decides when to use them. Ask it "what's 15% of 2340?" and it'll call the calculator. Ask it "show me red shoes under 50 dollars" and it'll search the database.
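
Because the model chooses the argument it passes to a tool, a calculator built on eval will run whatever string it is handed. One possible safer variant, sketched here with only the standard library, walks the parsed expression and allows nothing but numbers and basic arithmetic. The name safe_calculate is an assumption for this example, and the function could be wrapped with @tool the same way as the functions above.

```python
import ast
import operator

# Only these arithmetic operators are allowed
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def evaluate(node):
        # Numeric literals such as 15 or 2.5
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        # Binary operations such as a + b or a / b
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        # Unary plus and minus, such as -a
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, function calls, attribute access, etc. are all rejected
        raise ValueError("Unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


print(safe_calculate("2340 * 15 / 100"))  # 15% of 2340 -> 351.0
```

Anything that is not a number or one of the listed operators raises ValueError, so a string like "__import__('os')" never gets executed.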

Memory

By default, each LLM call is stateless. The model doesn't remember what you said two messages ago. Memory fixes that by storing conversation history and injecting it into each new prompt.

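from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

# The prompt needs a placeholder where stored messages get injected
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{text}"),
])
histories = {}  # one ChatMessageHistory per session_id

chain_with_memory = RunnableWithMessageHistory(
    chat_prompt | llm,
    lambda session_id: histories.setdefault(session_id, ChatMessageHistory()),
    input_messages_key="text",
    history_messages_key="history",
)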

Different memory strategies exist — store everything, summarize old messages, keep only the last N turns. The right choice depends on your context window budget.
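
As a rough illustration of the "keep only the last N turns" strategy, here is a framework-free sketch. It assumes messages are plain dicts with role and content keys, as in the OpenAI chat format, and trim_history is a hypothetical helper name rather than a LangChain function.

```python
def trim_history(messages, max_turns):
    """Keep system messages plus the last `max_turns` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # One turn is a user message plus the assistant reply, hence 2 * max_turns
    recent = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + recent


messages = [{"role": "system", "content": "You are a helpful assistant."}]
for n in range(1, 4):
    messages.append({"role": "user", "content": f"Question {n}"})
    messages.append({"role": "assistant", "content": f"Answer {n}"})

trimmed = trim_history(messages, max_turns=2)
for message in trimmed:
    print(message["role"], "->", message["content"])
```

The oldest turn is dropped while the system message survives, which keeps the prompt size bounded no matter how long the conversation runs.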

Output parsers

LLMs return text. But your app probably wants structured data. Output parsers extract JSON, lists, or specific fields from the model's response.

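from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate

parser = JsonOutputParser()
# This chain needs its own prompt with a {question} variable
json_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer with valid JSON only."),
    ("human", "{question}"),
])
json_chain = json_prompt | llm | parser

# Returns parsed JSON (a Python dict or list) instead of raw text
result = json_chain.invoke({"question": "List 3 programming languages"})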
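
Under the hood, a JSON parser mostly has to cope with models that wrap their JSON in extra chatter. The sketch below shows one way to handle that with the standard library alone. It is not how JsonOutputParser is implemented; parse_json_reply and the sample reply are made up for illustration.

```python
import json


def parse_json_reply(reply: str):
    """Parse the first JSON object or array found in a model reply."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(reply):
        if char in "{[":
            try:
                # raw_decode parses one JSON value and ignores trailing text
                value, _end = decoder.raw_decode(reply[start:])
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here, try the next bracket
    raise ValueError("No JSON found in model reply")


# A hypothetical reply with text before and after the JSON payload
reply = 'Sure! Here you go:\n{"languages": ["Python", "Rust", "Go"]}\nAnything else?'
parsed = parse_json_reply(reply)
print(parsed["languages"])
```

If no valid JSON is found the function raises ValueError, which gives the calling code a clear point to retry the request or fall back.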

LlamaIndex: the data framework

If LangChain is about orchestrating LLM workflows, LlamaIndex is about connecting LLMs to your data. It was built specifically for the retrieval-augmented generation (RAG) use case.

Document loaders

LlamaIndex has loaders for just about everything — PDFs, Word docs, Notion pages, Slack messages, databases, websites, GitHub repos, Google Drive. There's even a community hub (LlamaHub) with hundreds of additional loaders.

from llama_index.core import SimpleDirectoryReader
 
# Load all documents from a folder
documents = SimpleDirectoryReader("./my_docs").load_data()

One line and your PDFs are loaded, parsed, and ready to be indexed.

Indexes

An index is how LlamaIndex organizes your data for fast retrieval. The most common type is a vector store index — it generates embeddings for each chunk and stores them so you can search by meaning.

from llama_index.core import VectorStoreIndex
 
index = VectorStoreIndex.from_documents(documents)

Behind the scenes, this splits your documents into chunks, generates an embedding vector for each chunk using an embedding model, and stores everything in memory (or a vector database if you configure one).
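
The three steps described above (chunk, embed, store) can be sketched in plain Python. This is a toy, not LlamaIndex's implementation: the "embedding" is just a word count, and ToyVectorIndex, split_into_chunks, and the chunk sizes are all assumptions chosen to keep the example small. A real index would call an embedding model at the embed step.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that overlap by `overlap`."""
    words = text.split()
    step = chunk_size - overlap  # must stay positive
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy embedding: lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    def __init__(self, documents):
        # Chunk every document, then embed every chunk once, up front
        self.chunks = [c for doc in documents for c in split_into_chunks(doc)]
        self.vectors = [embed(c) for c in self.chunks]

    def search(self, query, top_k=1):
        # Embed the query the same way and rank chunks by similarity
        query_vector = embed(query)
        scored = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine_similarity(query_vector, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _vector in scored[:top_k]]


toy_index = ToyVectorIndex([
    "Refunds are accepted within 30 days of purchase.",
    "Passwords can be reset from the account settings page.",
])
top_chunks = toy_index.search("How do I reset my password?")
print(top_chunks[0])
```

Word counts only match exact words, so "password" does not match "passwords" here. Real embedding models capture meaning rather than spelling, which is the main thing a vector store index buys you.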

Query engines

A query engine ties it all together. You ask a question, it finds the most relevant chunks, and sends them to the LLM as context.

query_engine = index.as_query_engine()
response = query_engine.query("What is our refund policy?")
print(response)

The model's answer is grounded in your actual documents. If the refund policy says "30 days," the model says "30 days" — not whatever it learned during pre-training.
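
The grounding comes from the prompt itself: the retrieved chunks are pasted in next to the question. A minimal sketch of that assembly step is below. The wording of the instructions and the build_rag_prompt name are assumptions for illustration; LlamaIndex ships its own default templates.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number the chunks so the model (and you) can tell them apart
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


rag_prompt = build_rag_prompt(
    "What is our refund policy?",
    ["Refunds are accepted within 30 days of purchase."],
)
print(rag_prompt)
```

Telling the model to admit when the context lacks an answer is one common way to reduce made-up responses, though it is not a guarantee.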

LangChain vs LlamaIndex

They're not competitors. They solve different problems, and many projects use both.

Aspect	LangChain	LlamaIndex
Focus	Workflow orchestration	Data retrieval
Best for	Chaining LLM calls, agents, tools	Q and A over documents, RAG
Abstraction level	General purpose	Data-focused
Agents	Built-in agent framework	Basic agent support
Data connectors	Some loaders	100+ data loaders
Learning curve	Steeper (more concepts)	Gentler (focused scope)

Use LangChain when you need complex multi-step workflows, tool-calling agents, or are connecting multiple LLMs and APIs together.

Use LlamaIndex when your main goal is answering questions over a collection of documents or building a search-over-your-data feature.

Use both when you want LlamaIndex for the data retrieval piece and LangChain for the overall application orchestration.

Start with the simplest tool

If your app is mostly "ask questions about my docs," LlamaIndex alone might be all you need. Don't add LangChain complexity unless your workflow actually requires it.

Building a Q and A bot

Here's a practical example: a Q and A bot that answers questions about your documentation.

# Using LlamaIndex for the retrieval
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
 
documents = SimpleDirectoryReader("./company_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
 
# Simple usage
response = query_engine.query("How do I reset my password?")
print(response)

That's a working Q and A system in about half a dozen lines. It loads your company docs, builds an index, and answers questions grounded in those docs.

For a more production-ready version with conversation history:

from llama_index.core.chat_engine import CondensePlusContextChatEngine
 
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(),
)
 
# Multi-turn conversation
response1 = chat_engine.chat("What's the refund policy?")
response2 = chat_engine.chat("Does that apply to digital products too?")

The chat engine remembers the conversation context, so "that" in the second question correctly refers to the refund policy.

When to use a framework vs raw API calls

Not every project needs a framework. Sometimes a direct API call is simpler and clearer.

Use raw API calls when:

Your workflow is simple (one prompt in, one response out)
You want full control over every detail
You're building something the frameworks don't support well
You want minimal dependencies

Use a framework when:

You're building RAG or document Q and A
You need conversation memory
Your workflow has multiple steps that depend on each other
You want agent behavior (tool calling, planning)
You don't want to write boilerplate for chunking, embedding, and retrieval

# Raw API call — simple and direct
from openai import OpenAI
client = OpenAI()
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

For a straightforward "send prompt, get response" use case, the raw API takes a handful of lines and zero framework overhead. Don't overcomplicate things.

The framework debate

Frameworks get criticized sometimes. "Too much abstraction." "Changes every week." "I could write this in 50 lines."

Some of that criticism is fair. Early versions of LangChain especially were known for deep abstraction layers that made debugging painful. You'd get an error five layers deep and have no idea where it came from.

Both frameworks have matured significantly. LangChain simplified its API with LCEL (LangChain Expression Language). LlamaIndex streamlined its indexing pipeline. The abstractions are cleaner now.

But the core question remains: is the framework saving you time, or adding complexity you don't need?

My take: start without a framework. Build the simplest version with raw API calls. When you find yourself writing the same boilerplate over and over — document chunking, embedding storage, conversation history — that's when a framework pays off.

Alternatives worth knowing

LangChain and LlamaIndex are the biggest players, but not the only ones. Haystack focuses on search and RAG with a clean API. Semantic Kernel (Microsoft) integrates well with Azure. CrewAI is built for multi-agent workflows. Instructor is a lightweight library just for structured LLM outputs. Vercel AI SDK targets JavaScript/TypeScript developers building AI into web apps.

The ecosystem moves fast. The important thing isn't which framework you pick — it's understanding the patterns (chains, retrieval, memory, tools) so you can use any of them.

What's next?

You've got the frameworks. You've got the models. You've built a Q and A bot that actually works.

But right now everything runs on your laptop. What happens when real users start hitting it? How do you turn a Python script into an API that handles hundreds of concurrent requests without falling over?

In the next article, we'll tackle serving AI models — wrapping them in FastAPI, managing GPU memory, handling batching, and taking the leap from a Jupyter notebook to a production REST API.