Build a RAG System with Dify (Step-by-Step Guide)

Introduction

In this tutorial, you will build a complete Retrieval-Augmented Generation (RAG) system using Dify. By the end, you’ll have a working AI application that answers user queries by retrieving relevant chunks from your own documents and generating informed responses with a large language model. This hands-on guide focuses purely on implementation, from knowledge base creation to testing the final pipeline.

Prerequisites

A running Dify instance (local or cloud). If you haven’t installed it yet, follow the Dify Installation Guide.
An LLM provider API key configured in Dify (e.g., OpenAI, Azure OpenAI, Anthropic).
Basic familiarity with embeddings and vector search concepts.
One or more text‑based documents (PDF, TXT, Markdown, DOCX) to use as the knowledge source.

Step 1: Create a Dify App

Log in to your Dify workspace.
Navigate to Studio and click Create from Template or Create App.
Select the Chat App template. This provides a conversational interface suitable for RAG.
Name your app (e.g., “Document Q&A”) and confirm.

A blank chat application canvas opens, ready for configuration.

Step 2: Create a Knowledge Base

Dify manages documents through a knowledge base (KB). The KB will handle ingestion, chunking, and embedding of your files.

In the top navigation bar, switch to the Knowledge tab.
Click Create Knowledge.
Provide a name (e.g., “Company Manuals”) and optional description.
Upload your documents by dragging them into the upload area. Supported formats include .pdf, .txt, .md, .docx, .html, and more.
Configure the Chunking strategy:
- Automatic – Dify splits documents based on model-optimized parameters. Suitable for general use.
- Custom – Allows you to set chunk size (e.g., 500 tokens) and overlap (e.g., 50 tokens). Use this for fine-grained control over retrieval precision.
Choose an Embedding model:
- If you already configured an embedding provider (e.g., text-embedding-ada-002 from OpenAI, or a local model), select it here.
- Embeddings are generated automatically once you save the knowledge base.
Click Save & Embed. Dify processes the documents, splits them into chunks, and stores vector representations in the connected vector database. This can take a few seconds to minutes depending on document size.

After completion, the knowledge base status changes to Ready.
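To make the Custom chunking parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not Dify's actual splitter: the function name `chunk_text` is hypothetical, and whitespace-separated words stand in for tokens as a simplifying assumption.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Assumption: one whitespace-separated word counts as one token.
    A real tokenizer would give different chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(12))
    for chunk in chunk_text(sample, chunk_size=5, overlap=2):
        print(chunk)
```

Each chunk repeats the last `overlap` words of the previous one, which is why a sentence that straddles a boundary can still be retrieved in one piece.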

Step 3: Configure the RAG Pipeline

Now, connect the knowledge base to your chat app and set retrieval parameters.

Return to the App Settings tab (the original chat canvas).
Under Context, enable Knowledge Retrieval.
Click Add Knowledge and select the knowledge base you just created.
Configure retrieval settings:
- Top‑K – Number of document chunks retrieved for each query. Start with 3. Higher values add more context but also noise.
- Score threshold – Minimum similarity score (0–1) for a chunk to be included. A value of 0.5 works well initially; lower it if results are sparse.
- Re‑rank (optional) – Enable a re‑ranking model if your provider supports it, to improve result relevance after retrieval.
Context injection – By default, Dify inserts the retrieved chunks into the system prompt. You can customize the prompt template to control how context is presented to the LLM; the default works for most cases.

The RAG pipeline is now active: every user message will trigger a semantic search against the knowledge base and feed the results to the model.
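The interaction between Top‑K and the score threshold can be sketched in a few lines. This is an illustration only, assuming cosine similarity as the score and tiny hand-written vectors; the function names are hypothetical and Dify's internal scoring may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, top_k=3, score_threshold=0.5):
    """Return up to top_k (score, text) pairs whose score meets the threshold.

    chunks is a list of (text, vector) pairs; in a real system the vectors
    come from the embedding model, here they are supplied by hand.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    toy_chunks = [
        ("reset procedure", [1.0, 0.0]),
        ("warranty terms", [0.0, 1.0]),
        ("factory settings", [2.0, 1.0]),
        ("battery care", [1.0, 1.0]),
    ]
    for score, text in retrieve([1.0, 0.0], toy_chunks, top_k=2):
        print(f"{score:.3f}  {text}")
```

The threshold is applied before the Top‑K cut, so a strict threshold can return fewer than Top‑K chunks, or none at all.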

Step 4: Connect the LLM

If you haven’t already, configure the language model that will generate final answers.

Go to Settings → Model Provider (workspace‑level) and ensure your API key is added.
Back in the app, open App Settings and go to Model.
Choose the provider and model (e.g., gpt-4o, claude-3-opus, or llama3 via Ollama).
Set parameters like Temperature (0.3 for factual Q&A) and Max Tokens.
In the Prompt section, review the generated prompt. Dify automatically includes a {{#context#}} variable that inserts the retrieved chunks. You can edit the prompt to add instructions like:
```
Use the following context to answer the question. If you don't know, say so.
Context: {{#context#}}
```

The app now has a complete data flow: query → retrieval → context injection → LLM generation.
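Conceptually, context injection is a string substitution on the prompt template. The sketch below mimics that step outside Dify; `build_prompt` is a hypothetical helper, and joining chunks with blank lines is an assumption about formatting rather than Dify's documented behavior.

```python
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If you don't know, say so.\n"
    "Context: {{#context#}}"
)


def build_prompt(template: str, chunks: list[str]) -> str:
    """Replace the {{#context#}} placeholder with the retrieved chunks.

    Assumption: chunks are joined with a blank line between them.
    """
    context = "\n\n".join(chunks)
    return template.replace("{{#context#}}", context)


if __name__ == "__main__":
    retrieved = [
        "Hold the power button for 10 seconds.",
        "The device restarts with default settings.",
    ]
    print(build_prompt(PROMPT_TEMPLATE, retrieved))
```

If the template has no placeholder, the substitution is a no-op and the model receives the instructions without any retrieved text.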

Step 5: Build the Chat Interface

The Chat App mode already provides a functional UI. No extra frontend work is needed.

In App Settings, you can customize the opening message (“Hi, ask me anything about the documents.”) and suggested questions.
Enable Conversation history to let the model refer to previous exchanges during multi‑turn conversations.

To test the interface immediately, click Publish (green button) and open the published app link, or use the embedded preview inside the studio.

Architecture Overview

The RAG system you just built follows this flow:

User Query → Dify App → Knowledge Retrieval → Vector DB (chunks) → Context Assembly → LLM → Response

Knowledge Retrieval – Transforms the user query into an embedding using the same model as the knowledge base, then performs similarity search against the vector database.
Context Assembly – The top‑K chunks are inserted into the prompt template.
LLM – Generates a final answer grounded in the provided context.

Everything runs within Dify’s orchestration layer, eliminating the need to manually code retrieval logic.

Testing the RAG System

Open the published app (or use the Studio preview).
Send queries that relate directly to your uploaded documents. For example, if you uploaded a product manual, ask: “How do I reset the device to factory settings?”
Verify that:
- The response contains information from your documents.
- The response does not add details that contradict the documents when the answer is present in them.
- If the answer is not in the documents, the model returns a fallback (e.g., “I can’t find that information in the provided documents”) based on your prompt instructions.

To confirm that retrieval is working, click the View Logs button in the app’s left sidebar and inspect a conversation. The retrieved document sections are displayed alongside the model’s response, so you can see which chunks were used.

Troubleshooting

Poor retrieval results (irrelevant chunks returned)

Increase chunk size or enable overlap in the knowledge base settings to capture more context per chunk.
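If irrelevant chunks are returned, raise the score threshold or reduce Top‑K to filter out weak matches. If relevant chunks are being missed, lower the threshold or increase Top‑K instead.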
Switch to a more powerful embedding model (e.g., from text-embedding-ada-002 to text-embedding-3-large).
Preprocess documents to remove poorly formatted text or scanned images (Dify works best with clean, extractable text).

Embedding mismatch

If you changed the embedding model after creating a knowledge base, the new embeddings won’t match the existing vectors. Delete and recreate the knowledge base with the desired model, or use the Re‑embed option if available.
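The reason for the mismatch is that vectors from different embedding models live in different spaces, and often have different dimensions (text-embedding-ada-002 produces 1536-dimensional vectors, while text-embedding-3-large defaults to 3072). A guard like the following illustrates the check; the `StoredVector` type and the function name are hypothetical and not part of Dify.

```python
from dataclasses import dataclass


@dataclass
class StoredVector:
    """A vector plus the name of the embedding model that produced it."""
    model: str
    values: list[float]


def assert_same_embedding_model(query_model: str, stored: list[StoredVector]) -> None:
    """Raise ValueError if any stored vector came from a different model.

    Comparing dimensions alone is not enough: two models can share a
    dimension and still produce incomparable vectors.
    """
    mismatched = sorted({v.model for v in stored if v.model != query_model})
    if mismatched:
        raise ValueError(
            f"Query uses {query_model!r} but the index contains vectors from "
            f"{mismatched}; re-embed the documents with one model."
        )


if __name__ == "__main__":
    index = [StoredVector("text-embedding-ada-002", [0.1, 0.2])]
    try:
        assert_same_embedding_model("text-embedding-3-large", index)
    except ValueError as exc:
        print(exc)
```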

Empty responses or “I don’t know” even when data exists

Check that the knowledge retrieval is enabled and the correct knowledge base is attached.
Ensure the prompt template contains {{#context#}} (or the variable you defined). Without it, the model never sees the retrieved chunks.
Verify the vector database container (weaviate, qdrant, etc.) is running and healthy (for self‑hosted users: docker compose ps).

High latency in retrieval

Reduce the number of chunks (Top‑K) or disable re‑ranking.
Use a faster embedding model or a model that runs locally to reduce API call overhead.

Summary

You’ve built a fully operational RAG pipeline inside Dify. The system ingests your own documents, indexes them as vector embeddings, retrieves the most relevant parts at query time, and generates a grounded response using an LLM. From here, you can integrate the app into external systems via Dify’s API, embed it in a website with the chat widget, or extend the logic with workflows and agents.
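As a starting point for API integration, the sketch below sends one question to a published chat app using only the Python standard library. The `/chat-messages` path, the `Bearer` header, and the request fields reflect the commonly documented Dify chat API, but treat them as assumptions and confirm them on your app's API Access page; the base URL and key shown are placeholders.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, query: str,
                       user: str = "demo-user") -> urllib.request.Request:
    """Build a blocking chat request. Field names are assumptions; verify
    them against the API reference shown in your Dify app."""
    payload = {
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        "user": user,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat-messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(base_url: str, api_key: str, query: str) -> str:
    """Send the request and return the 'answer' field, or '' if absent."""
    request = build_chat_request(base_url, api_key, query)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body.get("answer", "")


if __name__ == "__main__":
    # Placeholder values: substitute your own instance URL and app API key.
    print(ask("http://localhost/v1", "app-your-key",
              "How do I reset the device to factory settings?"))
```

Keeping request construction separate from the network call makes it easy to inspect or unit-test the payload without a running Dify instance.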
