🧠 Building a RAG-Based Web Crawler to Answer Questions from Your Website Content
Tags: llm, langchain
In the age of AI, making your website content searchable and conversational is not just a nice-to-have; it's a superpower. Imagine if a user could ask your site, "How many blogs are there?", and get a precise answer pulled directly from your live content. In this blog, we'll walk through how to build just that using Python, LangChain, OpenAI, and a bit of web scraping magic.
🕸️ Step 1: Recursively Crawl a Website
We start by crawling all internal links from a given domain using requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def get_all_links_recursive(url, domain=None, visited=None):
    if visited is None:
        visited = set()
    if domain is None:
        domain = urlparse(url).netloc
    if url in visited:
        return visited
    visited.add(url)
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Follow every anchor that stays on the same domain and hasn't been seen yet
        for a_tag in soup.find_all('a', href=True):
            link = urljoin(url, a_tag['href'])
            parsed_link = urlparse(link)
            if parsed_link.netloc == domain and link not in visited:
                get_all_links_recursive(link, domain, visited)
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")
    return visited
This recursive function captures every reachable page under the same domain, so we can analyze the complete website.
✅ Input: Root URL (e.g., https://anshumansingh.me)
✅ Output: Set of all valid, internal links
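For example, the crawl starts from the root URL, and the resulting set feeds the next step (swap in your own domain):

all_links = get_all_links_recursive("https://anshumansingh.me")
print(f"Found {len(all_links)} internal pages")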
📄 Step 2: Load All Pages as Documents
Using LangChain's WebBaseLoader, we load the content from all gathered URLs.
from langchain_community.document_loaders import WebBaseLoader

all_links_list = list(all_links)
loader = WebBaseLoader(web_path=all_links_list)
docs = loader.load()
This gives us a list of documents that represent the website’s contents in raw HTML/text.
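Each item is a LangChain Document carrying the page text plus metadata such as the source URL, so a quick inspection confirms the crawl and load worked:

print(len(docs), "documents loaded")
print(docs[0].metadata)              # includes the page URL under "source"
print(docs[0].page_content[:200])    # first part of the extracted page text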
✂️ Step 3: Split Text into Chunks
Large documents can't be processed all at once, so we use a RecursiveCharacterTextSplitter to break them into manageable pieces.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
final_docs = splitter.split_documents(docs)
This ensures the language model can process each section effectively and retrieve relevant content with high accuracy.
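If you want neighbouring chunks to share some context at their boundaries, RecursiveCharacterTextSplitter also accepts a chunk_overlap argument; the values below are just reasonable starting points, not tuned numbers:

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
final_docs = splitter.split_documents(docs)
print(len(final_docs), "chunks created from", len(docs), "documents")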
🧠 Step 4: Create a Vector Store for Retrieval
Now we move into the world of vector databases! Using OpenAI's embedding model and FAISS, we turn our text chunks into vectors.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY to be set in your environment
db = FAISS.from_documents(documents=final_docs, embedding=embeddings)
This enables semantic search over the website content.
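Before wiring in the LLM, you can sanity-check retrieval on its own; similarity_search returns the chunks closest in meaning to a query:

hits = db.similarity_search("blog posts", k=3)
for doc in hits:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])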
🤖 Step 5: Build the Retrieval-Augmented Generation (RAG) Chain
We set up an LLM (GPT-4o) to answer questions, but only based on the indexed website content.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("...")  # the template must include a {context} placeholder
document_chain = create_stuff_documents_chain(llm, prompt)
retriever = db.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
Because the model only sees chunks retrieved from your site, its answers stay grounded in your content rather than in whatever it remembers from pretraining, which greatly reduces hallucinations.
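The prompt template is elided above; a minimal sketch of what it could contain is shown below (the exact wording is up to you, but create_stuff_documents_chain expects a {context} placeholder, and create_retrieval_chain supplies the user's question as {input}):

prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you don't know.

<context>
{context}
</context>

Question: {input}"""
)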
❓ Step 6: Ask Questions Like a Pro
Finally, you can ask your website anything:
response = retrieval_chain.invoke({"input": "how many blogs are found?"})
print(response["answer"])  # the returned dict also contains the retrieved "context" documents
This will return a response based strictly on the crawled content. You can adapt this for chatbots, search bars, or internal knowledge bases.
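As a sketch of how this might power a chatbot, here is a minimal command-line loop over the same chain (purely illustrative):

while True:
    question = input("Ask your site something (or 'quit'): ")
    if question.lower() == "quit":
        break
    result = retrieval_chain.invoke({"input": question})
    print(result["answer"])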
🧩 Wrapping Up
You’ve now built an intelligent web assistant that:
- Crawls and indexes any website
- Splits content intelligently
- Embeds data into a vector store
- Answers natural language queries using the latest LLMs
🔧 Tools Used:
- requests, BeautifulSoup – for crawling
- LangChain, FAISS – for document and vector processing
- OpenAI – for embedding and LLM querying
This setup is highly customizable. You can plug in other LLMs, add filtering, or even schedule regular crawls.
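For example, if you schedule regular crawls, you can persist the FAISS index and reload it between runs instead of re-embedding everything each time; the folder name here is just an example:

db.save_local("site_index")

# Later, in the process that serves queries:
db = FAISS.load_local(
    "site_index",
    OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,  # required by recent LangChain versions
)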
🚀 Ready to Make Your Site Conversational?
Clone this repo: https://github.com/anshuman-singh-93/webcrawler-rag