🧠 Building a RAG-Based Web Crawler to Answer Questions from Your Website Content
Tags: llm, langchain
In the age of AI, making your website content searchable and conversational is not just a nice-to-have; it's a superpower. Imagine if a user could ask your site, "How many blogs are there?", and get a precise answer pulled directly from your live content. In this blog, we'll walk through how to build just that using Python, LangChain, OpenAI, and a bit of web scraping magic.
🕸️ Step 1: Recursively Crawl a Website
We start by crawling all internal links from a given domain using requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def get_all_links_recursive(url, domain=None, visited=None):
    if visited is None:
        visited = set()
    if domain is None:
        domain = urlparse(url).netloc
    if url in visited:
        return visited
    visited.add(url)
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Follow every anchor that stays on the same domain and hasn't been seen yet
        for a_tag in soup.find_all('a', href=True):
            link = urljoin(url, a_tag['href'])
            parsed_link = urlparse(link)
            if parsed_link.netloc == domain and link not in visited:
                get_all_links_recursive(link, domain, visited)
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")
    return visited
This recursive function captures every reachable page under the same domain, so we can analyze the complete website.
✅ Input: Root URL (e.g., https://anshumansingh.me)
✅ Output: Set of all valid, internal links
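For example, the crawl starts from the root URL, and the resulting set feeds the next step (swap in your own domain):

all_links = get_all_links_recursive("https://anshumansingh.me")
print(f"Found {len(all_links)} internal pages")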
📄 Step 2: Load All Pages as Documents
Using LangChain's WebBaseLoader, we load the content from all gathered URLs.
from langchain_community.document_loaders import WebBaseLoader

all_links_list = list(all_links)
loader = WebBaseLoader(web_path=all_links_list)
docs = loader.load()
This gives us a list of documents that represent the website’s contents in raw HTML/text.
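Each item is a LangChain Document carrying the page text plus metadata such as the source URL, so a quick inspection confirms the crawl and load worked:

print(len(docs), "documents loaded")
print(docs[0].metadata)              # includes the page URL under "source"
print(docs[0].page_content[:200])    # first part of the extracted page text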
✂️ Step 3: Split Text into Chunks
Large documents can't be processed all at once, so we use a RecursiveCharacterTextSplitter to break them into manageable pieces.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
final_docs = splitter.split_documents(docs)
This ensures the language model can process each section effectively and retrieve relevant content with high accuracy.
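If you want neighbouring chunks to share some context at their boundaries, RecursiveCharacterTextSplitter also accepts a chunk_overlap argument; the values below are just reasonable starting points, not tuned numbers:

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
final_docs = splitter.split_documents(docs)
print(len(final_docs), "chunks created from", len(docs), "documents")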
🧠 Step 4: Create a Vector Store for Retrieval
Now we move into the world of vector databases! Using OpenAI's embedding model and FAISS, we turn our text chunks into vectors.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY to be set in your environment
db = FAISS.from_documents(documents=final_docs, embedding=embeddings)
This enables semantic search over the website content.
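Before wiring in the LLM, you can sanity-check retrieval on its own; similarity_search returns the chunks closest in meaning to a query:

hits = db.similarity_search("blog posts", k=3)
for doc in hits:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])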
🤖 Step 5: Build the Retrieval-Augmented Generation (RAG) Chain
We set up an LLM (GPT-4o) to answer questions, but only based on the indexed website content.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("...")  # the template must include a {context} placeholder
document_chain = create_stuff_documents_chain(llm, prompt)
retriever = db.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
Because the model only sees chunks retrieved from your site, its answers stay grounded in your content rather than in whatever it remembers from pretraining, which greatly reduces hallucinations.
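The prompt template is elided above; a minimal sketch of what it could contain is shown below (the exact wording is up to you, but create_stuff_documents_chain expects a {context} placeholder, and create_retrieval_chain supplies the user's question as {input}):

prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context below.
If the answer is not in the context, say you don't know.

<context>
{context}
</context>

Question: {input}"""
)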
❓ Step 6: Ask Questions Like a Pro
Finally, you can ask your website anything:
response = retrieval_chain.invoke({"input": "how many blogs are found?"})
print(response["answer"])  # the returned dict also contains the retrieved "context" documents
This will return a response based strictly on the crawled content. You can adapt this for chatbots, search bars, or internal knowledge bases.
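As a sketch of how this might power a chatbot, here is a minimal command-line loop over the same chain (purely illustrative):

while True:
    question = input("Ask your site something (or 'quit'): ")
    if question.lower() == "quit":
        break
    result = retrieval_chain.invoke({"input": question})
    print(result["answer"])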
🧩 Wrapping Up
You’ve now built an intelligent web assistant that:
- Crawls and indexes any website
- Splits content intelligently
- Embeds data into a vector store
- Answers natural language queries using the latest LLMs
🔧 Tools Used:
- requests, BeautifulSoup – for crawling
- LangChain, FAISS – for document and vector processing
- OpenAI – for embedding and LLM querying
This setup is highly customizable. You can plug in other LLMs, add filtering, or even schedule regular crawls.
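For example, if you schedule regular crawls, you can persist the FAISS index and reload it between runs instead of re-embedding everything each time; the folder name here is just an example:

db.save_local("site_index")

# Later, in the process that serves queries:
db = FAISS.load_local(
    "site_index",
    OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,  # required by recent LangChain versions
)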
🚀 Ready to Make Your Site Conversational?
Clone this repo: https://github.com/anshuman-singh-93/webcrawler-rag