Building Smarter Web Search Workflows

Learn how to integrate programmable web search into your AI workflows using tools like SerpAPI, Tavily, SearxNG, Exa, and Google CSE. This guide covers geo-targeted queries, proxy strategies, and building a live, LangChain-powered search-to-summarization pipeline with Streamlit and OpenAI.


Search is central to intelligent applications, whether it's for real-time data retrieval, summarization, research, or chatbot interactions.

In this post, we'll explore a variety of tools and APIs that enable programmable web search, show how to incorporate them into Python workflows, discuss proxying strategies for geographic targeting, and build a simple LangChain-powered pipeline for an end-to-end application.

Overview of Web Search Tools

Here's a breakdown of the most useful web search APIs and meta search engines:

| Tool | Description | Python Support | Geo-targeting Support | Free Tier Availability |
| --- | --- | --- | --- | --- |
| SerpApi | Google Search API wrapper | Yes | Yes (`location` param) | Yes (limited) |
| SearxNG | Open-source metasearch engine | Yes | Yes (via `language` param) | Yes (self-hosted) |
| Tavily | Fast, hosted web search API | Yes | Yes (`location` param) | Yes |
| Exa AI | AI-powered semantic search | Yes | Yes (`location` param) | Yes |
| Google Programmable Search | Custom Google-powered search | Yes | Yes (`gl`, `hl`) | Yes |

SerpAPI

SerpApi wraps Google Search and supports advanced result types such as rich snippets, shopping, and maps.

from serpapi import GoogleSearch

params = {
    "q": "latest AI tools",
    "location": "London, UK",
    "api_key": "YOUR_API_KEY"
}
search = GoogleSearch(params)
results = search.get_dict()
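The returned dictionary nests hits under an `organic_results` key (the same key the LangChain pipeline later in this post relies on). A minimal sketch of pulling out the useful fields:

```python
# Extract title/link/snippet from a SerpApi results dictionary.
# `results` is the dict returned by search.get_dict() above.
def extract_organic(results):
    entries = []
    for item in results.get("organic_results", []):
        entries.append({
            "title": item.get("title"),
            "link": item.get("link"),
            "snippet": item.get("snippet"),
        })
    return entries
```

Using `.get()` everywhere keeps the helper robust when SerpApi omits a field for a given result type.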

SearxNG

SearxNG is a privacy-focused, open-source metasearch engine that aggregates results from multiple engines. It can be queried via direct HTTP requests or through Python.

import requests

params = {
    'q': 'best pizza in NYC',
    'format': 'json',
    'language': 'en-US'
}
response = requests.get("http://localhost:8888/search", params=params)
print(response.json())

To use SearxNG, you can either connect to public instances or deploy your own with Docker.
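A minimal self-hosted setup, assuming the official `searxng/searxng` Docker image (which serves on port 8080 inside the container), mapped to port 8888 to match the request example above:

```shell
# Run SearxNG locally, exposed on http://localhost:8888
docker run -d --name searxng \
  -p 8888:8080 \
  -e "BASE_URL=http://localhost:8888/" \
  searxng/searxng
```

Note that JSON output typically has to be enabled in the instance's `settings.yml` (adding `json` under `search: formats:`) before the `format=json` request above will succeed.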

Tavily

Tavily offers fast web search with simple API access.

from tavily import TavilyClient

client = TavilyClient("tvly-YOUR_API_KEY")
results = client.search("tech news", location="Berlin, Germany")
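Tavily's response is a dictionary with a `results` list, where each entry carries fields such as `title`, `url`, and `content`. A small helper to flatten it into (title, url) pairs:

```python
# Flatten a Tavily search response into (title, url) pairs.
# `response` is the dict returned by client.search(...) above.
def tavily_links(response):
    return [(r.get("title"), r.get("url")) for r in response.get("results", [])]
```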

Exa AI

Exa allows semantic and keyword-based web search with fast indexing and relevance filtering.

import requests

headers = {'Authorization': 'Bearer YOUR_API_KEY'}
data = {'query': 'AI research papers', 'location': 'San Francisco, CA'}
response = requests.post('https://api.exa.ai/v1/search', headers=headers, json=data)
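Assuming the response body contains a `results` list with `title` and `url` fields (verify the exact schema against Exa's current API reference), the hits can be collected like this:

```python
# Collect (title, url) pairs from an Exa search response payload.
# Field names here are an assumption -- check Exa's API docs.
def exa_hits(payload):
    return [(r.get("title"), r.get("url")) for r in payload.get("results", [])]
```

Call it as `exa_hits(response.json())` on the response from the request above.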

Google Programmable Search

Google Programmable Search (formerly Custom Search Engine, CSE) can be used for limited programmatic queries.

import requests

params = {
    'q': 'machine learning tutorials',
    'cx': 'YOUR_CX_ID',
    'key': 'YOUR_API_KEY',
    'gl': 'US',
    'hl': 'en'
}
response = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
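The Custom Search API nests hits under an `items` key in the JSON response; a minimal extraction sketch:

```python
# Pull title/link/snippet out of a Custom Search API response.
# `payload` is the parsed JSON (response.json()) from the request above.
def cse_items(payload):
    return [
        {"title": i.get("title"), "link": i.get("link"), "snippet": i.get("snippet")}
        for i in payload.get("items", [])
    ]
```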

Proxies and Geographic Relevance

For applications that need results from specific regions or to bypass geofencing, proxies are essential.

Using SOCKS5 Proxies in Python

SOCKS5 proxies route all traffic through a specified server, masking your origin IP and location. This is especially handy on cloud servers, whose IP ranges are well known and easily flagged as non-human traffic.

Install PySocks:

pip install pysocks

import requests

proxies = {
    'http': 'socks5h://user:pass@proxy.example.com:1080',
    'https': 'socks5h://user:pass@proxy.example.com:1080'
}
response = requests.get("http://example.com", proxies=proxies)
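Since the same proxy URL is repeated for both schemes, a small helper keeps the credentials in one place (hostname and credentials below are placeholders; the `socks5h://` scheme also resolves DNS on the proxy side, which matters for geo-targeting):

```python
# Build a requests-compatible proxies dict for a SOCKS5 proxy.
# socks5h:// routes DNS resolution through the proxy as well.
def socks5_proxies(host, port, user=None, password=None):
    auth = f"{user}:{password}@" if user and password else ""
    url = f"socks5h://{auth}{host}:{port}"
    return {"http": url, "https": url}
```

Usage: `requests.get("http://example.com", proxies=socks5_proxies("proxy.example.com", 1080, "user", "pass"))`.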

Webshare.io

Webshare provides a pool of proxies for various regions.

proxies = {
    'http': 'http://username:password@proxy.webshare.io:80',
    'https': 'http://username:password@proxy.webshare.io:80'
}
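Proxy providers like Webshare typically hand out a pool of endpoints rather than a single one; a simple rotation pattern cycles requests across them (the hostnames below are placeholders, not real Webshare endpoints):

```python
from itertools import cycle

# A pool of HTTP proxies to rotate through, one per request.
proxy_pool = cycle([
    "http://username:password@proxy1.example.com:80",
    "http://username:password@proxy2.example.com:80",
])

# Return a requests-compatible proxies dict using the next proxy in the pool.
def next_proxies(pool):
    url = next(pool)
    return {"http": url, "https": url}
```

Each call to `next_proxies(proxy_pool)` advances to the next endpoint, wrapping around when the list is exhausted.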

Cloudflare WARP

Cloudflare WARP encrypts traffic and routes it through Cloudflare's network, though it offers only limited control over exit regions. Connect with:

warp-cli connect

Proxy Services

| Service | Type | Authenticated | Location Control |
| --- | --- | --- | --- |
| Webshare | HTTP(S) | Yes | Yes |
| Cloudflare WARP | VPN/CLI | No | Limited |
| Bright Data | Residential / Mobile | Yes | Yes |
| Oxylabs | Datacenter / Rotating | Yes | Yes |

Building a LangChain Search Pipeline

Integrating search results into a conversational AI can enhance user interactions. We'll now build a simple LangChain-based application that searches the web, converts HTML to Markdown, and uses OpenAI to summarize the result.

  1. Search with SerpApi
  2. Fetch the HTML content from the search results
  3. Convert the HTML to Markdown
  4. Generate a chat response with OpenAI

Example Code (Streamlit UI)

# pip install streamlit langchain langchain_community langchain_openai markdownify python-dotenv
# Requires two environment variables in a .env file: OPENAI_API_KEY & SERPAPI_API_KEY
import streamlit as st
from langchain_community.utilities import SerpAPIWrapper
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import MarkdownifyTransformer
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()

NUM_WEB_RESULTS = 3
CHUNKS_PER_DOCUMENT = 2

def main():
    st.title("AI-Powered Search Chatbot")
    query = st.text_input("Enter your query:")
    if not query:
        return
    try:
        # 1. Search with SerpApi and keep the top organic results
        results = SerpAPIWrapper().results(query).get("organic_results", [])[:NUM_WEB_RESULTS]
        urls = [r["link"] for r in results if r.get("link")]
        if not urls:
            st.warning("No results found.")
            return
        # 2 & 3. Fetch the pages and convert the HTML to Markdown
        docs = MarkdownifyTransformer().transform_documents(AsyncHtmlLoader(urls).load())
        # Split each document and keep only the first few chunks to stay within context limits
        splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
        context = "\n\n---\n\n".join(
            chunk for d in docs for chunk in splitter.split_text(d.page_content)[:CHUNKS_PER_DOCUMENT]
        )
        # 4. Generate a chat response with OpenAI
        answer = ChatOpenAI(model="gpt-3.5-turbo").invoke(
            f"Based on the following web search results, provide a comprehensive answer "
            f"to the question: '{query}'\n\nSearch Results:\n{context}\n\n"
            f"Please provide a balanced, factual summary based on the information above."
        ).content
        st.write(answer)
    except Exception as e:
        st.error(f"Error: {e}")

if __name__ == "__main__":
    main()

This minimal Streamlit application performs a live search, processes the content, and responds conversationally, all in a few dozen lines of code.


This guide should help you build customized, geo-aware search pipelines for intelligent applications. Whether you're building a research assistant or a domain-specific chatbot, these tools provide the foundation for high-quality results and meaningful user interactions.