# Web Crawler

> Get answers from websites

export const IndexedSearch = () => <Tooltip tip="Indexed search uses advanced semantic, hybrid, and graph search methods to find the most relevant results. It's very fast and accurate, but requires upfront indexing of data.">
        <span className="text-sky-600">
            <Icon icon="database" size={14} color="#0084d1" iconType="solid" /> Indexed Search
        </span>
    </Tooltip>;

export const LiveSearch = () => <Tooltip tip="Live search uses AI to search your integrations' native APIs in real time. This is slower than indexed search and often less accurate, but it does not store any data on Hyperspell's servers and is always up to date.">
        <span className="text-emerald-600">
            <Icon icon="bolt" size={14} color="#009966" iconType="solid" /> Live Search</span>
    </Tooltip>;

## Overview

The Web Crawler integration lets you extract information from websites. Although the integration supports both <LiveSearch /> and <IndexedSearch />, we strongly recommend indexing a website before you query it: larger websites can take several minutes to index, and live search is limited to the front page.

## Authentication

The Web Crawler integration does not require any authentication and is available to all users.

## Search Options

The following options are available for the `options` parameter of `web_crawler`:

<ParamField body="url" type="string" required={true}>
  The URL of the website to crawl. Trailing slashes are ignored.
</ParamField>

<ParamField body="max_depth" type="number" required={false} default={0}>
  The maximum depth of the website to crawl. 0 means only the root page will be queried, 1 means the root page and all pages linked from it will be queried, and so on.
</ParamField>
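
As a minimal sketch, the two options above can be assembled and checked in Python before they are sent. The helper name `web_crawler_options` is hypothetical, and nesting the options under a `web_crawler` key is an assumption about the request shape, not something this page confirms.

```python
from typing import Any


def web_crawler_options(url: str, max_depth: int = 0) -> dict[str, Any]:
    """Build the `web_crawler` options documented above (hypothetical helper)."""
    if not url:
        raise ValueError("url is required")
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    # max_depth=0 covers only the root page; 1 adds the pages linked from it.
    return {"url": url, "max_depth": max_depth}


# Assumption: options are keyed by integration name inside `options`.
request_fragment = {
    "options": {
        "web_crawler": web_crawler_options("https://example.com", max_depth=1),
    },
}
```

Leaving `max_depth` out keeps the documented default of 0, which restricts the search to the root page.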

## Resources

The Web Crawler integration returns `Website` resources:

<Expandable title="Website Schema">
  <ResponseField name="resource_id" type="string" required>
    The unique identifier of the website, typically the URL.
  </ResponseField>

  <ResponseField name="source" type="string" required>
    The provider that fetched the website, which is `web_crawler` if this integration is used to query or index it.
  </ResponseField>

  <ResponseField name="url" type="string" required>
    The URL of the website.
  </ResponseField>

  <ResponseField name="title" type="string">
    The title of the website, extracted from the HTML title tag.
  </ResponseField>

  <ResponseField name="description" type="string">
    The description of the website, extracted from the HTML meta description tag or Open Graph descriptions.
  </ResponseField>

  <ResponseField name="image_url" type="string" required={false}>
    The image URL of the website, extracted from HTML meta tags or Open Graph images.
  </ResponseField>

  <ResponseField name="language" type="string" required={false}>
    The language of the website, extracted from the HTML meta language tag.
  </ResponseField>

  <ResponseField name="favicon" type="string" required={false}>
    The favicon URL of the website, extracted from the HTML head.
  </ResponseField>

  <ResponseField name="summary" type="text" required={false}>
    A summary of the document that can be fed directly into LLMs. When retrieved from the `/query` endpoint, this may summarize only the sections of the document returned as highlights (i.e., the parts relevant to your query). Otherwise, it summarizes the entire document.
  </ResponseField>

  <ResponseField name="data" type="elements[]" required={false}>
    A structured representation of the website's content. This field is only returned if the resource is returned from the `/documents/get` endpoint. If the website is returned from the `/query` endpoint, this field will be empty and the `highlights` field will contain the relevant sections of the website.
  </ResponseField>

  <ResponseField name="highlights" type="highlight[]" required={false}>
    A list of highlights from the website relevant to the query. This field is only returned if the resource is returned from the `/query` endpoint.
  </ResponseField>
</Expandable>
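
Because `data` and `highlights` are populated by different endpoints, client code usually has to check which one is present. The sketch below assumes the resource has already been parsed into a plain Python dict; the function name `select_content` and the sample values are hypothetical, and the items inside `highlights` and `data` are treated as opaque because their structure is not described on this page.

```python
from typing import Any


def select_content(website: dict[str, Any]) -> dict[str, Any]:
    """Pick whichever content field is populated on a Website resource."""
    highlights = website.get("highlights") or []
    data = website.get("data") or []
    if highlights:
        # Populated when the resource comes from the /query endpoint.
        origin, content = "highlights", highlights
    elif data:
        # Populated when the resource comes from the /documents/get endpoint.
        origin, content = "data", data
    else:
        origin, content = "none", []
    return {
        "resource_id": website["resource_id"],
        "url": website["url"],
        "title": website.get("title"),      # optional field
        "summary": website.get("summary"),  # optional field
        "origin": origin,
        "content": content,
    }


# Hypothetical resource, shaped like a /query result (empty `data`).
sample = {
    "resource_id": "https://example.com",
    "source": "web_crawler",
    "url": "https://example.com",
    "title": "Example Domain",
    "data": [],
    "highlights": ["<opaque highlight item>"],
}
selected = select_content(sample)
```

Only `resource_id`, `source`, and `url` are marked as required in the schema, so every other field is read with `.get()`.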

## Additional Endpoints

### Index a website before querying it

`GET /integrations/web_crawler/index`

Call this endpoint to index a website for indexed search. The website will be crawled recursively and added to the search index.

<Expandable title="parameters">
  <ParamField query="url" type="string" required={true}>
    The URL of the website to crawl.
  </ParamField>

  <ParamField query="max_depth" type="number" required={false} default={2}>
    The maximum depth of the website to crawl. 0 means only the root page will be crawled, 1 means the root page and all pages linked from it will be crawled, and so on.
  </ParamField>

  <ParamField query="max_pages" type="number" required={false} default={50}>
    The maximum number of pages to crawl.
  </ParamField>
</Expandable>
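
The query string for this endpoint can be sketched with the Python standard library. The base URL below is a placeholder, not the real API host, and the helper name `index_request_url` is hypothetical. The example only builds the URL; it does not send the request or attach any API credentials.

```python
from urllib.parse import urlencode

# Placeholder host (assumption); substitute your actual API base URL.
BASE_URL = "https://api.example.invalid"


def index_request_url(
    url: str,
    max_depth: int = 2,
    max_pages: int = 50,
    base_url: str = BASE_URL,
) -> str:
    """Build the GET URL for the index endpoint from the documented parameters."""
    if not url:
        raise ValueError("url is required")
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urlencode({"url": url, "max_depth": max_depth, "max_pages": max_pages})
    return f"{base_url}/integrations/web_crawler/index?{query}"


request_url = index_request_url("https://example.com/docs", max_depth=1, max_pages=20)
```

The defaults mirror the documented values of 2 for `max_depth` and 50 for `max_pages`, so omitting both arguments requests the default crawl.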
