Collections through website

The website data source lets you gather and store content from websites. This content can then be used to answer your customers' inquiries. This guide explains how to use the Website Ingestion feature, focusing on its configuration settings and illustrating them with examples.


The following parameters configure the ingestion process:

Seed URLs

The seed URLs are the starting points for your crawl. They are the bases from which the crawler will begin exploring links. The crawler only visits URLs that match the seed URLs or their sub-directories.

For a given seed URL, the crawler explores that page and all the subpages it links to.

Excluded URLs

The excluded URLs are the ones you want the crawler to ignore. You can specify them in the same way as seed URLs. The crawler does not visit the URLs that match the excluded URLs or their sub-directories.

For a given excluded URL, the crawler does not explore that page or any pages under its directory.
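
The scoping rules for seed and excluded URLs can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Moveo's actual implementation: the helper names `is_under` and `should_visit` and the `example.com` URLs are hypothetical, and the sketch assumes that "matching a URL or its sub-directories" means sharing the same scheme and host and having a path at or below the configured one.

```python
from urllib.parse import urlsplit


def is_under(url: str, base: str) -> bool:
    """Return True if `url` is `base` itself or lies in one of its sub-directories."""
    u, b = urlsplit(url), urlsplit(base)
    # Assumption: scheme and host must match exactly.
    if (u.scheme, u.netloc) != (b.scheme, b.netloc):
        return False
    base_path = b.path.rstrip("/")
    url_path = u.path.rstrip("/")
    # Compare whole path segments so "/docs-old" does not match "/docs".
    return url_path == base_path or url_path.startswith(base_path + "/")


def should_visit(url: str, seed_urls: list[str], excluded_urls: list[str]) -> bool:
    """A URL is visited when it is under some seed URL and under no excluded URL."""
    in_scope = any(is_under(url, seed) for seed in seed_urls)
    excluded = any(is_under(url, ex) for ex in excluded_urls)
    return in_scope and not excluded


# Hypothetical configuration, used only for illustration.
seeds = ["https://example.com/docs"]
excluded = ["https://example.com/docs/archive"]

print(should_visit("https://example.com/docs/intro", seeds, excluded))
print(should_visit("https://example.com/docs/archive/2020", seeds, excluded))
```

With this configuration the first URL is in scope, while the second is skipped because it sits under an excluded directory.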

Sitemap URLs

The sitemap URLs are the URLs of sitemaps from which the crawler can fetch a list of URLs to visit.
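
To make the idea of fetching a URL list from a sitemap concrete, here is a minimal sketch that reads the `<loc>` entries of a sitemap in the standard sitemaps.org XML format. It is not a description of how Moveo's crawler parses sitemaps: the function name `urls_from_sitemap` and the sample document are hypothetical, and the sketch handles only a plain `<urlset>`, not sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the page URLs listed in a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```

In practice the XML text would be downloaded from the configured sitemap URL before being parsed.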

Excluded assets

The crawler ignores assets such as images, CSS files, JavaScript files and PDFs. The complete list can be found below:

  • PNG
  • JPG
  • JPEG
  • GIF
  • PDF
  • CSS
  • JS
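
As a rough illustration of the list above, the following sketch skips URLs whose path ends in one of the listed extensions. It assumes the assets are recognized by file extension, which the documentation does not state; the name `is_excluded_asset` and the sample URLs are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extensions taken from the list of excluded assets above.
EXCLUDED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".pdf", ".css", ".js"}


def is_excluded_asset(url: str) -> bool:
    """Return True if the URL path ends in one of the excluded extensions."""
    # Only the path is inspected, so query strings such as "?v=2" are ignored.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in EXCLUDED_EXTENSIONS


print(is_excluded_asset("https://example.com/logo.PNG?v=2"))
print(is_excluded_asset("https://example.com/pricing"))
```

The comparison is case-insensitive, so `logo.PNG` is treated the same as `logo.png`.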

Moveo crawls websites every 24 hours.

