Collections through website
The website data source feature lets you collect and store content from websites. The collected content can then be used to answer your customers' questions. This document explains how to use the Website Ingestion feature, focusing on its configuration settings, with examples.
Configuration
The following parameters configure the ingestion process:
Seed URLs
The seed URLs are the starting points for your crawl: the crawler begins at these pages and follows the links it finds there. The crawler only visits URLs that match the seed URLs or their sub-directories.
If a seed URL is https://example.com/, the crawler explores this page and all the subpages it links to, for example https://example.com/blog/, https://example.com/about-us/ and so on.
Excluded URLs
The excluded URLs are the ones you want the crawler to ignore. You can specify them in the same way as seed URLs. The crawler does not visit the URLs that match the excluded URLs or their sub-directories.
If an excluded URL is https://example.com/blog/, the crawler does not explore this page or any pages under this directory.
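The seed and exclusion rules above can be sketched as a single scope check. This is a minimal illustration, not Moveo's implementation: the function names `_matches` and `in_scope` are hypothetical, and it assumes that an exclusion takes precedence when a URL matches both a seed URL and an excluded URL.

```python
from urllib.parse import urlsplit


def _matches(url, base):
    """Return True if url equals base or sits in one of its sub-directories."""
    u, b = urlsplit(url), urlsplit(base)
    # A different scheme or host never matches.
    if (u.scheme, u.netloc) != (b.scheme, b.netloc):
        return False
    # Compare whole path segments so /blogger does not match /blog/.
    base_path = b.path if b.path.endswith("/") else b.path + "/"
    url_path = u.path if u.path.endswith("/") else u.path + "/"
    return url_path.startswith(base_path)


def in_scope(url, seed_urls, excluded_urls):
    """Assumption: exclusions win over seeds when both match."""
    if any(_matches(url, e) for e in excluded_urls):
        return False
    return any(_matches(url, s) for s in seed_urls)


seeds = ["https://example.com/"]
excluded = ["https://example.com/blog/"]
print(in_scope("https://example.com/about-us/", seeds, excluded))
print(in_scope("https://example.com/blog/post-1", seeds, excluded))
```

With the seed and excluded URLs from the examples above, the first call reports the about-us page as in scope and the second reports the blog post as out of scope.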
Sitemap URLs
The sitemap URLs are the URLs of sitemaps from which the crawler can fetch a list of URLs to visit, for example https://example.com/sitemap.xml
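To illustrate what a list of URLs fetched from a sitemap looks like, here is a small sketch that extracts the `<loc>` entries from a sitemap in the standard sitemaps.org format. The function name `urls_from_sitemap` and the sample document are hypothetical; how Moveo actually parses sitemaps is not described here.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about-us/</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


print(urls_from_sitemap(SAMPLE))
```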
Excluded assets
The crawler ignores assets such as images, CSS files, JavaScript files and PDFs. The complete list can be found below:
- PNG
- JPG
- JPEG
- GIF
- CSS
- JS
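As a rough sketch of how such a filter behaves, the check below skips any URL whose path ends in one of the listed extensions. The name `is_excluded_asset` is hypothetical, and matching on the file extension, ignoring case and query strings, is an assumption rather than documented behavior.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions taken from the list above, lower-cased.
EXCLUDED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".css", ".js"}


def is_excluded_asset(url):
    """Return True if the URL path ends in an excluded asset extension."""
    path = urlsplit(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in EXCLUDED_EXTENSIONS


print(is_excluded_asset("https://example.com/static/app.js?v=3"))
print(is_excluded_asset("https://example.com/blog/"))
```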
Note: Moveo crawls websites every 24 hours.