Collections through website
The website data source feature lets you collect and store content from websites, which can then be used to answer your customers' questions. This guide explains how to use the Website Ingestion feature, focusing on its configuration settings and giving examples of each.
Configuration
The following parameters configure the ingestion process:
Seed URLs
The seed URLs are the starting points for your crawl. They are the bases from which the crawler will begin exploring links. The crawler only visits URLs that match the seed URLs or their sub-directories.
If a seed URL is https://example.com/, the crawler explores this page and all the subpages it links to — for example https://example.com/blog/, https://example.com/about-us/ and so on.
Excluded URLs
The excluded URLs are the ones you want the crawler to ignore. You can specify them in the same way as seed URLs. The crawler does not visit the URLs that match the excluded URLs or their sub-directories.
If an excluded URL is https://example.com/blog/, the crawler does not explore this page or any pages under this directory.
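The seed and excluded URL rules above can be illustrated with a small sketch. This is not Moveo's implementation: the function names `_under` and `should_visit` are hypothetical, and the sketch assumes that matching is done by scheme, host, and path prefix, and that an exclusion takes precedence over a seed.

```python
from urllib.parse import urlsplit


def _under(url: str, base: str) -> bool:
    """Return True if url is base itself or lies in one of its sub-directories."""
    u, b = urlsplit(url), urlsplit(base)
    # Assumption: scheme and host must match exactly (host compared case-insensitively).
    if (u.scheme, u.netloc.lower()) != (b.scheme, b.netloc.lower()):
        return False
    # Normalise both paths to end with "/" so "/blog" does not match "/blogging".
    base_path = b.path if b.path.endswith("/") else b.path + "/"
    url_path = u.path if u.path.endswith("/") else u.path + "/"
    return url_path.startswith(base_path)


def should_visit(url: str, seed_urls: list[str], excluded_urls: list[str]) -> bool:
    """Decide whether a URL is in scope for the crawl."""
    # Assumption: exclusions win over seeds.
    if any(_under(url, excluded) for excluded in excluded_urls):
        return False
    return any(_under(url, seed) for seed in seed_urls)


seeds = ["https://example.com/"]
excluded = ["https://example.com/blog/"]
print(should_visit("https://example.com/about-us/", seeds, excluded))
print(should_visit("https://example.com/blog/post-1", seeds, excluded))
```

With the seed and excluded URLs from the examples above, the first call is in scope and the second is not, because it falls under the excluded `/blog/` directory.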
Sitemap URLs
The sitemap URLs are the URLs of sitemaps from which the crawler can fetch a list of URLs to visit — for example, https://example.com/sitemap.xml
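As an illustration of what a crawler can read from a sitemap, the sketch below extracts the page URLs from a sitemap document using the standard sitemaps.org XML namespace. The function name `urls_from_sitemap` and the sample document are hypothetical, and the sketch does not handle sitemap index files or describe how Moveo processes sitemaps internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined so the example needs no network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about-us/</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value listed under <url> entries in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(urls_from_sitemap(SAMPLE_SITEMAP))
```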
Excluded assets
The crawler ignores assets such as images, CSS files, JavaScript files and PDFs. The complete list can be found below:
- PNG
- JPG
- JPEG
- GIF
- CSS
- JS
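The extension list above can be expressed as a simple filter. This is only a sketch: the name `is_excluded_asset` is hypothetical, and it assumes that assets are recognised by the file extension of the URL path, ignoring case and any query string.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extensions taken from the list above.
EXCLUDED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".css", ".js"}


def is_excluded_asset(url: str) -> bool:
    """Return True if the URL path ends with one of the excluded extensions."""
    # urlsplit drops the query string, so "app.js?v=3" is still detected.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in EXCLUDED_EXTENSIONS


print(is_excluded_asset("https://example.com/static/logo.PNG"))
print(is_excluded_asset("https://example.com/about-us/"))
```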
Moveo crawls websites every 24 hours.
Ingestion report
The Ingestion Report allows you to review the results of your website's data collection process. By clicking on it, you can access detailed insights about how your site was ingested, including the duration of the process and any issues encountered, such as problematic URL fields.
Troubleshooting
In some cases, security services such as Cloudflare may block the Moveo crawler from accessing a website. To resolve this, add the Moveo crawler to your Verified Bots list. In Cloudflare, navigate to Manage Account > Configurations > Verified Bots.
On this page, enter the following details:
- Bot name: Moveo.AI
- User-Agents Match Pattern: collection-indexer
- Bot Crawler Category: AI Crawler
- Verification Method: IP List
- Validation Instructions: 18.192.167.150,18.198.233.220,3.66.239.254
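If you manage your own allowlist rather than relying on Cloudflare, the same details can be checked in code. The sketch below is an assumption-laden illustration: the function name `looks_like_moveo_crawler` is hypothetical, and it assumes the User-Agent pattern is matched as a plain substring, which may differ from how Cloudflare evaluates it.

```python
import ipaddress

# Values taken from the verified bot details above.
MOVEO_USER_AGENT_PATTERN = "collection-indexer"
MOVEO_CRAWLER_IPS = {
    ipaddress.ip_address(ip)
    for ip in ("18.192.167.150", "18.198.233.220", "3.66.239.254")
}


def looks_like_moveo_crawler(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request matches both the IP list and the User-Agent pattern."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed IP strings never match.
        return False
    # Assumption: the pattern is a simple substring of the User-Agent header.
    return address in MOVEO_CRAWLER_IPS and MOVEO_USER_AGENT_PATTERN in user_agent


print(looks_like_moveo_crawler("3.66.239.254", "collection-indexer"))
print(looks_like_moveo_crawler("203.0.113.7", "collection-indexer"))
```

Checking both the IP address and the User-Agent avoids trusting a request that merely copies the User-Agent string.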