Website Ingestion Guide
This document provides a step-by-step guide on how to ingest a website into the ai12z AI platform. The process respects robots.txt
directives and utilizes sitemap data to determine the URLs to ingest. This is the most efficient way to import your website content. The alternative is to upload a JSON file of URLs, which is useful when you only want specific URLs of a website to be ingested.
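If you go the JSON route, the idea is simply to list the exact pages you want. The snippet below is a hypothetical sketch only: the file name, the flat-array layout, and the example URLs are assumptions, not the platform's documented schema.

```python
import json

# Hypothetical example only: the exact schema the platform expects may differ.
urls = [
    "https://www.example.com/products/overview",
    "https://www.example.com/support/faq",
    "https://www.example.com/docs/getting-started.pdf",
]

with open("urls_to_ingest.json", "w") as f:
    json.dump(urls, f, indent=2)
```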
Step 1: Initiating Ingestion
To start the website ingestion process:
- Click on 'Add Website'.
- Fill in the dialog box with the required metadata about the site.
Ensure the following details are provided:
- Name: A recognizable name for the website within your project.
- URL: The base address of the website you wish to ingest.
- Description: A brief description of the website's content.
- Include Patterns: Specify URL patterns you want to include. Use forward slashes to denote the URL structure (e.g., /corp/ means only pages whose URL contains /corp/ are ingested).
- Ingest PDF: If checked, any PDF found in the sitemap or during crawling will be ingested along with the website content.
- Filter By Language: For example, if you select English, only English content will be ingested. It is best practice to include only one language when the other languages are just translated versions of the same content.
- Exclude Patterns: Specify URL patterns you want to exclude (e.g., /pages/ to exclude URLs containing this pattern).
- Advanced Settings: The default settings are appropriate for most sites.
- Authentication Type (a sketch of what each mode means for the HTTP requests follows this list):
  1. None: the default; most of the time you are ingesting content from a public site.
  2. Basic: user name and password.
  3. Token: token authentication.
- Request Type: The default choice is Auto, which detects the mode to use automatically. You can force Sync or Async. Async is meant for sites where JavaScript loads the content after the page loads. Sync is the most common and fastest mode for ingestion.
- Force Crawl: Ignores the sitemap and crawls the site directly. Use this, for example, when the sitemap is not up to date.
- ai12z user-agent: The ai12z User Agent is ai12zCopilot/1.0. Check this box if you want requests sent with this User Agent; otherwise a standard browser User Agent is used.
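As a rough illustration of what the three authentication types typically mean for the requests a crawler sends, here is a sketch using the Python requests library. The URL, header, and credential values are placeholders, and the exact header your site expects for token authentication may differ.

```python
import requests

url = "https://www.example.com/corp/about"
headers = {"User-Agent": "ai12zCopilot/1.0"}  # sent only if the ai12z user-agent box is checked

# 1. None: a plain request, as for any public site.
requests.get(url, headers=headers)

# 2. Basic: user name and password via HTTP Basic authentication.
requests.get(url, headers=headers, auth=("crawl_user", "crawl_password"))

# 3. Token: a bearer-style token header (an assumption; use whatever
#    header your site expects for token authentication).
requests.get(url, headers={**headers, "Authorization": "Bearer <your-token>"})
```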
Step 2: Robots.txt
The platform will attempt to read the robots.txt
file, which may include the initial sitemap or multiple sitemaps. If a crawl-delay is specified, the platform will adhere to it; otherwise, a default crawl-delay is used to avoid impacting website operations.
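To make this concrete, here is a minimal sketch of how robots.txt directives can be read with Python's standard library. It illustrates the directives the platform honors (sitemap locations and crawl-delay); it is not the platform's implementation, and the fallback delay shown is an arbitrary placeholder.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Sitemap URLs declared in robots.txt (returns None if there are none).
print(rp.site_maps())

# Crawl-delay for a given user agent; fall back to a polite default
# (the platform's actual default delay is not documented here).
delay = rp.crawl_delay("ai12zCopilot") or 1.0
print(f"Waiting {delay} seconds between requests")
```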
Step 3: Sitemap Extraction
The platform will process the sitemaps, which could be multiple for large sites.
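For reference, the sketch below shows one way page URLs can be pulled out of a sitemap or sitemap index using only the Python standard library. The domain and sitemap path are placeholders, and large, gzipped, or paginated sitemaps would need more handling than shown here.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return page URLs from a sitemap, following sitemap-index files recursively."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    urls = []
    # A <sitemapindex> points at further sitemaps; a <urlset> lists pages.
    for sm in root.findall("sm:sitemap/sm:loc", NS):
        urls.extend(sitemap_urls(sm.text.strip()))
    for loc in root.findall("sm:url/sm:loc", NS):
        urls.append(loc.text.strip())
    return urls

print(len(sitemap_urls("https://www.example.com/sitemap.xml")))
```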
A progress indicator is displayed during sitemap extraction. After about a minute, a dialog appears to notify you that this process could take a while. You can close your browser and log in later. An email notification will be sent; log in, go to the document action menu, and click Continue Ingestion.
If you close the dialog before the system has completed the sitemap extraction, you can always check progress and status in the action bar.
Step 4: Reviewing the URL Histogram
After the initial collection of URLs, a histogram will be presented for review. If you closed the dialog, you can return to the action menu and select Continue Ingestion.
You will receive an email reminder to take action and apply filters.
Users should apply filters based on the histogram data:
- Include Filters: Add one filter per line, with slashes around it to match the URL pattern. There are no wildcards such as *; matching is a plain substring check (see the sketch after this list).
- Exclude Filters: Language filters are useful for exclusion; for instance, you might keep /en/ or /en-us/ while excluding the other languages.
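As an illustration of the substring matching described above, here is a small sketch in Python. The URLs, patterns, and path-segment histogram are all made up; the platform's actual grouping and matching logic may differ in detail.

```python
from urllib.parse import urlparse
from collections import Counter

urls = [
    "https://www.example.com/en/corp/about",
    "https://www.example.com/en/news/2024",
    "https://www.example.com/fr/corp/about",
]

include = ["/corp/"]           # one pattern per line in the dialog
exclude = ["/fr/", "/news/"]

def keep(url):
    # Plain substring matching: no wildcards such as *.
    if include and not any(p in url for p in include):
        return False
    return not any(p in url for p in exclude)

kept = [u for u in urls if keep(u)]
print(kept)  # ['https://www.example.com/en/corp/about']

# A simple histogram of first path segments, similar in spirit to the
# review screen (the platform's grouping may differ).
hist = Counter("/" + urlparse(u).path.split("/")[1] + "/" for u in urls)
print(hist)
```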
Step 5: Finalizing Ingestion
Click Next.
For large websites, the ingestion could take several hours. The platform adheres to the crawl delay set in the robots.txt
file or sets its own delay to prevent affecting website operations.
You will be notified by email when the website ingestion is complete.
Additional Information
- If no sitemap is available, the platform performs a broad crawl of the website and then builds a histogram so you can filter what is ingested (a toy sketch of a broad crawl follows this list).
- Including or excluding URL patterns streamlines the ingestion process by focusing on the most relevant URLs.
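For intuition only, here is a toy breadth-first crawl written with the Python standard library: it starts at one page, follows same-site links, and collects the URLs it finds. Every name, limit, and delay here is illustrative; the platform's crawler is far more capable than this sketch.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def broad_crawl(start_url, max_pages=50, delay=1.0):
    """Breadth-first crawl of a single site, returning the URLs it visited."""
    site = urlparse(start_url).netloc
    seen, queue, found = {start_url}, deque([start_url]), []
    while queue and len(found) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # be polite between requests
    return found
```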
Usage Tips
- Use 'Include Patterns' and 'Exclude Patterns' to tailor the content being ingested.
- Patterns work by matching strings in the URL. For example, adding /news/ to the exclude field means any URL containing this string will not be included.
If you don't find the notification in your inbox, be sure to check your junk or spam folder as well.
Add noreply@ai12z.com to your email contact list.