Website Ingestion Guide
This document provides a step-by-step guide on how to ingest a website into the ai12z AI platform. The process respects robots.txt
directives and utilizes sitemap data to determine the URLs to ingest. This is the most efficient way to import your website content. Alternatively, you can upload a JSON file of URLs if you only want specific pages of a website to be ingested.
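As an illustration of the alternative upload path, a URL list can be assembled as JSON before uploading. The exact schema the ai12z upload dialog expects is not specified here, so the shape below (a plain list of URL strings) is an assumption:

```python
import json

# Hypothetical URL-list file for the JSON upload path; the exact
# schema expected by the ai12z platform may differ.
urls = [
    "https://www.example.com/en/us/products",
    "https://www.example.com/en/us/support",
]

payload = json.dumps(urls, indent=2)
print(payload)
```

Only the listed URLs would then be ingested, rather than everything discovered through the sitemap.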
Step 1: Initiating Ingestion
To start the website ingestion process:
- Click on 'Add Website'.
- Fill in the dialog box with the required metadata about the site.
Ensure the following details are provided:
- Name: A recognizable name for the website within your project.
- URL: The base address of the website you wish to ingest.
- Description: A brief description of the website's content.
- Include Patterns: Specify URL patterns you want to include. Use forward slashes to denote the URL structure (e.g., /en/us/ to include only English US pages).
- Filter By Language: For example, if you select English, only English content will be ingested.
- Exclude Patterns: Specify URL patterns you want to exclude (e.g., /pages/ to exclude URLs containing this pattern).
- Advanced Settings: Request type should be set to Auto, and Force Crawl should be unchecked. The system automatically applies the appropriate settings. There are scenarios where these settings are needed, for example if part of a site uses an older version of React.
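Based on the description above, include and exclude patterns appear to work by simple substring matching against each URL, with no wildcard support. A minimal sketch of that behavior (not the platform's actual implementation):

```python
def url_passes(url, include=None, exclude=None):
    """Return True if the URL contains at least one include pattern
    (when include patterns are given) and none of the exclude patterns.
    Matching is plain substring matching; wildcards are not supported."""
    if include and not any(pattern in url for pattern in include):
        return False
    if exclude and any(pattern in url for pattern in exclude):
        return False
    return True

# /en/us/ pages pass; other languages and /pages/ URLs are filtered out.
print(url_passes("https://example.com/en/us/about", include=["/en/us/"]))
print(url_passes("https://example.com/pages/old", exclude=["/pages/"]))
```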
Step 2: Robots.txt
The platform will attempt to read the robots.txt file, which may reference one or more sitemaps. If a crawl-delay is specified, the platform will adhere to it; otherwise, a default crawl-delay is used to avoid impacting website operations.
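For reference, the directives the platform honors can be read with Python's standard library. This sketch parses a hypothetical robots.txt to show where the crawl-delay and sitemap entries come from:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, for illustration only.
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*")      # crawl-delay the crawler should honor
sitemaps = rp.site_maps()        # sitemap URLs listed in robots.txt
allowed = rp.can_fetch("*", "https://www.example.com/private/page")
```

Here the crawler would wait 5 seconds between requests, start from the listed sitemap, and skip anything under /private/.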
Step 3: Sitemap Extraction
The platform will process the sitemaps, which could be multiple for large sites.
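Sitemaps are standard XML files listing page URLs. As an aside, extracting URLs from one looks roughly like this (the sample sitemap below is hypothetical):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap for illustration.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/en/us/</loc></url>
  <url><loc>https://www.example.com/en/us/products</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
page_urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
```

Large sites often split their sitemaps into a sitemap index pointing at several files, which is why extraction can take a while.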
A progress indicator will display during sitemap extraction. After about a minute, a dialog appears to notify you that this process could take a while. You can close your browser and log in later. An email notification will be sent; when you receive it, log in, open the document action menu, and click Continue Ingestion.
If you close the dialog and the system has not completed the sitemap extraction, you can always check the status by looking at the action bar for information or status.
Step 4: Reviewing the URL Histogram
After the initial collection of URLs, a histogram will be presented for review. If you closed the dialog, you can return to the action menu and select Continue Ingestion.
You will receive an email reminder to take action and apply filters.
Users should apply filters based on the histogram data:
- Include Filters: Add one filter per line, with a slash around it to match the URL pattern. There are no wildcards such as *.
- Exclude Filters: Language filters are useful for exclusion; for instance, you might keep /en/ or /en-us/ while excluding others.
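The histogram groups the collected URLs by path segment so you can judge which patterns are worth including or excluding. A rough sketch of that idea (the platform's actual grouping may differ):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URLs collected during sitemap extraction.
urls = [
    "https://www.example.com/en/us/products",
    "https://www.example.com/en/us/support",
    "https://www.example.com/fr/fr/produits",
    "https://www.example.com/pages/legacy",
]

# Count URLs by their first path segment, e.g. "/en/".
histogram = Counter(
    "/" + urlparse(u).path.strip("/").split("/")[0] + "/" for u in urls
)
```

A tally like this makes it easy to spot, say, that /en/ holds most of the content while /pages/ is a candidate for exclusion.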
Step 5: Finalizing Ingestion
Click Next.
For large websites, the ingestion could take several hours. The platform adheres to the crawl delay set in the robots.txt file or sets its own delay to prevent affecting website operations.
You will be notified by email when the website ingestion is complete.
Additional Information
- If no sitemap is available, the platform will perform a broad crawl of the website. This still produces a histogram for filtering what is ingested.
- Including or excluding URL patterns streamlines the ingestion process by focusing on the most relevant URLs.
Usage Tips
- Use 'Include Patterns' and 'Exclude Patterns' to tailor the content being ingested.
- Patterns work by matching strings in the URL. For example, adding /news/ to the exclude field means any URL containing this string will not be included.
If you don't find the notification in your inbox, be sure to check your junk or spam folder as well. Add noreply@ai12z.com to your email contact list.