Website Ingestion Guide

This document provides a step-by-step guide on how to ingest a website into the ai12z AI platform. The process respects robots.txt directives and uses sitemap data to determine the URLs to ingest, which is the most efficient way to import your website content. The alternative is to upload a JSON file of URLs, useful when you only want specific URLs of a website to be ingested.
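
For the JSON-upload route, this guide does not cover the exact schema ai12z expects, so treat the following as a hypothetical sketch of assembling a flat list of URL strings; check the platform documentation for the required format.

```python
import json

# Hypothetical schema -- confirm the exact format the ai12z JSON
# upload expects before using this in practice.
urls = [
    "https://www.example.com/en/us/products",
    "https://www.example.com/en/us/support",
]

with open("urls.json", "w", encoding="utf-8") as f:
    json.dump(urls, f, indent=2)
```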

Step 1: Initiating Ingestion

To start the website ingestion process:

  1. Click on 'Add Website'.

  2. Fill in the dialog box with the required metadata about the site.

    [Image: the 'Sitemap Extraction' dialog, with required 'Name*' and 'URL*' fields, an optional 'Description' field, a 'Filter By Language' checkbox, 'Include Patterns' and 'Exclude Patterns' fields, an expandable 'Advance Settings' section, and 'Cancel'/'Next' buttons.]

    Ensure the following details are provided:

    • Name: A recognizable name for the website within your project.
    • URL: The base address of the website you wish to ingest.
    • Description: A brief description of the website's content.
    • Include Patterns: Specify URL patterns you want to include. Use forward slashes to denote the URL structure (e.g., /en/us/ to include only English US pages).
    • Filter By Language: If checked, only content in the selected language is ingested (for example, selecting English ingests only English content).
    • Exclude Patterns: Specify URL patterns you want to exclude (e.g., /pages/ to exclude URLs containing this pattern).
    • Advance Settings: Request type should be set to Auto and Force Crawl left unchecked; the system automatically applies the appropriate settings. There are scenarios where changing these helps, for example if part of the site is built with an older version of React.

    [Image: the 'Advance Settings' section, showing a 'Request Type' dropdown set to 'Auto' and an unchecked 'Force Crawl' checkbox.]

Step 2: Robots.txt

The platform will attempt to read the robots.txt file, which may reference one or more sitemaps. If a crawl-delay is specified, the platform adheres to it; otherwise, a default crawl-delay is used to avoid impacting website operations.

[Image: a 'Sitemap Extraction' notification window indicating that extraction of the robots.txt file is in progress.]
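
For reference, you can inspect the same robots.txt information yourself with Python's standard-library robotparser. A minimal sketch, with example.com standing in for your site (note that site_maps() requires Python 3.8+):

```python
from urllib import robotparser

# Stand-in URL -- point this at your own site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

print("Crawl-delay:", rp.crawl_delay("*"))  # None when robots.txt sets no delay
print("Sitemaps:", rp.site_maps())          # list of sitemap URLs, or None
```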

Step 3: Sitemap Extraction

The platform will process the sitemaps; large sites may have several.

[Image: a 'Sitemap Extraction' status window showing that 2 sitemaps have been ingested, with 8 more left to ingest.]
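
Under the hood, a sitemap is an XML file listing page URLs, and large sites often publish a sitemap index that points to further sitemaps. As a rough illustration of what the platform is walking through (not its actual implementation), this sketch recursively collects page URLs from a sitemap or sitemap index:

```python
import urllib.request
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_page_urls(sitemap_url):
    """Return page URLs, recursing into child sitemaps if this is an index."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    if root.tag == SM + "sitemapindex":
        urls = []
        for loc in root.iter(SM + "loc"):
            urls.extend(collect_page_urls(loc.text.strip()))
        return urls
    return [loc.text.strip() for loc in root.iter(SM + "loc")]

# Example (stand-in URL):
# pages = collect_page_urls("https://www.example.com/sitemap.xml")
```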

A progress indicator will display during sitemap extraction. After about a minute, a dialog appears to let you know that this process could take a while; you can close your browser and log in later. An email notification will be sent asking you to log in, open the document's action menu, and click Continue Ingestion.

[Image: a 'Sitemap Extraction' confirmation with a success checkmark, stating that the request has been received and the crawler is actively gathering information, a process that typically takes a few hours, with a notification to follow once crawling is complete. Buttons: 'Come back later' and 'Current status'.]

If you close the dialog before the system has completed the sitemap extraction, you can always check the status via the action menu's Info or Status options.

[Image: the 'Action' column's vertical-ellipsis dropdown menu, open and showing three options: 'Info', 'Status', and 'Delete'.]

Step 4: Reviewing the URL Histogram

After the initial collection of URLs, a histogram will be presented for review. If you closed the dialog, you can return to the action menu and select Continue Ingestion.

[Image: the 'Action' menu, whose options change with the process state; here it shows 'Info' and 'Continue Ingest'.]

You will receive an email reminder to take action and apply filters.

[Image: the 'Website Ingestion' dialog with a 'URL Histogram' section listing paths such as '/blogs/n' and '/calendar/event' with page counts in parentheses, 'Include Patterns' and 'Exclude Patterns' input fields, and 'Back'/'Next' buttons.]

Users should apply filters based on the histogram data:

  • Include Filters: Add one filter per line, with slashes around it to match the URL pattern. There are no wildcards such as *.
  • Exclude Filters: Language filters are useful for exclusion; for instance, you might keep /en/ or /en-us/ pages while excluding other languages. (A sketch of how the histogram groups URLs follows below.)
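
To make this concrete, the histogram is essentially a count of pages grouped by their leading path segments. A minimal sketch of the idea (hypothetical code, reusing the /blogs/n and /calendar/event paths from the image above):

```python
from collections import Counter
from urllib.parse import urlparse

def url_histogram(urls, depth=2):
    """Group URLs by their first path segments and count pages per group."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        counts["/" + "/".join(segments[:depth])] += 1
    return counts

urls = [
    "https://www.example.com/blogs/n/post-1",
    "https://www.example.com/blogs/n/post-2",
    "https://www.example.com/calendar/event/open-day",
]
for path, n in url_histogram(urls).most_common():
    print(f"{path} ({n})")  # /blogs/n (2), then /calendar/event (1)
```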

Step 5: Finalizing Ingestion

Click Next.

For large websites, the ingestion could take several hours. The platform adheres to the crawl delay set in the robots.txt file or sets its own delay to prevent affecting website operations.

You will be notified by email when the website ingestion is complete.

Additional Information

  • If no sitemap is available, the platform performs a broad crawl of the website, which still produces a histogram for filtering what is ingested.
  • Including or excluding URL patterns streamlines the ingestion process by focusing on the most relevant URLs.

Usage Tips

  • Use 'Include Patterns' and 'Exclude Patterns' to tailor the content being ingested.
  • Patterns work by matching plain substrings in the URL, as illustrated below. For example, adding /news/ to the exclude field means any URL containing this string will not be ingested.
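
As a quick illustration of that substring matching (a hypothetical sketch, not the platform's code):

```python
exclude = ["/news/"]

urls = [
    "https://www.example.com/news/2024/launch",
    "https://www.example.com/products/widget",
]

# A URL is dropped if it contains any exclude pattern as a plain substring.
kept = [u for u in urls if not any(p in u for p in exclude)]
print(kept)  # ['https://www.example.com/products/widget']
```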

Check Your Email

If you don't find the notification in your inbox, be sure to check your junk or spam folder as well. Add noreply@ai12z.com to your email contact list.