Website Ingestion Guide
This document provides a step-by-step guide on how to ingest a website into the ai12z AI platform. The process respects robots.txt
directives and utilizes sitemap data to determine the URLs to ingest. This is the most efficient way to import your website content. The alternative is to upload a JSON file of URLs, which is useful when you only want specific URLs of a website to be ingested.
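If you go the JSON route, the idea is simply to list the exact pages you want. The snippet below is a hypothetical sketch only: the file name, the flat-array layout, and the example URLs are assumptions, not the platform's documented schema.

```python
import json

# Hypothetical example only: the exact schema the platform expects may differ.
urls = [
    "https://www.example.com/products/overview",
    "https://www.example.com/support/faq",
    "https://www.example.com/docs/getting-started.pdf",
]

with open("urls_to_ingest.json", "w") as f:
    json.dump(urls, f, indent=2)
```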
Step 1: Initiating Ingestion
To start the website ingestion process:
- Click on 'Add Website'.
- Fill in the dialog box with the required metadata about the site.
Ensure the following details are provided:
- Name: A recognizable name for the website within your project.
- URL: The base address of the website you wish to ingest.
- Description: A brief description of the website's content.
- Include Patterns: Specify URL patterns you want to include. Use forward slashes to denote the URL structure (e.g., /corp/ means only pages whose URL contains /corp/ are ingested).
- Ingest PDF: If checked, any PDF found in the sitemap or during crawling will be ingested along with the website content.
- Filter By Language: For example, if you select English, only English content will be ingested. It is best practice to include only one language when the other languages are just translated versions of the same content.
- Exclude Patterns: Specify URL patterns you want to exclude (e.g., /pages/ to exclude URLs containing this pattern).
- Advanced Settings: The default settings are appropriate for most sites.
- Authentication Type (a sketch of what each mode means for the HTTP requests follows this list):
  1. None: the default; most of the time you are ingesting content from a public site.
  2. Basic: user name and password.
  3. Token: token authentication.
- Request Type: The default choice is Auto, which detects the mode to use automatically. You can force Sync or Async. Async is meant for sites where JavaScript loads the content after the page loads. Sync is the most common and fastest mode for ingestion.
- Force Crawl: Ignores the sitemap and crawls the site directly. Use this, for example, when the sitemap is not up to date.
- ai12z user-agent: The ai12z User Agent is ai12zCopilot/1.0. Check this box if you want requests sent with this User Agent; otherwise a standard browser User Agent is used.
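As a rough illustration of what the three authentication types typically mean for the requests a crawler sends, here is a sketch using the Python requests library. The URL, header, and credential values are placeholders, and the exact header your site expects for token authentication may differ.

```python
import requests

url = "https://www.example.com/corp/about"
headers = {"User-Agent": "ai12zCopilot/1.0"}  # sent only if the ai12z user-agent box is checked

# 1. None: a plain request, as for any public site.
requests.get(url, headers=headers)

# 2. Basic: user name and password via HTTP Basic authentication.
requests.get(url, headers=headers, auth=("crawl_user", "crawl_password"))

# 3. Token: a bearer-style token header (an assumption; use whatever
#    header your site expects for token authentication).
requests.get(url, headers={**headers, "Authorization": "Bearer <your-token>"})
```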
Step 2: Robots.txt
The platform will attempt to read the robots.txt
file, which may include the initial sitemap or multiple sitemaps. If a crawl-delay is specified, the platform will adhere to it; otherwise, a default crawl-delay is used to avoid impacting website operations.
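To make this concrete, here is a minimal sketch of how robots.txt directives can be read with Python's standard library. It illustrates the directives the platform honors (sitemap locations and crawl-delay); it is not the platform's implementation, and the fallback delay shown is an arbitrary placeholder.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Sitemap URLs declared in robots.txt (returns None if there are none).
print(rp.site_maps())

# Crawl-delay for a given user agent; fall back to a polite default
# (the platform's actual default delay is not documented here).
delay = rp.crawl_delay("ai12zCopilot") or 1.0
print(f"Waiting {delay} seconds between requests")
```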
Step 3: Sitemap Extraction
The platform will process the sitemaps, which could be multiple for large sites.
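For reference, the sketch below shows one way page URLs can be pulled out of a sitemap or sitemap index using only the Python standard library. The domain and sitemap path are placeholders, and large, gzipped, or paginated sitemaps would need more handling than shown here.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return page URLs from a sitemap, following sitemap-index files recursively."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    urls = []
    # A <sitemapindex> points at further sitemaps; a <urlset> lists pages.
    for sm in root.findall("sm:sitemap/sm:loc", NS):
        urls.extend(sitemap_urls(sm.text.strip()))
    for loc in root.findall("sm:url/sm:loc", NS):
        urls.append(loc.text.strip())
    return urls

print(len(sitemap_urls("https://www.example.com/sitemap.xml")))
```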
A progress indicator is displayed during sitemap extraction. After about a minute, a dialog appears to notify you that this process could take a while. You can close your browser and log in later. An email notification will be sent; log in, go to the document action menu, and click Continue Ingestion.
If you close the dialog before the system has completed the sitemap extraction, you can always check progress and status in the action bar.
Step 4: Reviewing the URL Histogram
After the initial collection of URLs, a histogram will be presented for review. If you closed the dialog, you can return to the action menu and select Continue Ingestion.
You will receive an email reminder to take action and apply filters.
Users should apply filters based on the histogram data:
- Include Filters: Add one filter per line, with slashes around it to match the URL pattern. There are no wildcards such as *; matching is a plain substring check (see the sketch after this list).
- Exclude Filters: Language filters are useful for exclusion; for instance, you might keep /en/ or /en-us/ while excluding the other languages.
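As an illustration of the substring matching described above, here is a small sketch in Python. The URLs, patterns, and path-segment histogram are all made up; the platform's actual grouping and matching logic may differ in detail.

```python
from urllib.parse import urlparse
from collections import Counter

urls = [
    "https://www.example.com/en/corp/about",
    "https://www.example.com/en/news/2024",
    "https://www.example.com/fr/corp/about",
]

include = ["/corp/"]           # one pattern per line in the dialog
exclude = ["/fr/", "/news/"]

def keep(url):
    # Plain substring matching: no wildcards such as *.
    if include and not any(p in url for p in include):
        return False
    return not any(p in url for p in exclude)

kept = [u for u in urls if keep(u)]
print(kept)  # ['https://www.example.com/en/corp/about']

# A simple histogram of first path segments, similar in spirit to the
# review screen (the platform's grouping may differ).
hist = Counter("/" + urlparse(u).path.split("/")[1] + "/" for u in urls)
print(hist)
```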
Step 5: Finalizing Ingestion
Click Next.
For large websites, the ingestion could take several hours. The platform adheres to the crawl delay set in the robots.txt
file or sets its own delay to prevent affecting website operations.
You will be notified by email when the website ingestion is complete.
Additional Information
- If no sitemap is available, the platform performs a broad crawl of the website and then builds a histogram so you can filter what is ingested (a toy sketch of a broad crawl follows this list).
- Including or excluding URL patterns streamlines the ingestion process by focusing on the most relevant URLs.
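For intuition only, here is a toy breadth-first crawl written with the Python standard library: it starts at one page, follows same-site links, and collects the URLs it finds. Every name, limit, and delay here is illustrative; the platform's crawler is far more capable than this sketch.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def broad_crawl(start_url, max_pages=50, delay=1.0):
    """Breadth-first crawl of a single site, returning the URLs it visited."""
    site = urlparse(start_url).netloc
    seen, queue, found = {start_url}, deque([start_url]), []
    while queue and len(found) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # be polite between requests
    return found
```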
Usage Tips
- Use 'Include Patterns' and 'Exclude Patterns' to tailor the content being ingested.
- Patterns work by matching strings in the URL. For example, adding /news/ to the exclude field means any URL containing this string will not be included.
If you don't find the notification in your inbox, be sure to check your junk or spam folder as well.
Add noreply@ai12z.com to your email contact list.