Website Ingestion
ai12z enables you to ingest entire websites for your AI copilot. This process extracts and indexes web content so your assistant can answer questions based on up-to-date site data.
Step 1: Start Website Ingestion
- Click Add Document and select Add Website.
- Complete the form with the required information:
- Name: A recognizable label for this website source.
- URL: The root address of the site to ingest (e.g., https://www.example.com).
- Description: (Optional) Describe the site’s content or purpose.
- Ingest PDF: If checked, PDFs discovered in the sitemap or during crawling will also be processed.
- Show Histogram: When enabled, you’ll get a summary of URLs before ingestion starts, allowing you to review and filter what gets included.
- Filter By Language: (Optional) Restrict ingestion to a specific language.
- Include Patterns: Enter URL patterns to include in ingestion, one per line (e.g., /en/).
- Exclude Patterns: Enter URL patterns to exclude (e.g., /archive/).
Step 2: Configure Advanced Settings (Optional)
Expand Advanced Settings for additional controls:
- Authentication Type:
  - None: For public sites (default).
  - Basic: For username/password authentication.
  - Token: For token-based authentication (beta).
- Request Type:
  - Auto (default): ai12z chooses the best mode for each page.
  - Synchronous: Fastest, for traditional sites.
  - Asynchronous: Uses a headless browser, for sites that load data with JavaScript.
  - Advanced Mode: For sites with extra protection; incurs an additional charge per page.
- Headers: Rarely needed; sometimes required for security.
- Force Crawl: Ignores the sitemap and crawls the site. Use it if the sitemap is missing or not kept up to date. Enabling this also bypasses WordPress API ingestion, which helps when a WordPress site has content types that the API ingestion misses.
- ai12z user-agent: Identifies the crawler as ai12zCopilot/1.0, if your site requires it. If unchecked, a common user-agent is used.
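To make the authentication and user-agent options more concrete, here is a minimal sketch of the standard HTTP headers such settings typically correspond to. It does not show ai12z's actual requests: the helper `build_request_headers` and its parameters are hypothetical, and sending the token as a `Bearer` value is an assumption, since this page does not document the token scheme.

```python
import base64


def build_request_headers(auth_type="none", username=None, password=None,
                          token=None, identify_as_ai12z=False):
    """Return illustrative HTTP headers for one request (hypothetical helper)."""
    headers = {}
    if auth_type == "basic":
        # Standard HTTP Basic auth: base64 of "username:password".
        credentials = f"{username}:{password}".encode("utf-8")
        encoded = base64.b64encode(credentials).decode("ascii")
        headers["Authorization"] = "Basic " + encoded
    elif auth_type == "token":
        # Assumption: the token travels as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    if identify_as_ai12z:
        # User-agent string taken from the setting described above.
        headers["User-Agent"] = "ai12zCopilot/1.0"
    return headers


example_headers = build_request_headers(
    "basic", username="demo", password="secret", identify_as_ai12z=True
)
print(example_headers)
```

A site owner can use the same idea in reverse: if the site only admits known crawlers, allow-listing the `ai12zCopilot/1.0` user-agent is what makes that checkbox useful.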
Warning: Adjust advanced settings only if you understand the impact. Non-optimal settings can lead to incomplete or failed ingestions. Advanced Mode incurs an extra $0.0025 per page.
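As a quick way to budget for Advanced Mode, the surcharge is simply the page count multiplied by the per-page rate. The sketch below uses the $0.0025 rate from the warning; the function name and the 10,000-page figure are illustrative assumptions.

```python
from decimal import Decimal

# Per-page Advanced Mode surcharge in USD, as stated in the warning.
ADVANCED_MODE_RATE = Decimal("0.0025")


def advanced_mode_surcharge(page_count):
    """Extra Advanced Mode cost in USD for page_count pages."""
    return ADVANCED_MODE_RATE * page_count


# Hypothetical site with 10,000 pages.
print(advanced_mode_surcharge(10_000))  # 25.0000
```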
Step 3: Robots.txt & Sitemap Extraction
- ai12z checks the robots.txt file and sitemaps for allowed paths and crawl delays.
- Progress is shown as the system parses sitemaps and prepares for ingestion.
If the process will take a while, you can close your browser or log out. You'll receive an email when it's ready for review.
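To preview what this step is likely to find on your own site, you can run a similar check with Python's standard library. This is a sketch, not ai12z's implementation: the robots.txt and sitemap contents below are made up, and a real check would fetch them from your site.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

USER_AGENT = "ai12zCopilot/1.0"

# Made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
"""

# Made-up sitemap for illustration.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/en/about</loc></url>
  <url><loc>https://www.example.com/private/report</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Keep only the sitemap URLs that robots.txt allows for this user-agent.
allowed = [u for u in sitemap_urls(SITEMAP_XML) if robots.can_fetch(USER_AGENT, u)]
crawl_delay = robots.crawl_delay(USER_AGENT)

print(allowed)
print(crawl_delay)
```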
Step 4: Review URL Histogram & Apply Filters
If you enabled “Show Histogram,” you’ll see a breakdown of collected URLs and their counts.
- Include Patterns: Add URL segments (e.g., /en/) to restrict ingestion to those paths.
- Exclude Patterns: List URL segments to omit (e.g., /test/, /fr/).
- No wildcards; patterns are matched as substrings.
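The sketch below illustrates how substring matching behaves, so you can predict which URLs a set of patterns will keep. It rests on assumptions: the histogram is approximated by counting URLs per first path segment, a URL is kept when it contains at least one include pattern (or no include patterns are set) and none of the exclude patterns, and the function names and sample URLs are invented for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def url_histogram(urls):
    """Count URLs by first path segment, e.g. '/en/' (assumed grouping)."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = f"/{segments[0]}/" if segments else "/"
        counts[key] += 1
    return counts


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs matching the include/exclude patterns as plain substrings."""
    kept = []
    for url in urls:
        # With include patterns set, the URL must contain at least one.
        if include and not any(pattern in url for pattern in include):
            continue
        # Any exclude pattern found in the URL drops it.
        if any(pattern in url for pattern in exclude):
            continue
        kept.append(url)
    return kept


urls = [
    "https://www.example.com/en/products",
    "https://www.example.com/en/test/page",
    "https://www.example.com/fr/produits",
    "https://www.example.com/en/archive/2019",
]

print(url_histogram(urls))
print(filter_urls(urls, include=["/en/"], exclude=["/test/", "/archive/"]))
```

Because matching is by substring, a pattern such as /en/ also matches a URL like /blog/en/post, so include the surrounding slashes and check the histogram for unexpected matches.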
Step 5: Finalize and Ingest
- Review your filters and click Next to start ingestion.
- For large sites, ingestion can take hours. ai12z honors the robots.txt crawl-delay or uses a safe default to avoid impacting site operations.
- You'll be notified by email when ingestion is complete.
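To see why large sites take hours, a rough lower bound is the page count multiplied by the delay between requests. The sketch assumes pages are fetched one at a time and ignores download and processing time; the 1-second delay and 20,000-page count are hypothetical, since the default delay is not stated here.

```python
from datetime import timedelta


def minimum_crawl_time(page_count, crawl_delay_seconds):
    """Lower bound on crawl duration when requests are spaced by a fixed delay."""
    return timedelta(seconds=page_count * crawl_delay_seconds)


# Hypothetical: 20,000 pages with a 1-second delay between requests.
print(minimum_crawl_time(20_000, 1))  # 5:33:20
```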
Monitoring & Managing Ingestion
- Check Status: From the action menu, select Status to view progress or Info for details.
- Continue Ingest: If you paused at the histogram step, use Continue Ingest in the action menu after setting filters.
Additional Features
- Meta Tags: If a page contains <meta name="tags" content="tag1, tag2">, these tags are added to the vector index for that URL.
- JSON-LD: If JSON-LD is present, it’s ingested with the page content. This can be configured in Agent settings (ingest only JSON-LD or both).
- If no sitemap is found, ai12z will broadly crawl the site and build a histogram for you to review before data is saved.
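If you want to check what a page exposes before ingesting it, the sketch below pulls the tags meta element and any JSON-LD blocks out of an HTML document using Python's standard library. It is an illustration of the markup involved, not ai12z's parser; the class name and the sample page are invented for the example.

```python
import json
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collect <meta name="tags"> values and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "tags":
            # Split the comma-separated content into individual tags.
            content = attrs.get("content") or ""
            self.tags.extend(t.strip() for t in content.split(",") if t.strip())
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.json_ld.append(json.loads("".join(self._buffer)))


# Made-up page for illustration.
SAMPLE_HTML = """
<html><head>
<meta name="tags" content="pricing, enterprise">
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
</head><body><p>Hello</p></body></html>
"""

parser = PageMetadataParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.tags)
print(parser.json_ld)
```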
Tips & Best Practices
- Use Include/Exclude Patterns to focus on relevant content and skip redundant or irrelevant sections.
- Filtering by language is highly recommended for multilingual sites—preferably include only one language per ingestion run.
- Always check your email (including spam/junk folders) for ingestion notifications from noreply@ai12z.com.
If you skip “Show Histogram,” the system ingests all discovered URLs immediately.