Creating and Uploading a Custom Sitemap
Overview
Use a custom sitemap when you need to batch upload URLs and their associated content. While the "Add Website" feature supports filtering and is typically more efficient, a custom sitemap is the better fit when you already have a fixed list of URLs to upload in bulk.
Sitemap JSON Format
This JSON format lets you upload web URLs, functioning like a sitemap. Once uploaded, the system ingests all the web content linked to these URLs.
Example JSON (add more URLs to the urls array as needed):
{
  "type": "webLinks",
  "crawlDelay": 2,
  "forceMode": "auto",
  "lang": ["en"],
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/fileName.pdf"
  ]
}
Parameters
- type: Must be set to webLinks.
- crawlDelay (Optional): Delay between crawls, in seconds. Default is 2, which is also the minimum value.
- forceMode (Optional): One of auto, sync, or async. Default is auto. In most cases (99%), it should be set to auto.
- lang (Optional): Array of 2-letter language codes, for example ["en"]. Default is [], which includes all languages. Set it when you want to filter out every language except the ones listed.
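The parameter rules above can be checked locally before uploading. The following is a minimal sketch, not part of the product: the function name validate_sitemap is hypothetical, and the checks that "urls" is non-empty and that each URL starts with http:// or https:// are assumptions rather than documented requirements.

```python
ALLOWED_FORCE_MODES = {"auto", "sync", "async"}
MIN_CRAWL_DELAY = 2  # seconds; documented default and minimum


def validate_sitemap(payload):
    """Return a list of problems; an empty list means no problems were found."""
    problems = []

    if payload.get("type") != "webLinks":
        problems.append('"type" must be "webLinks"')

    delay = payload.get("crawlDelay", MIN_CRAWL_DELAY)
    is_number = isinstance(delay, (int, float)) and not isinstance(delay, bool)
    if not is_number or delay < MIN_CRAWL_DELAY:
        problems.append('"crawlDelay" must be a number >= 2')

    mode = payload.get("forceMode", "auto")
    if not isinstance(mode, str) or mode not in ALLOWED_FORCE_MODES:
        problems.append('"forceMode" must be auto, sync, or async')

    langs = payload.get("lang", [])
    if not isinstance(langs, list) or any(
        not (isinstance(code, str) and len(code) == 2) for code in langs
    ):
        problems.append('"lang" must be an array of 2-letter codes')

    # Assumption: an upload with no URLs, or with non-HTTP(S) URLs, is a mistake.
    urls = payload.get("urls")
    if not isinstance(urls, list) or not urls:
        problems.append('"urls" must be a non-empty array')
    elif any(
        not isinstance(u, str) or not u.startswith(("http://", "https://"))
        for u in urls
    ):
        problems.append('every entry in "urls" must be an http(s) URL string')

    return problems


good = {"type": "webLinks", "urls": ["https://example.com/page1"]}
bad = {"type": "weblinks", "crawlDelay": 1, "forceMode": "fast",
       "lang": ["eng"], "urls": []}
print(validate_sitemap(good))
print(validate_sitemap(bad))
```

Returning a list of messages instead of raising on the first error lets you fix every issue in the file in one pass.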
Python Helper Code to Generate JSON from CSV
The following Python script converts a CSV file containing URLs into the required JSON format for the custom sitemap.
import csv
import json

# Path to CSV file
csv_file_path = "path/to/your/csvfile.csv"

# Read CSV file and convert to JSON format
json_data = {"type": "webLinks", "crawlDelay": 2, "forceMode": "auto", "urls": []}
with open(csv_file_path, mode='r', encoding='utf-8-sig') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        url = row["URL"]
        if "uas" in url:  # Filter condition
            json_data["urls"].append(url)

# Save JSON data to a file
json_file_path = "path/to/your/jsonfile.json"
with open(json_file_path, 'w', encoding='utf-8') as jsonfile:
    json.dump(json_data, jsonfile, ensure_ascii=False, indent=2)

print("JSON file saved at:", json_file_path)
Note:
- Ensure the CSV file has a column named "URL" containing the URLs.
- Adjust csv_file_path and json_file_path as needed.
- Modify the filter condition in the script (if "uas" in url) according to your specific requirements.
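One way to make the filter condition easier to change is to pass it in as a function instead of editing the loop. The sketch below is a variation on the script above, not an official helper: build_sitemap, its keep and lang arguments, and the duplicate-URL skipping are all assumptions added for illustration. It reads from an in-memory CSV so it runs without any files.

```python
import csv
import io
import json


def build_sitemap(csv_file, keep=lambda url: True, lang=None):
    """Build a webLinks sitemap dict from an open CSV file with a "URL" column."""
    sitemap = {"type": "webLinks", "crawlDelay": 2, "forceMode": "auto", "urls": []}
    if lang:
        sitemap["lang"] = list(lang)

    seen = set()  # Assumption: repeated URLs in the CSV should be listed once.
    for row in csv.DictReader(csv_file):
        url = (row.get("URL") or "").strip()
        if url and url not in seen and keep(url):
            seen.add(url)
            sitemap["urls"].append(url)
    return sitemap


# In-memory stand-in for a real CSV file with a "URL" column.
sample = io.StringIO(
    "URL\n"
    "https://example.com/uas/a\n"
    "https://example.com/b\n"
    "https://example.com/uas/a\n"
)

# Keep only URLs whose path contains /uas/; swap in any predicate you need.
sitemap = build_sitemap(sample, keep=lambda url: "/uas/" in url, lang=["en"])
print(json.dumps(sitemap, indent=2))
```

To use a real file, open it with encoding='utf-8-sig' as in the script above and pass the file object as csv_file.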