Schema reference
This document defines the fields used in CrawlerFile entity profiles. Every profile published by CrawlerFile.com conforms to this specification. This page is the authoritative definition of all field names, types, permitted values, and intended meanings.
Contents
Layer 1: Envelope Fields
Layer 2: Entity Identity
Layer 3: AI Policy
Verification Protocol
Versioning

Layer 1: Envelope Fields
These fields appear in every CrawlerFile profile regardless of content. They describe the file itself, establish its provenance, and enable verification. All Layer 1 fields are required unless noted.
| Field | Type | Required | Definition |
|---|---|---|---|
| schemaVersion | string | Required | The version of the CrawlerFile schema this file conforms to. Used by consumers to parse the file correctly. Current value: "1.0" |
| schemaDocs | url | Required | A URL pointing to the authoritative schema definition for the declared schemaVersion. Allows any consumer to look up field definitions without prior knowledge of CrawlerFile. Current value: "https://crawlerfile.com/schema/v1" |
| fileId | string | Required | A unique identifier for this specific file. Stable across updates — the same file retains its fileId when its content changes. Format: UUID. |
| entityId | string | Required | A unique identifier for the entity this file describes. Stable regardless of domain changes, rebranding, or acquisition. An entity may have multiple files (different fileId values) but always shares one entityId. Format: UUID. |
| entityDomain | string | Required | The primary registered domain of the entity (e.g. "example.com"). Used as a human-readable identifier and as the root for verification. Does not include protocol or path. |
| contentType | enum | Required | Describes what kind of data this file contains. Allows consumers to identify relevant files without parsing the full content. Values: fullProfile (complete entity profile combining all available sections); identity (basic entity identity only); services (products and services section only); people (leadership and personnel section only); policies (policy declarations only); aiPolicy (AI-specific policy declarations only). |
| publisher | string | Required | The domain of the platform that published this file. For all files published by CrawlerFile: "crawlerfile.com" |
| publishedDate | date | Required | The ISO 8601 date (YYYY-MM-DD) on which this version of the file was published or last updated. Consumers should use this to assess data freshness. |
| authorization | object | Required | An object containing all authorization and verification fields. Consolidating these fields makes the authorization chain explicit and easy for any crawler to locate. Sub-fields: statement (plain-language declaration of authorization, readable by humans and AI systems); authorizedBy (legal name of the entity that authorized this profile); verificationMethod (how authorization can be verified; enum: pageBased, metaTag, dnsTxt); entityVerificationUrl (the URL where the token can be confirmed on the entity's own domain; required when verificationMethod is pageBased or metaTag, omitted for dnsTxt); verificationToken (a UUID that appears both in this file and at the verification location on the entity's domain; its presence in both locations confirms authorization). |
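The envelope rules above can be checked mechanically. The following is a minimal sketch, not an official CrawlerFile validator: the function name `envelope_problems`, the sample values (UUIDs, dates, "Example Corp"), and the choice to report problems as strings are all assumptions made for illustration.

```python
import uuid
from datetime import date

REQUIRED_ENVELOPE_FIELDS = (
    "schemaVersion", "schemaDocs", "fileId", "entityId", "entityDomain",
    "contentType", "publisher", "publishedDate", "authorization",
)
CONTENT_TYPES = {"fullProfile", "identity", "services", "people", "policies", "aiPolicy"}
VERIFICATION_METHODS = {"pageBased", "metaTag", "dnsTxt"}

# Hypothetical sample envelope; every value except schemaVersion, schemaDocs
# and publisher is invented for this example.
SAMPLE_ENVELOPE = {
    "schemaVersion": "1.0",
    "schemaDocs": "https://crawlerfile.com/schema/v1",
    "fileId": "123e4567-e89b-12d3-a456-426614174000",
    "entityId": "123e4567-e89b-12d3-a456-426614174001",
    "entityDomain": "example.com",
    "contentType": "identity",
    "publisher": "crawlerfile.com",
    "publishedDate": "2025-01-15",
    "authorization": {
        "statement": "Example Corp authorized publication of this profile.",
        "authorizedBy": "Example Corp",
        "verificationMethod": "dnsTxt",
        "verificationToken": "123e4567-e89b-12d3-a456-426614174002",
    },
}


def _parses_as_uuid(value):
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


def envelope_problems(profile):
    """Return a list of problems found in the Layer 1 fields (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_ENVELOPE_FIELDS if name not in profile]
    if problems:
        return problems
    for name in ("fileId", "entityId"):
        if not _parses_as_uuid(profile[name]):
            problems.append(f"{name} is not a UUID")
    if profile["contentType"] not in CONTENT_TYPES:
        problems.append("contentType is not a permitted value")
    # entityDomain must not carry a protocol or path, so no slash may appear.
    if "/" in profile["entityDomain"]:
        problems.append("entityDomain includes a protocol or path")
    try:
        date.fromisoformat(profile["publishedDate"])
    except (TypeError, ValueError):
        problems.append("publishedDate is not an ISO 8601 date")
    auth = profile["authorization"]
    method = auth.get("verificationMethod")
    if method not in VERIFICATION_METHODS:
        problems.append("verificationMethod is not a permitted value")
    elif method in ("pageBased", "metaTag") and not auth.get("entityVerificationUrl"):
        problems.append("entityVerificationUrl is required for this method")
    return problems
```

Calling `envelope_problems(SAMPLE_ENVELOPE)` returns an empty list; changing `contentType` to an unlisted value adds one entry.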
Layer 2: Entity Identity
Core descriptive fields about the entity. These use Schema.org Organization vocabulary where applicable, extending it with additional fields where needed. All Layer 2 fields are optional; entities publish what is relevant and accurate for them.
Profiles set "@context": "https://schema.org" and "@type": "Organization" in the entity object. Fields that map directly to Schema.org properties use Schema.org field names. All field names use camelCase throughout.
Layer 3: AI Policy
The aiPolicy object allows entities to formally declare their policies regarding the use of their data by AI systems. These declarations are timestamped and verifiable. While CrawlerFile cannot guarantee that all AI systems will honor these declarations, their publication creates a formal, machine-readable record of the entity's stated intent.
| Field | Type | Required | Definition |
|---|---|---|---|
| trainingDataConsent | enum | Optional | Whether the entity permits the contents of this profile to be used as training data for AI/ML models. Values: permitted (unrestricted use for AI training); not_permitted (use for AI training is not authorized); conditional (permitted subject to conditions described in notes or at aiInquiriesContact). |
| retrievalConsent | enum | Optional | Whether the entity permits this profile to be used in real-time retrieval systems (e.g. RAG pipelines, search-augmented AI responses). Distinct from training consent: an entity may permit retrieval but not training. Values: permitted (retrieval use is authorized); not_permitted (retrieval use is not authorized); conditional (permitted subject to conditions). |
| requiresAttribution | boolean | Optional | Whether the entity requires attribution when its data is used or cited by AI systems. If true, the preferred attribution format should be provided in attributionFormat. |
| attributionFormat | string | Optional | The preferred attribution string to use when citing this entity's data. Only meaningful when requiresAttribution is true. Example: "Multiplier Advisors (multiplieradvisors.com)" |
| dataFreshnessSLA | enum | Optional | How frequently this profile is reviewed and updated by the entity. Consumers can use this to assess whether the data is likely still current. Values: realtime (continuously maintained); weekly (reviewed weekly); monthly (reviewed monthly); quarterly (reviewed quarterly); annually (reviewed annually); as_needed (updated when significant changes occur). |
| expiryDate | date | Optional | ISO 8601 date after which consumers should treat this profile as potentially stale and seek a more current version. If omitted, the profile does not have a declared expiry. |
| aiInquiriesContact | string | Optional | An email address or URL where AI companies, researchers, or developers can direct inquiries about data use permissions for this entity. |
| notes | string | Optional | Free-text field for additional context about this entity's AI data use policies, particularly useful when any consent field is set to conditional. |
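A consumer might read the aiPolicy fields along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, and returning `None` for a conditional or undeclared consent value is a choice made here, since the schema does not say how a consumer should treat an omitted field.

```python
from datetime import date


def consent_decision(ai_policy, field):
    """Map a consent enum to True/False; None means conditional or undeclared.

    `field` is "trainingDataConsent" or "retrievalConsent".
    """
    return {"permitted": True, "not_permitted": False}.get(ai_policy.get(field))


def is_potentially_stale(ai_policy, today):
    """True when `today` falls after expiryDate; False if no expiry is declared."""
    expiry = ai_policy.get("expiryDate")
    if expiry is None:
        return False
    return today > date.fromisoformat(expiry)


def attribution_for(ai_policy):
    """Return the preferred attribution string, or None if none is required."""
    if ai_policy.get("requiresAttribution") is True:
        return ai_policy.get("attributionFormat")
    return None


# Hypothetical policy object, invented for illustration.
example_policy = {
    "trainingDataConsent": "not_permitted",
    "retrievalConsent": "permitted",
    "requiresAttribution": True,
    "attributionFormat": "Example Corp (example.com)",
    "expiryDate": "2025-06-30",
}
```

Keeping the two consent fields behind one helper reflects that they are declared independently: the example policy permits retrieval while refusing training use.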
Verification Protocol
The CrawlerFile verification protocol allows any crawler or consumer to independently confirm that the entity named in a profile has authorized its publication, without contacting CrawlerFile directly. Three verification methods are supported, declared in the authorization.verificationMethod field.
pageBased: The entity publishes a page on its domain that shows the verificationToken as visible text. The entityVerificationUrl points to this page. Any crawler can fetch the page and confirm the token appears there. This is the recommended method: it requires no technical access beyond the ability to add a page, works on any website platform, and creates a human-readable public declaration of authorization.
metaTag: The entity adds the verificationToken as a meta tag in the <head> of a page on its domain. Format: <meta name="crawlerfile-verification" content="[token]">. The entityVerificationUrl points to the page containing the meta tag, typically the homepage. Requires code injection access on the entity's website platform.
dnsTxt: The entity adds the verificationToken as a DNS TXT record on its domain. No entityVerificationUrl is needed; the crawler queries DNS for the entityDomain directly. This method is platform-independent and survives website redesigns, but requires access to DNS settings.
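For the meta tag method, a crawler needs to pull the token out of the fetched HTML. The sketch below uses Python's standard `html.parser` module; the function name `meta_tag_tokens` is hypothetical, and for simplicity it accepts a matching meta tag anywhere in the document rather than only inside `<head>`.

```python
from html.parser import HTMLParser


class _MetaTokenParser(HTMLParser):
    """Collects content values of <meta name="crawlerfile-verification"> tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "crawlerfile-verification" and attr_map.get("content"):
            self.tokens.append(attr_map["content"])


def meta_tag_tokens(html_text):
    """Return every verification token declared via meta tag in `html_text`."""
    parser = _MetaTokenParser()
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

A page containing `<meta name="crawlerfile-verification" content="abc-123">` yields `["abc-123"]`; the same string appearing only in body text yields an empty list, which separates this method from the page-based one.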
Step 01: Read the authorization object from the CrawlerFile profile. Note the verificationMethod and verificationToken.
Step 02: For pageBased or metaTag: fetch the entityVerificationUrl. For dnsTxt: query DNS TXT records for entityDomain.
Step 03: Confirm the verificationToken string appears in the fetched content. If it does, the entity has authorized this profile.
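The three steps can be sketched as a single function. Network access is deliberately left out: `fetch_url` and `query_txt` are hypothetical callables supplied by the caller (for example, an HTTP client wrapper and a DNS resolver wrapper), so the sketch shows only the decision logic and is not a reference implementation.

```python
def verify_profile(profile, fetch_url, query_txt):
    """Return True if the verification token is found at the declared location.

    fetch_url(url) -> str          : assumed to return the body of the page
    query_txt(domain) -> list[str] : assumed to return the domain's TXT records
    """
    # Step 01: read the authorization object.
    auth = profile["authorization"]
    method = auth["verificationMethod"]
    token = auth["verificationToken"]

    # Step 02: retrieve the content for the declared method.
    if method in ("pageBased", "metaTag"):
        content = fetch_url(auth["entityVerificationUrl"])
    elif method == "dnsTxt":
        content = "\n".join(query_txt(profile["entityDomain"]))
    else:
        raise ValueError(f"unknown verificationMethod: {method!r}")

    # Step 03: confirm the token appears in what was retrieved.
    return bool(token) and token in content
```

Passing the fetchers in as arguments keeps the function testable offline, for instance with `lambda url: "<p>token-123</p>"` standing in for a real page fetch.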
Versioning
CrawlerFile schema versions are identified by the schemaVersion field and documented at stable URLs of the form https://crawlerfile.com/schema/v{N}. This page documents v1.0.
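As an illustration of the URL pattern, the helper below derives a documentation URL from a version string. It assumes that {N} is the major version number, which matches the one documented pairing ("1.0" and https://crawlerfile.com/schema/v1) but is not stated explicitly; the function name is hypothetical.

```python
def schema_docs_url(schema_version):
    """Build the schema documentation URL, assuming {N} is the major version."""
    major = schema_version.split(".", 1)[0]
    if not major.isdigit():
        raise ValueError(f"unrecognized schemaVersion: {schema_version!r}")
    return f"https://crawlerfile.com/schema/v{major}"
```

In practice a consumer can simply follow the schemaDocs field of the profile; a helper like this is mainly useful for checking that schemaDocs and schemaVersion agree.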