CrawlerFile Schema v1.0 — Field Reference

Layer 1 — Envelope Fields

These fields appear in every CrawlerFile profile regardless of content. They describe the file itself, establish its provenance, and enable verification. All Layer 1 fields are required unless noted.

Field	Type	Required	Definition
schemaVersion	string	Required	The version of the CrawlerFile schema this file conforms to. Used by consumers to parse the file correctly. Current value: `"1.0"`
schemaDocs	url	Required	A URL pointing to the authoritative schema definition for the declared `schemaVersion`. Allows any consumer to look up field definitions without prior knowledge of CrawlerFile. Current value: `"https://crawlerfile.com/schema/v1"`
fileId	string	Required	A unique identifier for this specific file. Stable across updates — the same file retains its `fileId` when its content changes. Format: UUID.
entityId	string	Required	A unique identifier for the entity this file describes. Stable regardless of domain changes, rebranding, or acquisition. An entity may have multiple files (different `fileId` values) but always shares one `entityId`. Format: UUID.
entityDomain	string	Required	The primary registered domain of the entity (e.g. `"example.com"`). Used as a human-readable identifier and as the root for verification. Does not include protocol or path.
contentType	enum	Required	Describes what kind of data this file contains. Allows consumers to identify relevant files without parsing the full content. Values: `fullProfile` (complete entity profile combining all available sections); `identity` (basic entity identity only); `services` (products and services section only); `people` (leadership and personnel section only); `policies` (policy declarations only); `aiPolicy` (AI-specific policy declarations only).
publisher	string	Required	The domain of the platform that published this file. For all files published by CrawlerFile: `"crawlerfile.com"`
publishedDate	date	Required	The ISO 8601 date (`YYYY-MM-DD`) on which this version of the file was published or last updated. Consumers should use this to assess data freshness.
authorization	object	Required	An object containing all authorization and verification fields. Consolidating these fields makes the authorization chain explicit and easy for any crawler to locate. Sub-fields: `statement` (plain-language declaration of authorization, readable by humans and AI systems); `authorizedBy` (legal name of the entity that authorized this profile); `verificationMethod` (how authorization can be verified, one of `pageBased`, `metaTag`, `dnsTxt`); `entityVerificationUrl` (the URL where the token can be confirmed on the entity's own domain, required when `verificationMethod` is `pageBased` or `metaTag` and omitted for `dnsTxt`); `verificationToken` (a UUID that appears both in this file and at the verification location on the entity's domain, where its presence in both locations confirms authorization).
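
As an illustration of the envelope rules in the table above, the following is a minimal Python sketch of a consumer-side check. It is not an official validator: the function names, the error messages, and every value in the example profile (UUIDs, dates, the "Example Inc." name) are placeholders chosen for this sketch, and the UUID check on `verificationToken` simply follows the sub-field description.

```python
import uuid
from datetime import datetime

REQUIRED_FIELDS = (
    "schemaVersion", "schemaDocs", "fileId", "entityId", "entityDomain",
    "contentType", "publisher", "publishedDate", "authorization",
)
CONTENT_TYPES = {"fullProfile", "identity", "services", "people", "policies", "aiPolicy"}
VERIFICATION_METHODS = {"pageBased", "metaTag", "dnsTxt"}


def is_uuid(value):
    """True if value parses as a UUID string."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


def validate_envelope(profile):
    """Return a list of problems found in the Layer 1 fields (empty if none)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in profile]
    if errors:
        return errors
    for name in ("fileId", "entityId"):
        if not is_uuid(profile[name]):
            errors.append(f"{name} is not a UUID")
    if profile["contentType"] not in CONTENT_TYPES:
        errors.append(f"unknown contentType: {profile['contentType']}")
    if "/" in str(profile["entityDomain"]):
        errors.append("entityDomain must not include protocol or path")
    try:
        datetime.strptime(profile["publishedDate"], "%Y-%m-%d")
    except (ValueError, TypeError):
        errors.append("publishedDate is not a YYYY-MM-DD date")
    auth = profile["authorization"]
    if not isinstance(auth, dict):
        return errors + ["authorization is not an object"]
    method = auth.get("verificationMethod")
    if method not in VERIFICATION_METHODS:
        errors.append(f"unknown verificationMethod: {method}")
    elif method != "dnsTxt" and not auth.get("entityVerificationUrl"):
        errors.append(f"entityVerificationUrl is required for {method}")
    if not is_uuid(auth.get("verificationToken")):
        errors.append("verificationToken is not a UUID")
    return errors


# Placeholder profile: every value below is illustrative only.
example = {
    "schemaVersion": "1.0",
    "schemaDocs": "https://crawlerfile.com/schema/v1",
    "fileId": "123e4567-e89b-12d3-a456-426614174000",
    "entityId": "123e4567-e89b-12d3-a456-426614174001",
    "entityDomain": "example.com",
    "contentType": "identity",
    "publisher": "crawlerfile.com",
    "publishedDate": "2025-01-15",
    "authorization": {
        "statement": "Example Inc. authorizes publication of this profile.",
        "authorizedBy": "Example Inc.",
        "verificationMethod": "dnsTxt",
        "verificationToken": "123e4567-e89b-12d3-a456-426614174002",
    },
}

print(validate_envelope(example))
```

Because the example uses `dnsTxt`, it omits `entityVerificationUrl`. Switching `verificationMethod` to `pageBased` without adding that URL would produce an error from this sketch.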

Layer 2 — Entity Identity

Core descriptive fields about the entity. These use Schema.org Organization vocabulary where applicable, extending it with additional fields where needed. All Layer 2 fields are optional — entities publish what is relevant and accurate for them.

Schema.org compatibility: CrawlerFile profiles include "@context": "https://schema.org" and "@type": "Organization" in the entity object. Fields that map directly to Schema.org properties use Schema.org field names. All field names use camelCase throughout.

Loose schema by design: Layer 1 (the envelope) is strictly defined and must be consistent across all profiles for verification and discovery to work. Layers 2 and 3 are intentionally flexible. Entities are encouraged to publish whatever data is accurate and relevant to them, using Schema.org vocabulary where it fits and plain descriptive field names where it does not. AI consumers are well-suited to interpret varied, expressive, human-readable field names and values — rigid field constraints are not required for machine comprehension.
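
To make the Layer 2 conventions concrete, here is a small Python sketch of an entity object and a lint-style check of the two conventions stated above (the Schema.org markers and camelCase field names). The check itself, the function name, and the example values are assumptions for illustration. `name`, `url`, and `foundingDate` are Schema.org Organization properties, while `industriesServed` is a hypothetical extension field of the kind the loose schema allows.

```python
import re

# camelCase: starts with a lowercase letter, then letters or digits only.
CAMEL_CASE = re.compile(r"^[a-z][A-Za-z0-9]*$")


def check_entity(entity):
    """Return a list of convention problems in the top-level entity object."""
    problems = []
    if entity.get("@context") != "https://schema.org":
        problems.append('"@context" should be "https://schema.org"')
    if entity.get("@type") != "Organization":
        problems.append('"@type" should be "Organization"')
    for key in entity:
        # JSON-LD keywords such as "@context" are exempt from the camelCase rule.
        if not key.startswith("@") and not CAMEL_CASE.match(key):
            problems.append(f"field name is not camelCase: {key}")
    return problems


entity = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Inc.",              # Schema.org property
    "url": "https://example.com",        # Schema.org property
    "foundingDate": "2010-04-01",        # Schema.org property
    "industriesServed": ["logistics", "retail"],  # hypothetical extension field
}

print(check_entity(entity))
```

Only top-level keys are inspected here. Since Layers 2 and 3 are intentionally flexible, a consumer would likely treat such findings as warnings, not as grounds for rejecting a profile.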

Layer 3 — AI Policy

The aiPolicy object allows entities to formally declare their policies regarding the use of their data by AI systems. These declarations are timestamped and verifiable. While CrawlerFile cannot guarantee that all AI systems will honor these declarations, their publication creates a formal, machine-readable record of the entity's stated intent.

Field	Type	Required	Definition
trainingDataConsent	enum	Optional	Whether the entity permits the contents of this profile to be used as training data for AI/ML models. Values: `permitted` (unrestricted use for AI training); `not_permitted` (use for AI training is not authorized); `conditional` (permitted subject to conditions described in `notes` or at `aiInquiriesContact`).
retrievalConsent	enum	Optional	Whether the entity permits this profile to be used in real-time retrieval systems (e.g. RAG pipelines, search-augmented AI responses). Distinct from training consent: an entity may permit retrieval but not training. Values: `permitted` (retrieval use is authorized); `not_permitted` (retrieval use is not authorized); `conditional` (permitted subject to conditions).
requiresAttribution	boolean	Optional	Whether the entity requires attribution when its data is used or cited by AI systems. If `true`, the preferred attribution format should be provided in `attributionFormat`.
attributionFormat	string	Optional	The preferred attribution string to use when citing this entity's data. Only meaningful when `requiresAttribution` is `true`. Example: `"Multiplier Advisors (multiplieradvisors.com)"`
dataFreshnessSLA	enum	Optional	How frequently this profile is reviewed and updated by the entity. Consumers can use this to assess whether the data is likely still current. Values: `realtime` (continuously maintained); `weekly` (reviewed weekly); `monthly` (reviewed monthly); `quarterly` (reviewed quarterly); `annually` (reviewed annually); `as_needed` (updated when significant changes occur).
expiryDate	date	Optional	ISO 8601 date after which consumers should treat this profile as potentially stale and seek a more current version. If omitted, the profile does not have a declared expiry.
aiInquiriesContact	string	Optional	An email address or URL where AI companies, researchers, or developers can direct inquiries about data use permissions for this entity.
notes	string	Optional	Free-text field for additional context about this entity's AI data use policies, particularly useful when any consent field is set to `conditional`.
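
The following Python sketch shows one way a consumer might read the aiPolicy fields in the table above. The helper names and the `"undeclared"` return value are assumptions made for this example. The schema only says these fields are optional and does not define what an omitted consent field means. The sample values are illustrative, with the attribution string taken from the `attributionFormat` row.

```python
from datetime import date

CONSENT_FIELDS = {"training": "trainingDataConsent", "retrieval": "retrievalConsent"}


def consent_for(ai_policy, purpose):
    """Return the declared consent for 'training' or 'retrieval'.

    Returns 'undeclared' (a label invented for this sketch) when the
    optional field is absent.
    """
    return ai_policy.get(CONSENT_FIELDS[purpose], "undeclared")


def attribution_for(ai_policy):
    """Return the preferred attribution string, or None if none is required."""
    if ai_policy.get("requiresAttribution") is True:
        return ai_policy.get("attributionFormat")
    return None


def is_potentially_stale(ai_policy, today):
    """True if today is after expiryDate; False when no expiry is declared."""
    expiry = ai_policy.get("expiryDate")
    if expiry is None:
        return False
    return today > date.fromisoformat(expiry)


# Illustrative policy: retrieval allowed, training not allowed.
ai_policy = {
    "trainingDataConsent": "not_permitted",
    "retrievalConsent": "permitted",
    "requiresAttribution": True,
    "attributionFormat": "Multiplier Advisors (multiplieradvisors.com)",
    "dataFreshnessSLA": "quarterly",
    "expiryDate": "2026-01-01",
}

print(consent_for(ai_policy, "training"))
print(consent_for(ai_policy, "retrieval"))
print(attribution_for(ai_policy))
print(is_potentially_stale(ai_policy, date(2026, 3, 1)))
```

Keeping training and retrieval as separate lookups mirrors the distinction drawn in the `retrievalConsent` definition: an entity may permit one use and not the other.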

Verification Protocol

The CrawlerFile verification protocol allows any crawler or consumer to independently confirm that the entity named in a profile has authorized its publication — without contacting CrawlerFile directly. Three verification methods are supported, declared in the authorization.verificationMethod field.

pageBased (recommended): The entity creates a page on their own domain containing the verificationToken as visible text. The entityVerificationUrl points to this page. Any crawler can fetch the page and confirm the token appears there. This is the recommended method — it requires no technical access beyond the ability to add a page, works on any website platform, and creates a human-readable public declaration of authorization.

metaTag: The entity places the verificationToken as a meta tag in the <head> of a page on their domain. Format: <meta name="crawlerfile-verification" content="[token]">. The entityVerificationUrl points to the page containing the meta tag, typically the homepage. Requires code injection access on the entity's website platform.

dnsTxt: The entity adds the verificationToken as a DNS TXT record on their domain. No entityVerificationUrl is needed — the crawler queries DNS for the entityDomain directly. This method is platform-independent and survives website redesigns, but requires access to DNS settings.
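
For the metaTag method, a crawler that wants to read the token from the tag itself, and not just search the raw page text, could use a sketch like the one below. It relies only on Python's standard `html.parser` module. The class and function names are invented for this example, and the sample HTML and token are placeholders.

```python
from html.parser import HTMLParser


class MetaTokenParser(HTMLParser):
    """Collects the content of every crawlerfile-verification meta tag."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "crawlerfile-verification" and attributes.get("content"):
            self.tokens.append(attributes["content"])


def meta_tokens(html):
    """Return all verification tokens declared via meta tags in the HTML."""
    parser = MetaTokenParser()
    parser.feed(html)
    return parser.tokens


# Placeholder page and token, for illustration only.
page = """
<html><head>
  <title>Example Inc.</title>
  <meta name="crawlerfile-verification" content="123e4567-e89b-12d3-a456-426614174002">
</head><body>Welcome</body></html>
"""

print(meta_tokens(page))
```

A profile's `verificationToken` would then be compared against the returned list. This is stricter than a plain substring search of the page, which is all the verification steps require.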

Step 01

Read the authorization object from the CrawlerFile profile. Note the verificationMethod and verificationToken.

Step 02

For pageBased or metaTag: fetch the entityVerificationUrl. For dnsTxt: query DNS TXT records for entityDomain.

Step 03

Confirm the verificationToken string appears in the fetched page content or in the returned TXT records. If it does, the entity has authorized this profile.
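
The three steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `verify_profile` and its two callback parameters are names invented here. Network access is passed in as functions so the sketch stays self-contained. In practice `fetch_url` might wrap an HTTP client and `lookup_txt` a DNS resolver library. How the token is laid out inside a TXT record (bare or with a prefix) is not specified, so the sketch uses the same substring check for every method.

```python
def verify_profile(profile, fetch_url, lookup_txt):
    """Return True if the verification token is found at the entity's location.

    fetch_url(url) -> str        : hypothetical callback returning page content
    lookup_txt(domain) -> [str]  : hypothetical callback returning TXT record strings
    """
    # Step 01: read the authorization object.
    auth = profile["authorization"]
    method = auth["verificationMethod"]
    token = auth["verificationToken"]

    # Step 02: fetch from the location implied by the method.
    if method in ("pageBased", "metaTag"):
        content = fetch_url(auth["entityVerificationUrl"])
    elif method == "dnsTxt":
        content = "\n".join(lookup_txt(profile["entityDomain"]))
    else:
        raise ValueError(f"unsupported verificationMethod: {method}")

    # Step 03: confirm the token appears. An empty token never verifies,
    # since the empty string is a substring of everything.
    return bool(token) and token in content


# Placeholder profile and stubbed lookups, for illustration only.
TOKEN = "123e4567-e89b-12d3-a456-426614174002"
profile = {
    "entityDomain": "example.com",
    "authorization": {"verificationMethod": "dnsTxt", "verificationToken": TOKEN},
}

verified = verify_profile(
    profile,
    fetch_url=lambda url: "",
    lookup_txt=lambda domain: ["v=spf1 -all", TOKEN],
)
print(verified)
```

Injecting the two lookups also makes the flow easy to test offline, since neither step contacts CrawlerFile or the network.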

Versioning

CrawlerFile schema versions are identified by the schemaVersion field and documented at stable URLs of the form https://crawlerfile.com/schema/v{N}. This page documents v1.0.

Backwards compatibility: Minor updates to an existing version (clarifications, new optional fields) do not change the version number. Breaking changes — renamed required fields, changed enum values, structural changes — always increment the version. Existing files are never retroactively invalidated; older schema versions remain documented at their original URLs.
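
As a small illustration of the URL pattern, the Python sketch below derives the documentation URL from a schemaVersion string. It assumes that {N} is the leading (major) component of the version. That matches v1.0 being documented at https://crawlerfile.com/schema/v1, but the rule is an inference, not something stated in this reference, and the function name is invented for the example.

```python
def schema_docs_url(schema_version):
    """Build the schema documentation URL for a version string such as "1.0".

    Assumption: {N} in the URL pattern is the part before the first dot.
    """
    major = schema_version.split(".")[0]
    if not major.isdigit():
        raise ValueError(f"unexpected schemaVersion: {schema_version!r}")
    return f"https://crawlerfile.com/schema/v{major}"


print(schema_docs_url("1.0"))
```

A consumer could compare this result with the file's own `schemaDocs` value as a consistency check, treating a mismatch as a warning only.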