---
"@context": "https://schema.org"
"@type": "TechArticle"
"@id": "https://dasarakushi.com/notes/where-search-intent-lives"
headline: "One canonicalization rule for cannibalization, duplicate content, and crawl budget"
description: "A canonicalization rule for programmatic catalogs at scale. Canonicals identify entities; properties are handled by title."
author: { "@type": "Person", name: "Dasara Kushi", url: "https://dasarakushi.com" }
datePublished: "2026-05-05"
dateModified: "2026-05-08"
keywords: "canonicalization, programmatic SEO, faceted navigation, URL cannibalization, duplicate content, crawl budget, entity resolution"
_format: "text/markdown"
_canonical: "https://dasarakushi.com/notes/where-search-intent-lives"
---

# One canonicalization rule

> Fix cannibalization, duplicate/thin content, and crawl-budget waste with one decision rule: canonicals identify entities, and properties are handled by title.

Most catalog sites at scale publish more URLs than search intent justifies. The waste shows up as cannibalization — multiple pages competing for the same query, the site competing with itself, ranking signal pooling in the wrong place. The standard fixes (canonical everything to the parent, or make every facet self-canonical) are each wrong half the time, because the right answer depends on something the template doesn't know.

The right question: **for this URL, where does search intent actually live?** Intent lives at **entities**, not **properties**. If a URL identifies a distinct entity, it earns its own canonical. If it's only a property of an entity represented elsewhere, canonical it to the entity and let the title handle query-match. Apply that recursively at every facet level.

## 1. Domain one — people-search

A site indexing 100M+ records has thousands of John Smiths, with URLs like `/john-smith`, `/john-smith/california`, `/john-smith/california/los-angeles`.

- **Uncommon name (Rajesh Chatraptra):** records in Maine, New York, and California indicate one person who relocated, not three people. Entity resolution collapses them into one entity; the states are biographical *properties*; all state pages canonical to root.
- **Common name (John Smith):** one in Tucson and one in New York are distinct entities in different cities. The root is an aggregator; recurse down until a specific person is identified; each city URL that identifies one is self-canonical.

The unifying mechanic: **recursively disambiguate until an entity is identified, then stop — everything below that depth is a property of that entity.**

The naive version counts records per facet and canonicals to root when the count hits one. It gets the *Toronto case* wrong: John Smith in Toronto, Ontario vs. Toronto, Ohio are two distinct people at the same name×city string — each needs its own canonical. Count was a proxy for entity identity; condition on entity identity directly:

```
e_global = distinct_entities(name)         // entities after resolution
e_facet  = distinct_entities(name, state)  // entities at this facet

// Rules are checked top to bottom; the first match wins.
if e_facet == 0                     -> don't generate
if e_global == 1                    -> all facet URLs canonical to root
if e_facet == 1 and e_global >= 2   -> state URL self-canonical
if e_facet >= 2                     -> self-canonical, recurse into cities
```
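The decision table above can be sketched in Python. This is a minimal illustration, not a production implementation: the `Record` shape, the `entity_id` field (assumed to come from an upstream entity-resolution step), the decision labels, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    entity_id: str  # hypothetical: ID assigned by an upstream entity-resolution step
    name: str
    state: str


def distinct_entities(records, **facets):
    """Count resolved entities whose records match every given facet value."""
    return len({
        r.entity_id
        for r in records
        if all(getattr(r, key) == value for key, value in facets.items())
    })


def state_url_decision(records, name, state):
    """Apply the four rules in order; the first match wins."""
    e_global = distinct_entities(records, name=name)
    e_facet = distinct_entities(records, name=name, state=state)
    if e_facet == 0:
        return "do_not_generate"
    if e_global == 1:
        return "canonical_to_root"
    if e_facet == 1:
        return "self_canonical"
    return "self_canonical_and_recurse"


# Hypothetical sample data: one relocated person, three distinct John Smiths.
records = [
    Record("p1", "rajesh-chatraptra", "maine"),
    Record("p1", "rajesh-chatraptra", "california"),
    Record("p2", "john-smith", "arizona"),
    Record("p3", "john-smith", "new-york"),
    Record("p4", "john-smith", "new-york"),
]

for name, state in [("rajesh-chatraptra", "maine"), ("john-smith", "new-york")]:
    print(name, state, state_url_decision(records, name, state))
```

Because the counts are taken over `entity_id` values rather than rows or URL strings, two people who share a name and a city label still count as two entities, which is the distinction the Toronto case depends on.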

## 2. Domain two — e-commerce variants

Same predicate: *does this URL identify a different entity from its parent canonical?* A shade (MAC Lip Liner Whirl) or a color (Leica Q3 black/silver) is a property -> canonical to the entity, title differentiates. A refurbished unit (different price, warranty, stock pool) is a distinct commercial entity -> self-canonical. Apple treats iPhone 14 256GB as a property; Best Buy treats it as a distinct SKU entity — both correct; the rule respects whichever entity granularity the catalog chose.

Where the entity boundary sits can also be a deliberate, demand-driven SEO call: "red prom dress" starts as a property and gets promoted to an entity once Google Search Console (GSC) shows real striking-distance demand, meaning queries that already rank just outside the top results. The rule fires identically before and after; what moves is the entity boundary, not the rule.
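One way to picture a movable entity boundary is as the set of attributes a catalog treats as entity-defining. The sketch below is illustrative only: the dictionary shape, the attribute names, the URLs, and the two boundary tuples are assumptions, not a description of any real catalog.

```python
def entity_key(variant, entity_attrs):
    """Project a variant onto the attributes the catalog treats as entity-defining."""
    return tuple(variant.get(attr) for attr in entity_attrs)


def same_entity(variant, parent, entity_attrs):
    """True when the variant differs from its parent only in property attributes."""
    return entity_key(variant, entity_attrs) == entity_key(parent, entity_attrs)


def variant_canonical(variant, parent, entity_attrs):
    """Return the URL this variant's canonical tag should point at."""
    if same_entity(variant, parent, entity_attrs):
        return parent["url"]
    return variant["url"]


# Hypothetical catalog rows.
line_page = {"url": "/leica-q3", "line": "leica-q3", "condition": "new"}
silver = {"url": "/leica-q3/silver", "line": "leica-q3",
          "condition": "new", "color": "silver"}
refurbished = {"url": "/leica-q3/refurbished", "line": "leica-q3",
               "condition": "refurbished"}

default_boundary = ("line", "condition")            # color is a property
promoted_boundary = ("line", "condition", "color")  # color promoted to entity

print(variant_canonical(silver, line_page, default_boundary))
print(variant_canonical(refurbished, line_page, default_boundary))
print(variant_canonical(silver, line_page, promoted_boundary))
```

Promoting an attribute changes only the tuple passed in; `variant_canonical` itself stays the same, which mirrors the point that the boundary moves while the rule does not.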

## 3. The formula

For any node u in a faceted URL tree:

```
canonical(u) = u                     if intent_lives_at(u)
             = canonical(parent(u))  otherwise

intent_lives_at(u)  :=  NOT same_entity_as(u, canonical(parent(u)))
intent_lives_at(root) := true        // base case: the root has no parent
```

Two separate concerns: **direction** (where intent lives, a universal recursive structure) and **detection** (whether this URL identifies an entity, which is domain-specific). Swap the entity definition, keep the rest. One predicate, one recursion, many domains.
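A minimal Python sketch of the recursion, with the detection predicate passed in as a parameter so the direction logic stays domain-agnostic. It assumes path-style URLs where the parent is found by dropping the last segment, and it treats the root as its own canonical; the `ENTITY` lookup table and its IDs are hypothetical stand-ins for a real entity-resolution result.

```python
from typing import Callable, Optional


def parent(url: str) -> Optional[str]:
    """Parent in the facet tree: drop the last path segment; the root has none."""
    head, _, _ = url.rstrip("/").rpartition("/")
    return head or None


def canonical(url: str, same_entity_as: Callable[[str, str], bool]) -> str:
    """Direction: inherit the parent's canonical unless this URL is a new entity."""
    p = parent(url)
    if p is None:
        return url  # assumed base case: the root is its own canonical
    target = canonical(p, same_entity_as)
    return target if same_entity_as(url, target) else url


# Hypothetical detection data: URL -> resolved entity ID (None = aggregator page).
ENTITY = {
    "/rajesh-chatraptra": "p1",
    "/rajesh-chatraptra/maine": "p1",
    "/john-smith": None,
    "/john-smith/arizona": "p2",
    "/john-smith/arizona/tucson": "p2",
}


def same_person(u: str, v: str) -> bool:
    """Detection: both URLs resolve to the same non-aggregator entity."""
    return ENTITY[u] is not None and ENTITY[u] == ENTITY[v]


for url in ENTITY:
    print(url, "->", canonical(url, same_person))
```

In this sample the Maine page collapses to the root, while `/john-smith/arizona` becomes self-canonical and the Tucson page below it is treated as a property of that entity. Swapping `same_person` for a product-variant predicate would reuse `canonical` unchanged.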

## 4. Two layers, not one

**Canonicals identify entities. Titles identify properties of entities.** They use different signals because they answer different questions. Ulta publishes a per-shade URL (per-shade title, swatch, reviews) with `rel=canonical` pointing to the consolidated line page. "MAC lip liner" surfaces the line page (authority pools there); "MAC Whirl" surfaces the per-shade page (its title carries the property token). `rel=canonical` is a hint rather than a directive, so Google can route a property query to the property page even when the canonical points elsewhere.
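The two layers can be sketched as a small template helper: the canonical link always names the entity, while the title optionally carries the property token. The function name, its parameters, and the `shop.example` URL are hypothetical and only illustrate the split.

```python
from html import escape


def head_tags(canonical_url, entity_name, property_token=None):
    """Build the two head elements: the title may vary per property, the canonical does not."""
    title = f"{entity_name} {property_token}" if property_token else entity_name
    return (
        f"<title>{escape(title)}</title>\n"
        f'<link rel="canonical" href="{escape(canonical_url)}">'
    )


# Hypothetical line page and per-shade page sharing one canonical target.
line_head = head_tags("https://shop.example/mac-lip-liner", "MAC Lip Liner")
shade_head = head_tags("https://shop.example/mac-lip-liner", "MAC Lip Liner", "Whirl")

print(line_head)
print(shade_head)
```

Both pages emit the same canonical link, so consolidation signals point at the line page, and only the shade page's title contains the property token that a query like "MAC Whirl" can match.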

## 5. Why it fixes three failure modes

All three have the same root cause — treating properties as entities:

- **Cannibalization:** property URLs canonical to their entity, so authority concentrates and URLs stop competing.
- **Duplicate/thin content:** every canonical URL has substantive content (the entity itself).
- **Crawl budget:** Googlebot's crawl activity concentrates on entities, not the long tail of property variants.

Apply it recursively across millions of URLs and a canonical-tag policy starts looking like the design of an indexable ontology. The work stays small: pick the entities; let the rest follow.

*By Dasara Kushi · May 5, 2026 · https://dasarakushi.com/notes/where-search-intent-lives*
