Dataset SEO: The Schema Label Google Is Actually Looking For

My Joomla sites were in the Dataset report. My PHP/JSON sites weren't. The infrastructure was identical. The labels were missing. That's the whole story ... except for what we found while fixing it.

Krisada, 7 min read

My Joomla sites were showing up in Google Search Console's Dataset report. My PHP/JSON flat-file sites weren't.

Same structured data approach. Same AI foundation layer. Same /ai/catalog.json endpoints, same machine-readable JSON throughout. Different result in GSC.

That's the kind of discrepancy that nags at you. Not urgent. Just wrong.

Turns out it wasn't a data problem. It wasn't an architecture problem. The infrastructure was already right -- it was literally running as a dataset network. It was a labeling problem. Google couldn't see the datasets because I never formally declared them as datasets.

And the fix was simpler than the problem had any right to be.

What Google's Dataset Report Actually Tracks

Before getting into the fix, it's worth being precise about what Google's Dataset report in Search Console is actually watching for.

It's not looking for data files. It's not looking at whether your site has JSON endpoints or CSV downloads or any particular file structure.

It's looking for @type:Dataset in your JSON-LD.

That's it. If your schema doesn't explicitly declare @type:Dataset somewhere, GSC has nothing to track. Your site is invisible to that report regardless of how data-rich the underlying architecture actually is.
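A quick way to check whether a page declares a Dataset at all is to pull out its JSON-LD blocks and list the @type values they contain. The sketch below does that in Python with only the standard library. The sample HTML and the names JsonLdExtractor, iter_nodes, and declared_types are made up for illustration, and this is only a rough stand-in for whatever Google's own parser does.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def iter_nodes(value):
    """Yield every dict nested anywhere inside a parsed JSON-LD value."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def declared_types(html):
    """Return the set of @type strings declared in a page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found = set()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing the audit
        for node in iter_nodes(data):
            node_type = node.get("@type")
            types = node_type if isinstance(node_type, list) else [node_type]
            found.update(t for t in types if isinstance(t, str))
    return found


# Hypothetical page: structured data is present, but no Dataset is declared.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "WebSite", "name": "Example"},
  {"@type": "Article", "headline": "Example post"}
]}
</script></head><body></body></html>"""

print("Dataset" in declared_types(SAMPLE_PAGE))  # False
```

A page like the sample has perfectly valid schema and still gives the Dataset report nothing to track.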

This matters because it's counterintuitive. A site running entirely on flat JSON files, with a full /ai/ endpoint layer, with a catalog.json that indexes every piece of published content ... that site might have no Dataset schema at all. Which is exactly what I had.

The Joomla sites presumably showed up because some extension was emitting recognizable Dataset markup without my asking for it, or because Google inferred datasets from feed structures. The PHP/JSON sites were more sophisticated architecturally and completely absent from the report. Go figure.

Dataset Schema Isn't Just for Data Scientists

Here's the reframe that unlocked the whole fix.

I'd been thinking about Dataset schema the way most SEOs do -- as something for academic data portals, government statistics, research institutions. The kind of site where someone actually downloads a spreadsheet.

That's too narrow.

Dataset is schema.org's formal type for "a body of structured information describing some topic(s) of interest." That definition covers a lot more ground than most people realize:

-- A content library organized by topic is a Dataset.
-- A category of articles covering AI SEO is a Dataset.
-- An /ai/catalog.json endpoint listing every published piece is a Dataset.
-- The whole network of interconnected sites sharing structured data is a DataCatalog.

Once you make that shift, you stop asking 'do I have datasets?' and start asking 'which of my datasets haven't I labeled yet?'

For a flat-file PHP/JSON site, the answer is: probably all of them.

The Hierarchy That Actually Works

Schema.org gives you two types to work with here:

@type:DataCatalog -- the container. A named collection of datasets with a URL pointing to the machine-readable index.

@type:Dataset -- an individual dataset. Has a name, description, distribution (where to get the actual data), and can be nested inside a DataCatalog or inside a parent Dataset.
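As a rough illustration of those two types, here is a minimal sketch that builds one DataCatalog node and one Dataset node as Python dicts and prints them as JSON-LD. The domain example.com, the @id fragments, and the name and description strings are placeholders, not values from any real site. The property names (includedInDataCatalog, distribution, DataDownload, contentUrl, encodingFormat) come from schema.org.

```python
import json

SITE_URL = "https://example.com"  # hypothetical domain

# The container: a named catalog whose URL is the machine-readable index.
data_catalog = {
    "@type": "DataCatalog",
    "@id": f"{SITE_URL}/#datacatalog",
    "name": "Example Site Data Catalog",
    "url": f"{SITE_URL}/ai/catalog.json",
}

# One dataset: name, description, where to fetch it, and which catalog holds it.
site_dataset = {
    "@type": "Dataset",
    "@id": f"{SITE_URL}/#dataset",
    "name": "Example Site Content Library",
    "description": (
        "Every published article on the example site, exposed as "
        "machine-readable JSON through the /ai/ endpoint layer."
    ),
    "url": f"{SITE_URL}/",
    "includedInDataCatalog": {"@id": data_catalog["@id"]},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "application/json",
        "contentUrl": f"{SITE_URL}/ai/catalog.json",
    },
}

graph = {"@context": "https://schema.org", "@graph": [data_catalog, site_dataset]}
print(json.dumps(graph, indent=2))
```

Linking by @id rather than embedding the catalog inside the dataset keeps each node defined once, which matters when the same nodes appear in the graph on every page.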

The right structure for a modern content site looks like this:

DataCatalog at the site root ... points to /ai/catalog.json as its URL ... appears in the base schema graph on every page so Google sees it everywhere.

Site-level Dataset ... the full content library ... linked to the DataCatalog via includedInDataCatalog ... distribution pointing to catalog.json for the actual machine-readable output.

Category/collection Datasets ... one per topic category ... each linked to its parent Dataset via isPartOf ... for nested categories, the chain runs all the way up. Sub-category links to parent category Dataset, which links to site Dataset, which links to DataCatalog.

Individual content type Datasets ... stat pages, data-heavy content ... these already existed in the schema setup. Now they have a catalog above them that makes sense.

The whole thing self-assembles if you build it at the right level. The DataCatalog and site Dataset go in the base graph -- they appear on every page automatically. Category Datasets get added in the category page schema builder, and the parent chain resolves itself from the category's own parent_slug metadata.

Once it's wired up, adding a new category means a new Dataset node appears automatically. No manual schema work per category.
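The parent-chain resolution described above can be sketched roughly as follows. This is an assumption-heavy illustration in Python, not the site's actual PHP builder: the CATEGORIES dict, the /category/ URL pattern, the @id scheme, and the function names are all hypothetical. Only parent_slug and isPartOf are taken from the text.

```python
SITE_URL = "https://example.com"            # hypothetical domain
SITE_DATASET_ID = f"{SITE_URL}/#dataset"    # hypothetical @id of the site-level Dataset

# Hypothetical category metadata, as it might be loaded from flat JSON files.
CATEGORIES = {
    "ai-seo": {
        "name": "AI SEO",
        "description": "Articles on AI SEO.",
        "parent_slug": None,
    },
    "dataset-seo": {
        "name": "Dataset SEO",
        "description": "Articles on Dataset SEO.",
        "parent_slug": "ai-seo",
    },
}


def category_dataset_id(slug):
    """Stable @id for a category's Dataset node."""
    return f"{SITE_URL}/category/{slug}/#dataset"


def category_dataset_node(slug, categories):
    """Build a category Dataset whose isPartOf points at its parent Dataset."""
    category = categories[slug]
    parent_slug = category.get("parent_slug")
    # Top-level categories attach directly to the site-level Dataset.
    parent_id = category_dataset_id(parent_slug) if parent_slug else SITE_DATASET_ID
    return {
        "@type": "Dataset",
        "@id": category_dataset_id(slug),
        "name": f"{category['name']} articles",
        "description": category["description"],
        "url": f"{SITE_URL}/category/{slug}/",
        "isPartOf": {"@id": parent_id},
    }


def parent_chain(slug, categories):
    """Follow parent_slug links from a category up to the site Dataset."""
    chain, seen, current = [], set(), slug
    while current is not None:
        if current in seen:
            raise ValueError(f"parent_slug cycle at {current!r}")
        seen.add(current)
        chain.append(category_dataset_id(current))
        current = categories[current].get("parent_slug")
    chain.append(SITE_DATASET_ID)
    return chain


print(category_dataset_node("dataset-seo", CATEGORIES)["isPartOf"])
print(parent_chain("dataset-seo", CATEGORIES))
```

Because each node is derived from the category's own metadata, adding a new entry to the category data is enough to produce a new Dataset node. The cycle check guards against a category that accidentally lists itself as an ancestor.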

FAQs: Not Dead, Just Repositioned

While we were in the schema system, a related question came up about FAQ schema -- Google killed the FAQ rich result in May 2026 and is removing the Search Console tracking in June.

Short answer: keep the FAQ schema. Don't touch it.

The FAQs were never just about the rich result. They're structural Q&A that skim readers can pull value from without reading the whole piece. And the additional entity and topic structure plausibly still helps Google understand the page, even without the visual enhancement. Google is probably downplaying FAQ markup to tamp down the abuse that made every second page on the internet a fake FAQ block.

For AI systems specifically, the FAQ format maps directly to how they prefer to extract information -- clean question, clean answer, minimal interpretation required. Keep it.
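For reference, the question-and-answer shape in question looks roughly like this when generated from plain pairs. It is a minimal sketch in Python: the helper name faq_page_node and the sample question are invented for illustration, while FAQPage, Question, acceptedAnswer, and Answer are the schema.org types and properties.

```python
import json


def faq_page_node(pairs):
    """Build a FAQPage node from (question, answer) string pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Hypothetical FAQ content.
faq = faq_page_node([
    (
        "Does a content site need Dataset schema?",
        "Only if you want its structured collections formally declared.",
    ),
])
print(json.dumps(faq, indent=2))
```

Each entry is one clean question with one clean answer, which is the property that makes the format easy to extract from.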

The rich result was a nice bonus while it lasted. The underlying value was always in the structure.

What We Found While Fixing the Labels

The Dataset work turned up something adjacent that's worth calling out separately.

When we wired up the schema nodes to read site identity from a central config file instead of hardcoded strings, we found drift.

Three places declaring the same site name and description -- the PHP constants file, the schema node functions, and the site-settings JSON -- with three slightly different values. None of them wrong enough to cause obvious problems. All of them wrong enough to create inconsistency in what different AI systems were reading.

This is how schema rot happens. You update one thing, not the other two. Nobody notices because nothing breaks visibly.

The fix was to adopt the same pattern used on krisada.com -- a site_config() singleton that loads the canonical JSON file once and serves it to every function that needs it. Change the site description in one JSON file, and it propagates to the schema, the meta tags, the catalog nodes, everything.
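The single-source pattern can be sketched like this. The real site_config() is PHP and its internals aren't shown here, so treat this Python version as an illustration of the idea only: the file location, the JSON keys name and description, and the two consumer functions are assumptions.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical location of the canonical settings file (a temp dir for the demo).
CONFIG_PATH = Path(tempfile.gettempdir()) / "site-settings.json"


@lru_cache(maxsize=1)
def site_config():
    """Load the canonical JSON once; later calls reuse the same cached dict."""
    with CONFIG_PATH.open(encoding="utf-8") as handle:
        return json.load(handle)


def site_dataset_name():
    """Schema layer: reads the site name from the shared config."""
    return f"{site_config()['name']} Content Library"


def meta_description():
    """Meta tag layer: reads the description from the same shared config."""
    return site_config()["description"]


# Demo: write a hypothetical settings file, then read it through both consumers.
CONFIG_PATH.write_text(
    json.dumps({"name": "Example Site", "description": "Research notes on AI SEO."}),
    encoding="utf-8",
)
print(site_dataset_name())             # Example Site Content Library
print(meta_description())              # Research notes on AI SEO.
print(site_config() is site_config())  # True: the file is only parsed once
```

With every consumer reading from the same cached dict, there is no second copy of the site name left to drift.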

This isn't a Dataset SEO issue specifically. It's the kind of drift that shows up in any system where the same value lives in multiple places. But we only found it because we were looking at the schema layer closely enough to notice.

Data-driven schema is more trustworthy than hardcoded schema. Full stop.

The Bigger Picture: Your Site Architecture Is Already a Dataset Network

Here's the thing I keep coming back to.

The PHP/JSON flat-file approach I use across the AI Digital Karma Federation wasn't designed to look like a dataset network. It was designed to be fast, maintainable, and machine-readable without a database layer.

But the architecture it produced -- articles in JSON, categories in JSON, taxonomy in JSON, AI foundation endpoints at /ai/ -- that IS a dataset network. It just wasn't labeled as one.

This is the real unlock of Dataset SEO for modern content sites:

You probably already have the infrastructure. The question is whether you've declared it in a way that machines -- including Google -- can formally recognize.

The catalog.json on every federation site is a Dataset. The category hierarchy is a nested Dataset structure. The /ai/manifest.json is a DataCatalog index. The architecture was right. The schema layer was missing.

Once you add the labels, Google can classify what you already built. Not just in the Dataset report -- but in entity understanding, knowledge graph alignment, and the citation eligibility that matters for AI-mediated search.

Schema.org Dataset and DataCatalog types are how you formally hand over your site's architecture to the machines that are trying to understand it. If you've built a structured, machine-readable content system and you're not using these types, you're leaving that handover incomplete.

The Label Is the Signal

Dataset SEO comes down to one shift in thinking.

Stop asking whether your site has structured data. Start asking whether you've formally declared the structure you already have.

If you're running a modern content site on any kind of JSON-based or structured data architecture, you almost certainly have datasets. They're your content library. Your topic categories. Your AI endpoint layer. Your catalog file.

The declaration is the schema. The schema is the signal. And the signal is what puts you in the Dataset report, the knowledge graph, and eventually the citation pool that AI systems pull from when composing answers.

The infrastructure was right. It was the labels that were missing.

Fix the labels.

Is Your Schema Hierarchy Complete?

DataCatalog, Dataset, category chain -- if you haven't audited your JSON-LD graph as a system, it probably has gaps.
