What is Microdata? A detailed look at Schema.org and Other Structured Data Types

The world of web development and SEO has been using “structured data” to embed semantically rich information into web content for many years now. (If you missed my earlier post about Microdata, you can catch it here…) Prior to 2011 or so, formats such as Microformats and RDFa were the most commonly found types of structured data. However, in 2011, the major search engines in the US threw their weight behind the developing format known as Microdata, which is supported by a vocabulary maintained at Schema.org. HTML 5, as well, is including Microdata as an ‘official’ format for structured data. These two events ensure Microdata’s continued importance and are the impulse behind this more comprehensive post about ‘what Microdata is and how it relates to existing types of structured data.’

What is Microdata and how does it relate to Microformats, RDFa, etc.?

To get a handle on the relationship between Microdata, Microformats, RDFa and others, it’s easiest to recall a few names, as a simple reminder of how technical innovations usually evolve:

The Microsoft PC vs. The Apple Macintosh
Blu-ray vs. Toshiba’s HD DVD
VHS vs. Beta
iPhone vs. Android

You’re probably getting my drift here…

The reason why understanding the differences between Microdata, Microformats and RDFa can seem so challenging is that, well, they’re all very similar in what they’re trying to do. All three are efforts to supply semantic, definition-based content to the machines reading your web page.

But let’s try to understand this in plain English (or plain HTML…)

Suppose I’m designing a website and I have recurring footer information which features the address of the website’s owner. I might mark up something like this:

[xml]
<footer>
<div class="blanketyblank">
<ul style="list-style-type: none;">
<li><a href="http://www.widgetco.com/s.nl/it.I/id.2/.f">The Widget Company</a></li>
<li>3929 Queen’s Corset Highway</li>
<li>Camelot, OH 55775</li>
<li>Telephone: 555-456-7890</li>
</ul>
</div>
</footer>
[/xml]

But, of course, other than the mildly semantic ‘footer’ tag used here, all this information means nothing to a search engine or any other machine. It’s all just a bunch of alphanumeric characters. Structured data formats, however, rely on static or evolving vocabularies and a basic syntax of categorization (RDFa, for instance, uses what are called “subject-predicate-object” expressions to define markup), which together allow a search engine to understand the semantic associations in a page’s content.

Addresses, for instance, were often previously marked up using one of the Microformats called hCard. Using hCard the code sample I just gave will look like this:

[xml]
<footer>
<div>
<ul style="list-style-type: none;" class="vcard">
<!-- fn + org on the same element marks this card as an organization -->
<li><a class="fn org url" href="http://www.widgetco.com/s.nl/it.I/id.2/.f" title="The Widget Company">
<span class="organization-name">The Widget Company</span></a></li>
<!-- hCard address parts must sit inside an "adr" container, side by side rather than nested -->
<li class="adr"><span class="street-address">3929 Queen’s Corset Highway</span><br />
<span class="locality">Camelot</span>, <span class="region">OH</span> <span class="postal-code">55775</span></li>
<li class="tel">Telephone: <span class="value">555-456-7890</span></li>
</ul>
</div>
</footer>
[/xml]

In hCard, specific information is designated as “class” information, which — assuming the machine can read hCard — provides the additional information necessary to interpret the code as representative of an actual organization, with an actual physical address.
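
For comparison, here is a rough sketch of how the same footer might look in RDFa, the format with the “subject-predicate-object” model mentioned earlier. It is only an illustration, not a recommendation: it assumes the RDFa Lite attributes (vocab, typeof, property) and borrows its type and property names from the schema.org vocabulary. The organization is the subject, each property attribute is a predicate, and the marked-up text or link is the object.

[xml]
<footer>
<!-- vocab sets the default vocabulary; typeof declares the subject's type -->
<div vocab="http://schema.org/" typeof="Organization">
<ul style="list-style-type: none;">
<li><a property="url" href="http://www.widgetco.com/s.nl/it.I/id.2/.f"><span property="name">The Widget Company</span></a></li>
<!-- the address is itself a typed resource with its own properties -->
<li property="address" typeof="PostalAddress">
<span property="streetAddress">3929 Queen's Corset Highway</span>,
<span property="addressLocality">Camelot</span>, <span property="addressRegion">OH</span> <span property="postalCode">55775</span>
</li>
<li>Telephone: <span property="telephone">555-456-7890</span></li>
</ul>
</div>
</footer>
[/xml]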

So, then, you might be asking: “Why, if structured data formats already exist, should I care about Microdata and what makes Microdata the most relevant structured data markup for Web development today?”

The answer to that question is actually four-fold…

Although the power of Microformats is substantial, there are limitations to their use. There are currently several “flavors” of Microformats, each of which uses a limited vocabulary to convey certain types of information: for addressing information, one must use hCard; for recipes, hRecipe; and so on. Besides having to learn each Microformat individually, the vocabulary within each is static and predefined, which is a limitation in its own right.
Microdata, unlike its competitors, is now part of the HTML 5 specifications. This means, of course, that one day any HTML 5 compliant piece of software will natively understand Microdata. This is not true of Microformats and RDFa, which are either simply W3C recommendations or exist outside the specifications completely. In other words, in the world of semantic markup metadata, Microdata has essentially won the race to make it into the HTML spec first.
In 2011, all three major search engines (Google, Bing and Yahoo) introduced support for Microdata.
In even bigger news, the three major search engines have teamed up and thrown their support behind the development of a flexible and open-ended vocabulary for Microdata. Governed by the definitions set forth on schema.org, this development is pretty revolutionary. Now, the vocabulary used for this sort of semantic markup can evolve as the world evolves. An open-ended vocabulary like schema.org will allow designers to use the same language to semantically mark up books, movies and companies which didn’t exist at the time the vocabulary was first launched. This allowance for organic growth is a fundamentally unique aspect of Microdata.

Microdata in Practice: Using Microdata and Schema.org for Markup

Using Microdata will actually be pretty intuitive for most web developers. Developers with some experience in Microformats will find the transition even easier.

At a “big picture” level, the central paradigm of Microdata/Schema.org is the concept of items, each of which contains a group of “name-value” pairs. An item’s type is declared explicitly, using a URL reference to the appropriate schema.org page. To keep this explanation “generic,” it is important to remember that, while schema.org is the preferred vocabulary for the search engines, the Microdata format can support references to other vocabularies as well. In fact, there were many vocabularies in use prior to the adoption of schema.org by Google, Bing & Yahoo.

Let’s look at a couple of examples…

To get things started, take a look at the illustration below, which explains what we mean by the expressions “item,” “property” and “name-value pair.”

In the example above, the “item” consists of everything following the attribute “itemscope.” Properties for the item are defined using the “itemprop” attribute which, in this case, shows three different properties for, respectively, “name,” “band” and “nationality.”
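
As a rough sketch of the kind of markup that description refers to, here is a single item with those three properties. It is modeled on the style of the W3C draft’s introductory examples, and the values are placeholders I’ve assumed for illustration.

[xml]
<!-- itemscope creates the item; everything inside the div belongs to it -->
<div itemscope>
<!-- each itemprop names a property; the element's text is its value -->
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called <span itemprop="band">Four Parts Water</span>.</p>
<p>I am <span itemprop="nationality">British</span>.</p>
</div>
[/xml]

Each “name-value pair” is simply the itemprop name together with the content of the element that carries it, for example “nationality” and “British.”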

There are many different sorts of properties and the schema.org vocabulary extends those possibilities even further. For instance, URLs can be properties, as can dates, times and other event indicators.

Here, for instance, is an ‘img’ tag marked up with a property:

[xml]
<div itemscope>
<img itemprop="image" src="http://imaginary_url.com/image.png" alt="I'm an Image">
</div>
[/xml]

And here is an example of markup with embedded date/time information. (Note: Most of these examples are modified from the W3C’s Editors’ Draft of Microdata.)

[xml]
<div itemscope>
I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
</div>
[/xml]

There are no limitations on the number of properties an item might have in Microdata and, in fact, it’s perfectly fine to mark up items with multiple properties and values. For instance, the following is a single item, even though the same property appears more than once among the descendants of the initial itemscope.

[xml]
<div itemscope>
<p>My favorite flavors of ice cream include:</p>
<ul>
<li itemprop="flavor">Butter Pecan</li>
<li itemprop="flavor">Neapolitan</li>
</ul>
</div>
[/xml]

By default, descendants of an original itemscope attribute are associated with that original item. It is also possible to embed items within “top level” items. In other words, an item’s properties can themselves be items. Consider this example:

[xml]
<div itemscope>
<p>Name: <span itemprop="name">Earl Fuller’s Famous Jazz Band</span></p>
<p>Band: <span itemprop="band" itemscope> <span itemprop="name">Jazz Band</span> (<span itemprop="size">8</span> players)</span></p>
</div>
[/xml]

Here, the “top level” item has two properties which are “name” and “band.” The “band” property is an item in its own right, with two of its own properties, “name” and “size.”
There are other, more sophisticated ways to associate items and name/value pairs, including the use of the attribute itemref with element IDs. A more detailed discussion of these methods can be found in section 5.1.2.
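
To give a feel for itemref, here is a minimal sketch. The IDs “band-name” and “band-size” are names I made up for illustration. The itemref attribute takes a space-separated list of element IDs, and any properties found in those elements are added to the item, even though they are not its descendants.

[xml]
<!-- the item itself is empty; itemref pulls in properties from elsewhere on the page -->
<div itemscope itemref="band-name band-size"></div>

<p id="band-name">Name: <span itemprop="name">Earl Fuller's Famous Jazz Band</span></p>
<p id="band-size">Size: <span itemprop="size">8</span> players</p>
[/xml]

This can be handy when the pieces of information about one thing are scattered around a page layout and can’t easily be wrapped in a single container element.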

Using Microdata and Schema.org: The Schema.org Vocabulary

As I mentioned previously, a big advantage of Microdata right now is the weight being thrown behind the development of Schema.org by the “Big Three” search engines. Schema.org is an effort to develop a single, unified and extensible vocabulary for Microdata markup, one which can grow as, basically, human knowledge grows. (Wow, that was a pretty heady sort of claim… But, really, that’s what all this stuff is about: creating a way to document human achievements and translate them into a machine-readable format…)

Implementing Schema.org into your markup, then, is really just a matter of referencing types and properties, using the lexicon defined on schema.org. This lexicon arranges types hierarchically, in order of specificity. The best way to get a handle on this might be to take a look at this chart, a portion of which I’ve also screen captured below…

Putting it All Together

So, then, let’s take all this schema specific information and illustrate some markup…

For some reason, Google and the structured markup world seem to have had an eternal fetish with all things Hollywood. So I guess it should come as no surprise that one of the first categories of vocabularies to be given the most attention on schema.org is that portion of the vocabulary related to describing creative works, especially movies. Suppose we had the following description of a movie, written without any Microdata and wanted to mark it up semantically:

[xml]
<div>
<h1>The Avengers</h1>
<span>Director: Joss Whedon (born June 23, 1964)</span>
<span>Action Adventure</span>
<a href="http://marvel.com/avengers_movie">Trailer</a>
</div>
[/xml]
Without schema.org, we’d start with just an itemscope attribute. The only thing we need to do differently when using schema.org’s vocabulary is to add an itemtype attribute that points to a type from that vocabulary. So, then, our first alteration to the code above looks like:
[xml]
<div itemscope itemtype="http://schema.org/Movie">
<h1>The Avengers</h1>
<span>Director: Joss Whedon (born June 23, 1964)</span>
<span>Action Adventure</span>
<a href="http://marvel.com/avengers_movie">Trailer</a>
</div>
[/xml]

Note that, in the above itemtype attribute, I jumped right down to the level of specificity I needed. In other words, “Movie” is actually a specific descendant type within the hierarchy beginning with “Thing.” I didn’t have to spell out that hierarchy (“Thing>>CreativeWork>>Movie”) but, rather, just jumped right to the “Movie” type.

Next, I can add itemprop attributes to my markup, to get even more semantic:

[xml]
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">The Avengers</h1>
<span>Director: <span itemprop="director">Joss Whedon</span> (born June 23, 1964)</span>
<span itemprop="genre">Action Adventure</span>
<a href="http://marvel.com/avengers_movie" itemprop="trailer">Trailer</a>
</div>
[/xml]

Finally, just like we did in our non-schema.org example previously, we can embed items and properties within other items. In the markup below, we’re adding the “Person” itemtype to indicate that the Director of the film is, in fact, a “Person” and we add his date of birth as a property.

[xml]
<div itemscope itemtype="http://schema.org/Movie">
<h1 itemprop="name">The Avengers</h1>
<div itemprop="director" itemscope itemtype="http://schema.org/Person">
<!-- schema.org property names are case-sensitive (birthDate); datetime carries the machine-readable date -->
Director: <span itemprop="name">Joss Whedon</span> (born <time itemprop="birthDate" datetime="1964-06-23">June 23, 1964</time>)
</div>
<span itemprop="genre">Action Adventure</span>
<a href="http://marvel.com/avengers_movie" itemprop="trailer">Trailer</a>
</div>
[/xml]

So, as we can see from the above examples, marking up Microdata within the Schema.org lexicon is essentially the same as marking up Microdata in its absence. The syntax is exactly the same; the only real difference is that one is referencing the preexisting Schema vocabulary.
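
To bring things full circle, here is a sketch of how the footer address from the beginning of this post might be marked up with Microdata and schema.org. I’m assuming the schema.org “Organization” and “PostalAddress” types and their properties here; treat it as one reasonable way to do it rather than the only one.

[xml]
<footer>
<div itemscope itemtype="http://schema.org/Organization">
<ul style="list-style-type: none;">
<!-- on a link, the value of itemprop="url" is taken from the href -->
<li><a itemprop="url" href="http://www.widgetco.com/s.nl/it.I/id.2/.f"><span itemprop="name">The Widget Company</span></a></li>
<!-- the address property is an embedded item of its own type -->
<li itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">3929 Queen's Corset Highway</span>,
<span itemprop="addressLocality">Camelot</span>, <span itemprop="addressRegion">OH</span> <span itemprop="postalCode">55775</span>
</li>
<li>Telephone: <span itemprop="telephone">555-456-7890</span></li>
</ul>
</div>
</footer>
[/xml]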

Conclusion

When I sat down to write this post, I found an interesting question on Stack Overflow where someone had asked for clarification of the relationship between the various structured data formats for the web. Originally, the responder had given breakdowns of the three formats we discussed here but then, several months later (in late 2011), he added a big fat annotation, which read…

“Times have changed. Forget my recommendation below. Just use Microdata and forget that the other two exist.”

After I finished laughing, I thought about his suggestion and — for the newcomer to structured data — it makes perfect sense. The reality is that:

Microdata will be part of the HTML 5 standard moving forward. As a web developer, you may as well use it because it’s going to be standard mark-up soon.
The only reason you probably care about structured data in the first place is that it gives machines (read: search engines) additional information about what the contents of any given page really mean. And, since the big 3 search engines have all announced their standardization around Microdata, you’re going to have to use it eventually for that reason, as well…
