Content is not Data
David Nüscheler, CTO of Day Software and spec lead for the Java Content Repository specifications JSR 170 and 283, likes to say "everything is content." This is a bold statement that is intended to provoke thought, but I think that it is also a reaction to a prevailing view among technologists and database vendors that everything (including content) is just data. While it is true that content, when stored electronically, is just a bunch of 0's and 1's, if you think that content is just data, you need to get out of the server room because that is not how your users see it. There are four main reasons why.
- Content has a voice. Put another way, content is trying to communicate something. Like me writing this blog post, a content author tries to express an idea, make a point, or convince someone of something. Communication is hard and requires a creative process so authoring content takes much more time than recording data. Content is personal. If the author is writing on behalf of a company, there may need to be approvals to ensure the voice and opinion of the company is being represented. The author may refer to raw data to support his point, but he is interpreting. For example, even a graph of data may reflect some decisions about what data to include and how to show them. Because content has a voice, content is subjective. We consider the authority and perspective of the author when we decide whether we can trust it.
- Content has ownership. Data usually do not have a copyright but content does. The people who produce content, like reports, movies, and music, get understandably annoyed when people copy and redistribute their work. While data can be licensed, it is less common. Often data are distributed widely so that more people can provide insight into what they mean. Interestingly, when content is digitally stored as data on a disk, we think less about it in terms of content. For example, we are OK with data backups of copyrighted material even though creating copies is forbidden.
- Content is intended for a human audience. While content management purists strive for a total separation of content and presentation, content authors care about how content is being presented. They may have a lot of control over presentation and obsess over every line wrap, or they may only get to choose which words are bolded or italicized. They will only semantically tag phrases in a document if they know that it will make for a richer experience for the audience. Presentation is not just for vanity's sake. Presentation, when done well, helps the audience understand the content by giving cues as to how things are organized and what is important. While the Semantic Web is all about machines understanding web content, at the end of the day, the machines are just agents trying to find useful information for human eyeballs (and eardrums). Content is authored with the audience in mind while data is just recorded.
- Content has context. In addition to who wrote the content, where it appears also matters. We care greatly how content is classified and organized because we want to make it easier to find. A database table doesn't care about the order of its rows (it is up to the application to determine how they should be sorted). Content contributors really care about where their assets fall in lists (everything from index pages to search results).
These distinctions may seem totally academic but I think they have real implications for the technologies that manage content. Because content is much more than "unstructured data," we can't think about the tools we use to manage and store it just in terms of big text fields in a relational database and forms to update these rows. Content is a personal experience for both the author and the audience and the technology that intermediates needs to be sensitive to that. Every once in a while there is a meme about "content management" becoming an irrelevant term because it will be subsumed into other more process or industry oriented disciplines. If that does happen, it is critical that certain content technology features and concepts carry over.
- Versioning. Content goes through a life cycle of evolution and refinement as groups of contributors work together to achieve the best way to convey the information and ideas. Some content assets (like policies and procedures) are updated hundreds of times over many years as information changes. Other assets go through many rapid iterations over a shorter period of time (such as an intensely negotiated contract). Often participants in a content life cycle need to know just what has changed. For example, a copyeditor can save time by just proofreading the changes since the previous copy edit. A translator may not need to re-translate an asset if only a minor edit was made. Sometimes the history of change can give insight into the spirit of meaning. Versioning is not just for reverting to older versions. A robust versioning system has features like version comparison and annotations.
- Control over the delivery. To effectively communicate, you need to tune your delivery to your audience. WYSIWYG editing and preview both try to give a content contributor the perspective of their audience. WYSIWYG editing gives a non-technical contributor some control over the styling of text. It is important that the WYSIWYG editor gives an accurate representation (as in the same CSS styles) of what a visitor will see. Single page preview puts the content into the context of a page by executing rendering logic. The more complex the rendering logic, the more difficult it is to control what the user sees. For example, if there is some logic to automatically display relevant related content, the preview environment has to have the same content, rendering code, and user session information as the production environment. Oftentimes, this is hard to do. I have had clients really struggle over controlling dynamic rendering logic. For example, a relevance engine automatically associated inappropriate images with articles or showed the same related content multiple times. Some users also like to see how articles show up on dynamic indices and search results. In these complex delivery tiers, preview is a lot more like software QA than simple visual verification - you need to test all the scenarios and parameters. A good practice is to delineate the pages or sections that you want full editorial control over and the other (less important) sections that are not worth the manual effort of controlling.
- Feedback. You can't communicate in a vacuum. You need feedback. However, most content contributors lob their content over the wall and then forget about it. When you are speaking in front of a group you can gauge reaction and make adjustments. As the web turns into a conversation, the content contributor needs to be listening as much as they are telling. Most content contributors underuse web analytics. The more accessible this information can be made, the better. Many web content management systems integrate analytics packages and have nice features like analytics overlays over rendered pages. However, these features are not used enough. More commonly, an analytics report will be circulated around to people who don't understand how to read it. Comments and voting can also be a powerful medium for adjusting and reacting to feedback either by direct response or by using knowledge of the audience in subsequent articles.
- Metadata. While metadata storage is trivial, capturing and using this information is a challenge. Metadata such as source and ownership are critical to tell the audience where the asset comes from (its voice and authority) and how it can be legally used. Metadata is also important for classification and context. Content contributors are notoriously bad at metadata entry: they either neglect or abuse it. Automation is part of the solution, but a good process involves humans with the responsibility for metadata (bring on the librarians!). The best way to leverage and exchange metadata is through standards based formats. Industry oriented formats (like NITF) are important because they have a standard set of metadata built in. Microformats are also useful for highlighting specific bits of standard information within rendered web pages. While most WCM platforms can produce these outputs through their templating tier, very few do any validation of the output. Reviewers just visually validate what they see on a preview page.
- Usability. Most of all, the system needs to be easy to use. Creating content is hard work no matter how you do it. Any system that distracts users from the creative process of developing content, or complicates it, is bound to be unpopular and the first excuse for failure. The ideal content management system disappears from the user's consciousness by being familiar and frictionless - you don't need to think about it and it gives you immediate results. For many people, that is Microsoft Word (until Word tries to outsmart you and take over your document) and I have already mentioned the disturbing amount of web content that originates in MS Word. For some, blogging tools are approaching this level of usability. For others, in-context editing achieves it. In many cases users get so familiar with a tool that they forget they are using it even if the tool is hard to learn at first (I am reminded of this when my fingers just automatically type the right commands in vi). This usually only happens when you have specialists operating the CMS rather than a distributed authoring model where all the contributors enter their own content.
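To make the versioning point above a little more concrete (a copyeditor or translator who only wants to see what changed since the last pass), here is a minimal sketch in Python. The `Asset` and `Version` names and the `save` and `changes` methods are hypothetical, invented for illustration; they are not the API of any content repository or of the products mentioned in this post. The comparison itself uses the standard library's `difflib`.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Version:
    text: str
    note: str = ""  # annotation, e.g. who changed it and why (hypothetical field)


@dataclass
class Asset:
    """A content asset that keeps every saved version (illustrative only)."""
    versions: list = field(default_factory=list)

    def save(self, text, note=""):
        """Store a new version and return its 1-based version number."""
        self.versions.append(Version(text, note))
        return len(self.versions)

    def changes(self, old, new):
        """Return only the lines removed ('- ') or added ('+ ') between two versions."""
        before = self.versions[old - 1].text.splitlines()
        after = self.versions[new - 1].text.splitlines()
        return [
            line
            for line in difflib.ndiff(before, after)
            if line.startswith(("- ", "+ "))
        ]


if __name__ == "__main__":
    policy = Asset()
    policy.save("Refunds take 30 days.\nContact support with questions.", "initial draft")
    policy.save("Refunds take 14 days.\nContact support with questions.", "legal review")
    for line in policy.changes(1, 2):
        print(line)
```

Line-level comparison is an assumption made to keep the sketch short. A real system would more likely compare at the level of words or structured fields, and would attach the annotations to the comparison view.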
If you are building an application that also needs to manage content, don't just think of the content in terms of CRUD for semi-structured data. Luckily, components and frameworks are available to incorporate into your architecture. The Open Source Web Content Management in Java report covers Alfresco, Hippo, and Jahia from this perspective. Recently, I have been playing around with the JCR Cup distribution of Day's CRX that bundles Apache Sling (very cool!). Commercial, back-end focused products like Percussion Rhythmyx and Refresh Software SR2 certainly play in this area. People used to deploy Interwoven Teamsite for this but I think it is too expensive to be used in this way. Bricolage is an open source back-end only WCM product written in Perl. But accurate preview and content staging can be complicated in decoupled architectures. Drupal and Plone are also quite popular as content centric frameworks for building applications but they tend to dominate the overall architecture (unless you use Plone with Enfold Entransit).
You have plenty of options that will allow you to avoid brewing your own content management functionality. Consider them!
Labels: commentary





11 Comments:
Each of the four things you say content 'has' are external to the content:
- Content has a voice... the person who created the content may be trying to communicate something, but content is inert, and does not 'try' to do anything
- Content has ownership... ownership is a social convention and not inherent to content
- Content is intended for a human audience... same thing as the first point - the intention is in the human, not the content
- Content has context... everything has context, not just content. Context is the environment in which content finds itself, both historically and in the present moment. By definition, context is external to content.
Significantly, if the things that distinguish content from data are all external to content, it follows that content is not inherently distinct from data, but becomes distinct only through our attitudes toward it and the history of its use.
What's that? I can't hear you over the noise of all these servers!
Seriously, the point of this post is to get beyond the logical/physical storage aspects of content (which is what we tend to dwell on as technologists) and focus on what content means to users. Content is the expression and communication of information. This is significant because the tools that manage content need to be designed with an awareness that they will be used to intermediate in a conversation between human speakers and audiences about things that they care about.
The data (as in 010101011010) might be inert but the spirit that is captured and perceived hopefully isn't.
> Content is the expression and communication of information.
So are you saying 'content' is a verb? That doesn't make sense.
What's wrong with saying "content is data *used* for the expression and communication of information"?
Because - from what I'm reading here - content *is* data, just data used in a specific way.
I was just getting all revved up to write a response about the connotations of "data" (scientific, objective, dispassionate, quantitative, point-in-time, graph-able, irrefutable, etc.) but I lost steam because it appears that you and I are the only ones that care about this semantic distinction. I would rather argue whether the word is pronounced "day-tah" or "dah-tah" and I don't much want to do that either.
The purpose of the post was about building systems to help users manage the output of a creative process.
I completely agree with the notion that content is not data. It is still something I have to explain to people who haven't worked with content management systems before, and it is still a little difficult to do.
However, I'll take it one step further and say that code is also not content.
I touched on this same topic briefly in my last blog post.
http://blogs.citytechinc.com/sjohnson/?p=16
Shane
With all due respect, where is this all going?
I remember sitting at lunch at cmf in Denmark with David Nüscheler and talking about what I didn't like about JSR 170. And telling him "it treats content like data"... He didn't quite understand what I meant -- most probably because I wasn't very good at explaining. Thanks for writing that up so much more eloquently Seth ;)
(Btw, that's not to put down the standards -- they're very welcome and the effort is appreciated).
The real question is -- one of the holy grails of content management -- when do content or data become information? And how would a CMS help you achieve that? It takes more than just efficient or interoperable storage.
(I could go on, but this is just a comment ;)
Thank you for a very well written and easily understood article. I have a fairly high IQ, but I've been over to Mr. Downe's blog and ... well never mind.
I think it comes down to architecture. Say you are a company that sells products in stores and online. You have the notion of a product.
This includes attributes such as id, title, and price. These attributes are data in my opinion. They are likely used by multiple systems at the database level.
Then we have attributes such as an html description, jpeg image, and pdf documentation. These things, in my mind, are content. They should not be considered critical to the business. If they were removed and the website brought down, the notion of a product should still exist.
Also, reporting on content just doesn't feel right the way reporting on data does.
Congratulations to Seth for an excellent write-up. Brilliant.
It is very exciting to read the differences between "Content" and "Data" explained (which I of course see as "feature additions" or "improvements").
I think this is a landmark post and I will certainly refer to it in my future presentations. Very good stuff.
Hi Shane,
I think that "Product Information" with Id, Prices etc. is ideal to be stored in a content repository alongside all the other "html", "jpegs", etc...
What information is important to the business is a "business call".
In many projects I witness that developers (instead of "business people") make sort of an arbitrary judgment call of what's important to the business and what not.
Let me give you a recent real-life example that I dealt with at a banking customer of ours. The bank has a CMS and an e-banking application. As usual, they get treated differently from an uptime requirements perspective. This of course is wrong since the "login screen" of the banking app is on a "content page".
Now if your CMS is down, there are no transactions going through, just because people cannot log in, since in real life they will have a bookmark to the login page or even go through the main homepage and click their way to the "ebanking" login.
The same of course holds true (even more so) for the "Product Catalog" application that you bring up.
If the customers can't browse the HTML and JPEGs for your products, it doesn't matter that the shopping cart application is still up and running, since nobody will be able to add something to the cart.
I certainly think that a lot of the hard-structured (traditionally db oriented data) would very much benefit from features like versioning or access control. Product Information is just a very good example where "content services" definitely make a lot of sense.
regards,
david