Thoughts on Content Management and Open Source.

Wednesday, January 25, 2006

Good story about metadata capture

Lars Plougman's story about a lawyer's scheme to work around a document management system points out a huge issue in content management: the natural tension between creators and consumers of content. Content creators create content for their own purposes and don't like to spend the extra effort to add metadata that is only useful to other users and systems. In fact, I often distrust voluntary metadata because, as I mention in an earlier post, users either neglect or abuse metadata. Consumers of content, however, benefit from good metadata because it allows readers to quickly identify content that is useful to them.

It is interesting that the solution in Lars' story uses a third party intermediary to take the content and properly enter it into the system with the correct profile information. The more I think about it, the more I believe this is a good idea, and it actually maps to something that I do at Optaros.

We use Subversion as our source control system and have set up an SVN commits mailing list (more about that here). As I read the checkin emails, in addition to seeing what people check in, I can see who has been naughty or nice with their checkin comments (which are metadata!). If someone gets lazy with the comments, I give them a friendly reminder. Otherwise, it would not be until much later, when we need to roll back some code, that we would realize that the comments were not helpful (I know whoever is reading this is thankful that they don't work with me!). Usually just the knowledge that people are looking makes contributors more meticulous about their comments.
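For what it is worth, the first pass of that policing could be automated. Here is a minimal sketch in Python, not something we actually run: the list of lazy phrases, the three word minimum, and the sample commits are all assumptions made up for illustration, and a real check would still need a human reading the list.

```python
# Hypothetical heuristics for spotting unhelpful checkin comments.
# The phrase list and the word threshold are arbitrary assumptions.
LAZY_MESSAGES = {"", "fix", "fixes", "update", "changes", "wip", "stuff"}
MIN_WORDS = 3


def is_lazy_comment(message):
    """Return True when a checkin comment is empty, generic, or very short."""
    text = message.strip().lower().rstrip(".!")
    return text in LAZY_MESSAGES or len(text.split()) < MIN_WORDS


# Made-up sample data: (author, checkin comment) pairs.
commits = [
    ("alice", "Fixed null check in the login handler"),
    ("bob", "update."),
    ("carol", ""),
]

# Authors who are due a friendly reminder.
reminders = [author for author, message in commits if is_lazy_comment(message)]
```

A script like this only catches the obvious cases. A long comment can still be useless, which is why someone actually reading the commits list matters.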

The most common way to enforce metadata is through input validation (like in the case of the lawyer's document management system). Users hate it and that is the first thing people point to when they talk about how unusable their system is. It strikes me that making content creators write their own metadata is relatively new and maybe it doesn't work. Formal publishing environments typically have authors write stories and editors compose headlines and place stories. Lawyers are usually pampered with admin staff so they historically have been removed from the mechanics of creating documents and metadata (although that is changing). The librarian profession specializes in metadata and organization. Is it unrealistic to expect content creators to manage their own metadata? Or should we return to inserting metadata specialists into the process after content creation, either as police like me with my commits list, or as metadata editors like the secretary in Lars' story? Can we afford to do that? Can we afford not to?

I am looking forward to seeing your comments.


Monday, January 23, 2006

Content Management Problems and Open Source Solutions

[2/15/2008 Update: If you want a more up to date view of the marketplace, consider buying Open Source Web Content Management in Java.]

[2/25/2007 Update: The original whitepaper has been taken down from the Optaros site. It is somewhat out of date but there still seems to be a considerable amount of demand for it. You can now download it here]



[9/29/2006 Update: I have started to add updated reviews on this blog. Here is an updated and more complete review of eZ publish.]



[My apologies.... With the re-release of our website, the back door that allowed access to the white paper without registration has been closed. Fortunately, this white paper has been published under the Creative Commons 2.5 license so there are other copies floating around on the web. Maybe try the Google keywords Seth Gottlieb CMS Whitepaper. I have also seen some of the individual project reviews reposted on SWiK. So, if you like your information free (as in "libre", not "free weekend at a timeshare promotion") you might consider checking these alternative sites.]

I just published an epic whitepaper where I discuss how selecting an open source CMS is different than a proprietary software selection and summarize 15 open source projects:

  • Alfresco
  • Bricolage
  • Drupal
  • eZ publish
  • Lenya
  • Mambo/Joomla
  • MediaWiki
  • Midgard
  • OpenCMS
  • phpBB
  • Plone
  • Roller
  • Twiki
  • TYPO3
  • Zope CMF
The summaries are not to the depth that you would find in the CMS Report (which reviews Midgard, OpenCMS, Plone, and Zope) but they give a high level view of what the project is about.

Here is the abstract:

The open source community has produced a number of useful, high quality content management systems which presents an opportunity to deliver tailored content management solutions without the high licensing or management fees associated with commercially-licensed or hosted software. However, the sheer number of open source CMS projects and the ineffectualness of traditional commercial software selection techniques can make the task of finding the right open source software an intimidating challenge. The strategy of using feature matrices is particularly ill-suited to open source software selection. A more practical approach is to match your needs to a common business problem that others have solved using open source software and engage with the community to learn about their experiences in implementing the solution. Doing so will take advantage of the unique aspects of open source software: the openness of the user community and the transparency of the development process.

The content management use cases that are particularly well served by open source are: informational websites, online periodicals, collaborative workspaces, and online communities. This paper briefly describes some open source projects that have been successfully applied to support these use cases and gives techniques for how to engage with the community. While open source is frequently and successfully used as an alternative to custom development of unique solutions, the use of open source software will be the topic of another white paper or case study.
Feedback is welcome in the comments. If you want more information, feel free to contact me directly.

Migration tools are marketing, not technical, tools

Jeff Potts, in his blog ECM Architect, makes an excellent point about how migration tools are usually over-sold by software marketing organizations. Jeff rightly says:

The "problem" is, Notes/Domino [and I would extend this to any real CMS platform] is often used to develop complex, highly-customized applications. For those, there's probably no getting around a complete re-development effort if they are to be moved at all.

The best one can hope to do is migrate the content, and success will depend on the quality of the content. If the content is well structured, it is often possible to use automated migration logic to map fields from one repository to another. If the content is just free text, the best you can do with automation is to push the content into an equally unstructured system. Of course, then your pristine new system will be as cluttered as your original system - probably more so since much of the organization that was implied through navigation will be lost.
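To make the difference concrete, here is a rough sketch in Python of what automated field mapping looks like. The field names and sample records are invented for illustration and do not come from any particular CMS. The point is only that well structured content maps cleanly, while free text has nowhere to go except an unstructured bucket.

```python
# Hypothetical mapping from an old repository's field names to a new schema.
FIELD_MAP = {
    "doc_title": "title",
    "created_by": "author",
    "body_text": "body",
}


def migrate_record(old_record, field_map=FIELD_MAP):
    """Copy mapped fields to their new names; park everything else."""
    new_record = {}
    leftovers = {}
    for key, value in old_record.items():
        if key in field_map:
            new_record[field_map[key]] = value
        else:
            leftovers[key] = value
    # Anything without a mapping lands in a catch-all that someone
    # will have to clean up by hand.
    new_record["unmapped"] = leftovers
    return new_record


# A well structured record maps cleanly.
structured = migrate_record(
    {"doc_title": "Q4 plan", "created_by": "jsmith", "body_text": "Draft text"}
)

# A free text blob carries over just as unstructured as it started.
unstructured = migrate_record({"blob": "Q4 plan by jsmith. Draft text"})
```

In the second case the title and author are still buried in the text, so the new system knows no more about the document than the old one did.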

Not to scare you, but if you are switching platforms, be prepared for the cost of migrating your data, as well as training and loss of productivity. James Robertson of Step Two sets some realistic expectations in his article Spending Patterns During CMS Implementation. On the bright side, this work often forces critical business decisions and activities that directly improve the way content is managed within an organization. For example, deciding how long to keep content around, eliminating redundant or outdated content, organizing content more effectively, establishing ownership and processes for content.... If you want to be successful in this initiative, you would do this work anyway. Otherwise many of the problems that hampered the original system would be ported to the new system where they would be equally problematic.

Sunday, January 22, 2006

CM Professionals Gets Two New Board Members

CM Professionals recently held elections to replace two outgoing members of the Board of Directors (Ann Rockley and Frank Gilbane). The two new board members are Scott Abel and Mary Laplante. Both Scott and Mary have a long history of contributing to CM Professionals and I am looking forward to working with them as board members. I also want to recognize Ann and Frank for their great leadership and dedication to the organization.

The CM Professionals management committee also got some new blood in this year's election. Janus Boye is our new Director of Member Relations and Mollye Barrett is the new Director of Communications. Welcome Janus and Mollye!

Friday, January 20, 2006

US pays to make open source safer

The U.S. Department of Homeland Security has recently pledged one million dollars (read in a Dr. Evil voice) to fix security bugs in open source projects like Linux, Apache and Mozilla. Stanford University and Symantec are going to do the work. On the one hand, I think that is a nice (but token) gesture of support for open source as a national (dare I say planetary?) asset. On the other hand, from a security standpoint, I would say that open source already has a distinct advantage over proprietary software because there are more people looking at the code and its flaws ("Given enough eyeballs, all bugs are shallow"). For example, I would not vote on an electronic voting system whose source code was not exposed to public scrutiny. So why single open source software out? I wonder what the government can do to make proprietary software more reliable and secure because, if you look at the security alerts, that is where the majority of problems seem to be. Then again, in light of the recent trends regarding privacy, I am not sure that I want the government to have any influence over code that I cannot see.

Wednesday, January 18, 2006

Two very good posts on the death of Enterprise Software

Joe Lamantia recently posted two excellent articles on how "Enterprise Software" is losing touch with the real business problems and being displaced by more agile, targeted technologies. You can read the posts here and here. I could not have made these points better myself (although if you have been reading this blog for a while, you know that I have tried).

Browser support

Someone recently pointed out to me that my blog layout does not display correctly on MSIE6 and I am embarrassed to say that I didn't check the other browsers when I applied the new style. I should have because 29% of my last 100 visits have been by people using IE6. In Steve Zimmerman's defense, the problem is not really with the template. The non-wrapping code samples in my "ZOracle" posts make the right column shift down to the bottom where it is safe. The Gecko engine, used by Mozilla and Firefox, allows the code samples to run over the right column. For now, I think I am going to leave it. MSIE users will have to scroll down to the bottom for the navigation until the ZOracle posts fall out of scope. Until then, have I ever told you about a great little browser called Firefox?

Tuesday, January 17, 2006

Managing Metadata

I just read this post by Stefano Mazzocchi that discusses the difficulty of merging metadata.


One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: everytime we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

Stefano speaks mainly from a Semantic Web perspective but his observations are very relevant to content management and aggregating content from multiple sources. Right now the general business world is very far behind the community in which Stefano works (librarians, which you could say are metadata professionals). Our users struggle to invest any time in authoring good metadata. But once we finally get them to truly focus on the metadata (or automate them out of the process), hopefully library science and the semantic web will have solved the issues and nuances that arise when you have good metadata and are ready to really use it.

Thursday, January 12, 2006

Karim Lakhani on The Mozilla Corporation

Karim Lakhani wrote an interesting article about the Mozilla Corporation, a wholly owned, "taxable," subsidiary of the Mozilla Foundation. Karim is serving on the advisory committee to help Mozilla figure it all out. The article describes how Mozilla Corporation has an opportunity to become a new kind of software company focused on "social responsibility, profitability and community purpose." Of course, corporate buyers will focus on the price and quality of the product, but the social/community angle will help foster good will and contribution from the development community. Who knows? Maybe the Mozilla Corporation will become the Ben and Jerry's of software.

Wednesday, January 11, 2006

Goodbye ZServer, Hello Twisted

Eric Shea just passed on this article describing how Zope3 is dumping the old ZServer and going with The Twisted Framework for its web server.

For those of you unfamiliar with the Zope platform, ZServer is ancient, slow and not too secure. The security part is not so much of a problem because any serious Zope 2.x installation is going to sit behind an Apache web server. Twisted, on the other hand, is a high performance framework for building Python applications. We did a prototype based on Twisted as part of a proposal for a very high traffic web service and our experience with it has been very positive. You may also notice the last name of one member of the Twisted project team (Lefkowitz). Glyph is the son of r0ml, our former VP, Research and Executive Education. I am pretty sure that the Lefkowitz family enforces a strict nickname policy.

New look

I was planning to update the look of this site as a surprise for my 100th post (not too far away). However, I just couldn't bear the look of it anymore. My colleague, Steve Zimmerman, a great designer but hesitant author (I am still working on him), volunteered his template because he couldn't stand my "brand" either.

Let me know what you think.

Monday, January 09, 2006

GSA: Meta-data not essential for search

A participant on the iECM mailing list recently posted a link to an article about the GSA concluding that metadata is not essential. The study, based largely on information from industry experts, found that search technology is good enough that full text indexing is sufficient and no manual human intervention is necessary.

I am sure that my taxonomist friends are working up a worthy rebuttal. But let's just consider the proposition that, at least in the case of normal textual information like this blog entry, manual keyword assignment is not essential. Of course, as the article states, this does not apply to graphical content or numerical data which cannot be parsed into words that would match a textual search query. But in the case of text, is it reasonable to assume that the author will, in the course of writing, wind up using words that a prospective searcher will search for? There is the issue of synonyms, word choice, and word stems, but that can be accounted for in a good search algorithm (when the query contains "blog," also look for "web log" and "journal"; when the query is "running," look for "run"). Google seems to do a good job.
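The synonym and stemming idea can be sketched in a few lines of Python. This is a toy illustration only: the synonym table holds just the example from this post, and the stemmer is a crude suffix stripper that I made up, nothing like the algorithms a real search engine would use.

```python
# Toy synonym table using the example above; a real system would
# load a much larger thesaurus.
SYNONYMS = {"blog": ["web log", "journal"]}


def stem(word):
    """Crude stemmer: strip one common suffix, then undouble a final consonant."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "running" -> "runn" -> "run"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def expand_query(query):
    """Return the set of terms to search for, given the words the user typed."""
    terms = set()
    for word in query.lower().split():
        root = stem(word)
        terms.update([word, root])
        terms.update(SYNONYMS.get(word, []))
        terms.update(SYNONYMS.get(root, []))
    return terms


blog_terms = expand_query("blog")        # {"blog", "web log", "journal"}
running_terms = expand_query("running")  # {"running", "run"}
```

None of this requires the author to tag anything. The work shifts from the person writing the content to the software answering the query.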

Interestingly, the commercial search engines all ignore keyword tagging because it is so often abused. I am reminded of the Extreme Programming philosophy about commenting your code (at least at the method level). The code itself should be clear about what it does without explanation by comments. The need for comments is a symptom of overly complex and, therefore, hard to maintain code. I re-read this post and it contains every keyword that I would have used to classify it (search, query, metadata, GSA).

To be honest, most clients I have worked with have either neglected or abused keywords. Either they don't understand the value of keywords and don't bother, or they try to game the system so that their content gets put in visible places (yes, this even happens on corporate intranets).

So what if we said to our authors what the commercial search engines tell us: "Don't worry about meta-data tagging, just write good content and we will bring you the right readers." Where would we be?

But metadata is not just keywords. Look at the basic Library of Congress search page. See how you can search on different metadata fields to get what you want? Metadata also helps content reuse. For example, if the title, summary, author, and other attributes of content are stored in a structured way, they can be shown on pages that list many content assets, not just the detail page. A 50 word summary is more valuable than the first 50 words of a 10,000 word document (unless the author is especially good at getting to his point. I noticed that in this entry I led in talking about iECM, which this post has nothing to do with). Structuring a portion of content also helps with things like sorting (as in by date, author, etc.).
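Here is a small Python sketch of what structured attributes buy you on a listing page. The ContentItem fields and the sample items are assumptions invented for illustration, not the schema of any particular CMS.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ContentItem:
    # Hypothetical structured attributes stored alongside the body text.
    title: str
    author: str
    published: date
    summary: str
    body: str


def listing(items, sort_key="published", reverse=True):
    """Build one line per item for a listing page, sorted on a metadata field."""
    ordered = sorted(items, key=lambda item: getattr(item, sort_key), reverse=reverse)
    return [
        f"{item.title} ({item.author}, {item.published.isoformat()}): {item.summary}"
        for item in ordered
    ]


# Made-up sample content.
items = [
    ContentItem("First post", "Baker", date(2006, 1, 9), "A short summary.", "Long body text"),
    ContentItem("Second post", "Adams", date(2006, 1, 17), "Another summary.", "Long body text"),
]

newest_first = listing(items)
by_author = listing(items, sort_key="author", reverse=False)
```

Without the structured fields, the listing page would have to guess at a title and fall back on the first few words of the body, and sorting by author or date would not be possible at all.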

Metadata is what content management is. To quote a recent CMS Watch article by MarkLogic CEO Dave Kellogg:

That is, while ECM tracks and manages a lot of information about the content, it actually does relatively little to help get inside content. Despite its middle name, ECM today isn't really about content. It's about metadata.
Without metadata, an ECM is just a file system with versioning.

So, it looks like authors are not off the hook. Interestingly, in the library world, the people who write the metadata are different from the people that write the content. Unfortunately this is too costly for most corporate environments that casually create and use content and don't have the budget for a full time librarian staff.

Friday, January 06, 2006

From Vignette to OpenCMS

Apoorv Durga on his PCM Blog has a nice post about a migration from Vignette to OpenCMS. The overall project went well and the client was pleased with the results. Apoorv also points out that some features are missing from OpenCMS, most notably large site features such as replication and backup. For example, in Vignette, you have a staging server where you manage content and a production server where you display content. OpenCMS does not have that, so you have one server (or cluster) doing both content management and delivery. This may have security and (in extreme cases) performance implications (although with caching turned on and clustering, it is likely not an issue).

If content syndication is important, you might try Magnolia which has a subscriber model that allows an authoring server to publish content to display servers. Of course, Magnolia is missing many of OpenCMS's advanced features such as versioning and workflow. But if those are not important (frequently people think they need these things more than they actually do), you should give Magnolia a look. It uses a JCR (implemented by The Apache JackRabbit Project) and that is not something that you see in many commercial products.

Making Sense of the ECM Market

Alan Pelz-Sharpe has a nice article on CMS Watch about trends in the ECM market. In this article, Alan (I think rightly) points out that ultimately the large infrastructure plays (Microsoft, EMC, IBM) are going to gobble up the space "managing unstructured data just the same as they currently manage structured data." The article goes on to say that acquisition may not be a bad thing because it may lead to more investment into the platform and better support. That is a likely scenario unless one of these giants acquires multiple platforms and lets all but their favorite wither away (for a while this is what was happening at divine when it had both Content Server and Participant Server).

One of the most insightful observations in the article is that the impending roll-up may lead the large ECM players, in anticipation of being acquired one day, to ignore the small and medium enterprise because these clients will not mean anything to the potential acquirers. Rather than risk being underserved by the large ECM vendors, Pelz-Sharpe suggests that small and medium enterprises consider Best of Breed products, including Open Source, that can "solve pressing problems in a simpler fashion."

Good point. I hadn't thought of that.

Wednesday, January 04, 2006

An Evening with Joomla's Mitch Pirtle

Last night BostonPHP hosted an evening with Mitch Pirtle of Joomla! fame at our Boston office. Mitch is a great speaker and his passion for Joomla! and open source really came through in his presentation. While most of the talk turned into an introduction to Mambo, Mitch did weave in some good background about Joomla!'s break from Mambo and where Joomla! is going. He promises to come back to go into more detail about the technology behind Joomla!.

First, some details about the split. Mitch related his experience of sitting in a conference booth representing Mambo and then seeing an open letter announcing Miro's formation of the Mambo foundation without involving the core development team. Just at that moment, he saw Eben Moglen from the Software Freedom Law Center coming up the escalator. Eben and the SFLC were critical in guiding the Joomla team through a process littered with potential land mines. Joomla! also received support from VA Software, who donated hardware, software (SourceForge Enterprise Edition) and hosting services for the new Joomla Forge. Rochen donated hosting for www.joomla.org and, before long, Joomla! and Open Source Matters, the holding organization for Joomla!, were born.

According to Mitch, nearly all of the core development team and much of the community, as well as many third party component developers, have shifted over to the Joomla! side. Today the Joomla! project is thriving. The forums already have over 160,000 posts and are growing at a pace of 1,100 per day. There are already 11,000 registered developers and 700 projects on Joomla Forge (just like the big SourceForge, many have not been started yet). One interesting project under way is to put a Joomla! front end on SourceForge using the new Web Services API. VA is helping with the initiative.

Packt Publishing, which sells Building Websites With Mambo, is planning on publishing a similar Joomla! book. I have not recently talked to the Mambo team, which has reloaded their core team with new developers, but it does seem like Joomla! has the momentum of the two projects.

Mitch talked a little about the new 1.1 release (due out soon). The key advancement of 1.1 will be full UTF-8 support. This feature trumped a bunch of other items on the roadmap because of the urgent need for supporting an extended character set. While the team was working in the core, they couldn't resist doing some deep refactoring and modernization of the code. Thanks to better code design, the new version of Joomla! is expected to be faster. Joomla! 1.1 also introduces the first steps of a database abstraction layer which will make it easier to run Joomla! on databases other than MySQL (Postgres will be supported soon with commercial databases like Oracle and SQLServer soon after). Templates will use the templating engine patTemplate rather than simple PHP. Great, another tagging syntax to learn! 1.1 also brings a more sophisticated error handler.

People hoping for user facing features like fine grained security, workflow, and versioning will have to wait for version 1.2 which looks like it is going to be huge. Interestingly, a lot of this code has already been written but Joomla! is being conservative about how much new stuff to release at a time to reduce the pain of upgrades. Support for PHP 5 will have to wait for Joomla! 2.0, a total rewrite which will probably be based on the Zend Framework to which Mitch's company JamboWorks is contributing a security module.

I asked if separating from Mambo gave the project more freedom to extend and modernize the application and Mitch did say that backwards compatibility with earlier versions of Mambo was a constraint that they are happy to be released from. I don't know if the team would have done the level of refactoring that they did if they were worried about a migration path. Also, I am sure that the feeling created by starting something new energized the team.

During the talk, there were some references to some useful Joomla! resources:

  • The API site was unveiled. This site uses the PHP library tool phpDocumentor to automatically generate documentation based on comments (like Javadoc). The comments are a little thin right now but now that the API site is up, there will be more incentive to write good comments.
  • The JamboWorks Template Club is a membership based service which gives access to a collection of pre-made templates. There are currently 12 templates on the site and a new template will be added every month. There were some great examples in the demo. $75 per year gives you access to the whole collection.
  • Boston PHP has just published a release candidate of josCommerce which ports the popular mosCommerce eCommerce component to Joomla.