November 14, 2004

Whoever has the best library wins

Coming to you straight out of the American Society for Information Science and Technology Annual Meeting in lovely downtown Providence, R.I., is everyone's favorite semi-structured data source, Mr. metametadata himself, the Keeper!

Thank you, thank you. I'm very delighted to be here. I just have two things to say.

A very bright person, J. C. Hertz, who makes a living applying game design principles to information systems for major products, services and learning systems (she's principal of a company called Joystick Nation, Inc.), just made some statements in her plenary session that I took issue with.

These statements are (paraphrased):

any system that requires people to add their own metadata is doomed to fail

My discomfort with this statement is merely linguistic or epistemological. She spoke of "accidental metadata" and offered the classic example of an image caption as accidental rather than, I don't know, maybe intentional? I'm not sure I understand the "accidental" domain. Does she mean information that can be mined from the data object itself? Does she mean contextualizing information that people add as they position their data objects in some information space? Does she mean information that is the result of interaction with data objects? All of these are candidate activities for generating metadata byproducts. In fact, the product of some of these activities really wants to be both primary data and metadata at the same time.

Her strongest distinction was between people providing unasked-for information as they position their data in an information space and information provided because the information space drops a cataloging form in front of people and says, "You have to fill this in before you can submit your data object." I think that in either case the owners of and participants in an information space expect the provider of a data object to also provide metadata. My response to Ms. Hertz is that there just isn't social momentum for making that expectation explicit. There are still strong unspoken social pressures in the information culture that keep most people from dumping a bunch of uncontextualized data into the pool. What metadata specialists do is find ways to work with data providers to take advantage of these pressures and formalize "accidental" metadata offerings to the providers' benefit.

For example, the Metadata Services Unit in the MIT Libraries provides metadata application services for MIT OpenCourseWare. The Unit creates Learning Object Metadata records for OCW resources. MIT OCW is in many respects an electronic publication. Like other publications, it provides image captions. These captions contextualize the image and provide copyright information. There is no official publisher's rule that says you have to caption your images, but most people do. The caption is metadata that just plain makes the image more effective as an information resource. It's in the interest of MIT OCW to provide this caption, and they've mandated its existence in their style guide. They are, like most data providers moving to electronic information delivery, unfamiliar with XML, semistructured data and metadata records. It is natural for them to write a caption into a web page right where they embed the image. This benefits viewers of the web page and also means there's some text about the image that gets indexed for the search engine. The caption in the HTML makes the image more easily discovered. It is the role of the Metadata Services Unit to find this metadata OCW has provided and get it into a Learning Object Metadata record. Why duplicate this information in both data and metadata? Because now, when you search for images and get a list of them back, you get the caption included in the list. It is much easier to capture the caption in a metadata record field and then display that field on the search results page. If you doubt that this is easier than grabbing the caption straight off the HTML page, go try to make sense of the keyword-in-context part of a Google results list. The two metadata record fields presented in a search results page are General.Title and General.Description. Because the Unit put the caption in the description field, patrons have access to that information as they're deciding whether or not to actually view the item. Once again, metadata (same info, different wrapper) aids information retrieval.
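To make the caption-to-record step concrete, here is a minimal sketch of what that harvesting might look like. It is not the Unit's actual workflow. It assumes, purely for illustration, that each caption sits in a `<p class="caption">` following its image, that the image's alt text can stand in for a title, and that a plain dictionary keyed by General.Title and General.Description can stand in for a real Learning Object Metadata record (which would be XML). The class name, function names and sample page are all hypothetical.

```python
from html.parser import HTMLParser


class CaptionExtractor(HTMLParser):
    """Collect image/caption pairs from a page.

    Assumption (hypothetical markup): each caption is a
    <p class="caption"> that follows the <img> it describes.
    """

    def __init__(self):
        super().__init__()
        self.images = []          # one dict per <img> seen so far
        self._in_caption = False  # True while inside <p class="caption">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.images.append(
                {"src": attrs.get("src", ""),
                 "alt": attrs.get("alt", ""),
                 "caption": ""}
            )
        elif tag == "p" and attrs.get("class") == "caption" and self.images:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_caption = False

    def handle_data(self, data):
        # Caption text is attached to the most recently seen image.
        if self._in_caption:
            self.images[-1]["caption"] += data


def to_lom_like_record(image):
    """Map one image to the two fields shown on a results page.

    The dict is a stand-in for a LOM record, not the real schema.
    """
    return {
        "General.Title": image["alt"] or image["src"],
        # Collapse line breaks and runs of spaces inside the caption.
        "General.Description": " ".join(image["caption"].split()),
    }


def records_from_html(html):
    parser = CaptionExtractor()
    parser.feed(html)
    parser.close()
    return [to_lom_like_record(img) for img in parser.images]


SAMPLE_PAGE = """
<h1>Lecture 3</h1>
<img src="fig1.png" alt="Pendulum diagram">
<p class="caption">A simple pendulum.
   Image courtesy of the instructor.</p>
<p>Body text that is not a caption.</p>
"""

if __name__ == "__main__":
    for record in records_from_html(SAMPLE_PAGE):
        print(record)
```

The point of the sketch is the last step: once the caption lives in a named field, a results page can show it directly instead of guessing which nearby text belongs to the image.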

What this seems to point out to me is that there has always been a fuzzy line between data and metadata, and that the information explosion precipitated by hypertext and networking technologies has brought this confusion into high relief. The more stuff that gets into the information space, the faster we seem to need to produce even more. This means that we start demanding more for our metadata buck: it must do more for less effort. When folks start drawing clear lines where none existed before, it's often because they're circling the wagons and abandoning any effort that doesn't seem to realize a return. In this kind of environment there really should not be any metadata "accidents". It's the responsibility of information science professionals to make data providers aware of this and help them form a stronger, more cost-effective relationship with metadata rather than attempt to abandon it altogether.

libraries represent a well-organized, well-funded, well-mandated effort that doesn't scale well

This rounds out the arguments against human analysis and metadata production for the mass of data on the internet. Basically, data providers are lazy and data organizers can't afford to throw enough librarians at the task. I think that both are misunderstandings of the current information economy. We've seen that data providers do provide metadata intentionally and that there are methods to formalize this relationship, cementing the social motivation to provide metadata and realizing an economy.

We can also find the funds to catalog the internet; we've just been looking in the wrong place. Before networking and hypertext it was expensive to get a bunch of data into the world. For ease of distribution a bunch of ideas would aggregate (think gravity) into a book, an easy way to share a lot of information in one package.

It's no longer so expensive to send a bunch of ideas out there so they tend to avoid aggregation (think dark energy pushing infobits apart). The semantic density of the ASIST conference proceedings book seems galactic in scale next to this humble weblog post.

Now this same semantic density scale correlates with the scale of effort on the part of the author. It takes a book author much longer than the 30 minutes or so I've spent throwing off this ephemera. And the scale of reward is similar for the author. But this scale of effort does not hold for the librarian who might catalog either on its way to a library or digital repository. It takes the same 15 to 20 minutes to catalog a book as it does one weblog entry.

This is what Ms. Hertz means. There's enough information in books and serials to justify an international collaborative effort amongst governments, institutions of higher learning and information professionals to provide a sophisticated level of information organization.

The information space that is the internet is screaming for this same level of organization, but the justification for a similar effort on the part of librarians seems absurd. However, if you look at the scale of effort and reward on the part of data providers, it seems apparent that the creators and aggregators of this content could and should shoulder this burden, even if it means hiring the librarians away from universities and public libraries. Librarians need to focus on marketing their service offerings to the data providers wherever they are. This is our entry point, how we get to catalog the world wide web. It's likely to happen piecemeal at first, especially considering the pluralization of data providers in the current information economy. Don't shy away from any project, no matter how small. We also need to find the right social leverage points to encourage good information organization behavior on the part of those who are producing the mass of content, again no matter how small.

Okay, things to say number 2.

Search Wars, Killer Apps and Why Librarians WILL Eventually Rule the World from the ASIST Special Interest Group: Information Architecture listserv (Thanks Boniface Lau)

It seems that the theme of this post is the old adage from Brown and Duguid (The Social Life of Information), "Social change must precede technological change".

I think that librarians get this and are working very hard to improve their ability to effect the needed social change. I think that the computer and internet technology communities are lagging behind. Their response to the need to improve the quality of metadata that accompanies electronic resources is not to leverage the social pressures that will prompt folks to provide this information, but to find some algorithmic, programmatic solution to pull it out of whatever a human sees fit to deposit. If you expect people to be lazy, they will be. If you expect more, they'll surprise you.

I don't think that you can process the amounts of data we're now seeing through human effort alone, and neither can you do it by machine alone. Even Google employs an army of librarians to tweak the algorithms and improve the results of programmatic analysis. I just heard a great presentation at MIT by Gary Marchionini from UNC Libraries about automatic generation of metadata. He thinks that we should be educating a whole generation of young people to be more savvy at information retrieval and interaction, including algorithm maintenance and development. He thinks that we can very easily develop more sophisticated information resource users with higher expectations for information systems, and he promotes something he calls Human Computer Information Retrieval, the collaboration between man and machine to solve the problem of interaction with growing amounts of data.

Did you notice that the article reference is from a listserv sponsored by the group throwing the conference I'm attending?

And yes, I noticed that I'm blogging about information technologies like blogging from a conference about information technologies like blogging (I'm also blogging about the conference itself, from the conference itself) and that this is some sort of metametageekiness. As you should have realized long before now, I'm very comfortable with metameta events and environs, and geekiness.

Posted by Keeper of the Blog at November 14, 2004 04:10 PM