Tuesday, July 26, 2005

Wikipedia Data and Anecdotes

I like both data (that is, generalized data from lots of people) and anecdotes (specific data from a few). Here's a bit of both about a topic that I follow with interest--Wikipedia. Why the interest? Well, for one thing, I find it a personally useful resource. For another, I consider it a bit of a bellwether of interactive collaboration.

Here's the home page for the stats. Diving down a bit, there is at least some suggestion that the rate of new-article generation is slowing (at least in English), although I'd hesitate to draw any real conclusions before seeing more months of data.

On the anecdotal side, Tim Bray points out some of the usual problems:
Dave Winer's right, the Wikipedia's article on RSS is a crock. Dave's gripe is that it's "highly political", mine is that it's just wrong: for example, the introductory bit suggests that full-content feeds are impossible. Also, it's badly-organized. Dave's problem is going to be harder to address because RSS itself is highly political; but at least the political narrative should be coherent. Anyhow, it would be nice if someone level-headed were to take responsibility for it. I currently ride herd on two or three other articles and that's all my Wikipedia cycles. It's not as hard as you might think, and here's why: the kinds of people who want to put stupid, irrelevant, badly-written junk in the Wikipedia in my experience are easily discouraged. Just hang in, keep on fixing things they break and explaining why in a calm tone of voice on the Discussion page, and pretty soon they go away.

I'm not sure I fully share Tim's "it will work out" faith. That said, I think it reinforces the view of Wikipedia as a valuable resource--but not a totally dependable one.

Wednesday, July 20, 2005

Some More on Digital Archives

Although not directly on point for the current Internet Archive case, this piece written by Adam Mathes of the Graduate School of Library and Information Science, University of Illinois Urbana-Champaign, has some interesting discussion about archiving software programs for preservation purposes--as well as current exemptions to the DMCA that aid in that effort.
In addition to processing issues, this brings up some of the legal issues involved in the collection. The mere act of copying these digital works, especially for the eventual purpose of enabling access on a different hardware platform, should arguably be considered a fair use. However, if these disks have "copy protection" schemes, even outdated ones that can be bypassed, care must be used to make sure the collection does not run afoul of the Digital Millennium Copyright Act (DMCA). Although recently Archive.org was given an exemption for particularly this reason it presents a considerable barrier and must be dealt with. (1) It may be helpful to amass multiple copies of the works in many formats to further bolster the legal backing to shift and archive the materials.

See also this reference: "Internet Archive Gets DMCA Exemption To Help Archive Vintage Software." Internet Archive. 2003. February 23, 2004.

Friday, July 15, 2005

The Internet Archive continued

In response to my last post, fellow analyst James Governor speculates that perhaps libraries like the Library of Congress might be one possible analogy.

If, for purposes of argument, we consider the Internet Archive a library, that could well grant them some exemptions to the rights conferred to content creators by copyright law. Consider this from Section 108 of the U.S. Copyright Act:
§ 108. Limitations on exclusive rights: Reproduction by libraries and archives
(a) Except as otherwise provided in this title and notwithstanding the provisions of section 106, it is not an infringement of copyright for a library or archives, or any of its employees acting within the scope of their employment, to reproduce no more than one copy or phonorecord of a work, except as provided in subsections (b) and (c), or to distribute such copy or phonorecord, under the conditions specified by this section, if—
(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage;
(2) the collections of the library or archives are
(i) open to the public, or
(ii) available not only to researchers affiliated with the library or archives or with the institution of which it is a part, but also to other persons doing research in a specialized field; and
(3) the reproduction or distribution of the work includes a notice of copyright that appears on the copy or phonorecord that is reproduced under the provisions of this section, or includes a legend stating that the work may be protected by copyright if no such notice can be found on the copy or phonorecord that is reproduced under the provisions of this section.

See also here and here for various pointers about copyright law as it applies to libraries. (By the way, there's nothing in there about the content creator having to give permission or being able to withdraw permission--count another strike against robots.txt having any significance in this case.)

However, as with many things digital, I'm still a bit suspicious of physical world analogs. Not just because of the different nature of the media, but also the nature of the institution. We can all agree that the Library of Congress is a library and that Widener at Harvard is a library and even little Thayer Library in my town of Lancaster is a library. And it's perhaps not too much of a stretch to see the Internet Archive as a form of library. But what if I were to declare my own little web site a library and compile Dilbert cartoons there? (I picked this as an example of content that's posted publicly but only for a limited time.) My guess is that Scott Adams might not approve. Yet, what makes the Internet Archive different in any fundamental way?

Thursday, July 14, 2005

Thoughts on The Wayback Machine Kerfuffle

The Internet Archive, a.k.a. the Wayback Machine, is being sued by a firm called Healthcare Advocates for storing copies of old web pages. (See Good Morning Silicon Valley, for example.) These archived pages are causing the company heartburn in a separate trademark dispute, so it's unhappy. Further, for some reason, the pages were allegedly stored in spite of being flagged with a "robots.txt" file to not be archived, cached, spidered, etc.

The case has generated the predictable throwing up of hands in disgust throughout the online world. As Good Morning Silicon Valley's John Paczkowski succinctly puts it: "Uh, you published that information to a public medium ..." Now I'm certainly sympathetic with the Internet Archive here. At some level, the archiving and caching of publicly-displayed web pages seems almost part of the fabric of the Web and the way it works. However, I'm less convinced than some others that this is Much Ado About Nothing. I preface the following comments and observations with a standard "I Am Not a Lawyer"--and would welcome any on-point case law that might be relevant here.

I think we can all stipulate that web pages and such are copyrighted material and freely displaying them to the public doesn't reduce or eliminate that copyright in any way.

I do agree with John that the robots.txt angle seems a bit wacky.
Why? The robots.txt protocol is purely advisory. It has no legal bearing whatsoever. "Robots.txt is a voluntary mechanism," said Martijn Koster, a Dutch software engineer and the author of a comprehensive tutorial on the robots.txt convention (robotstxt.org). "It is designed to let Web site owners communicate their wishes to cooperating robots. Robots can ignore robots.txt."

Ignoring robots.txt may be bad manners, but it's hard to see the legal significance. (There are perhaps analogs in physical trespass laws--posting your property and the like--but my understanding is that the details of such are typically governed by explicit state and local laws.)
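To make the "purely advisory" point concrete, here is a minimal sketch of how a cooperating crawler might consult robots.txt, using Python's standard urllib.robotparser module. The crawler names ("ExampleArchiver", "OtherBot"), the example.com URLs, and the may_fetch helper are all made up for illustration; I have no idea what the Internet Archive's crawler actually does internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one (made-up) archiving robot is asked to stay
# out entirely, everyone else is asked to skip /private/.
ROBOTS_TXT = """\
User-agent: ExampleArchiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt asks nothing that forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleArchiver", "OtherBot"):
        for url in ("http://example.com/index.html",
                    "http://example.com/private/notes.html"):
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

Note that nothing here enforces anything. The check happens entirely on the robot's side, and a robot that never calls may_fetch (or ignores the answer) can still request the page. That's the sense in which robots.txt is a statement of the site owner's wishes rather than an access control.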

However--and here I perhaps stray into less charted territory--what exactly gives the permission to copy and archive web sites anyway? Certainly, there's no explicit permission like a negative robots.txt file that affirmatively gives the right to replicate, store, transmit, archive, etc. web pages. I suppose the theory is that there is some sort of implicit permission based on custom and social contract. Which seems a rather loosey-goosey state of affairs.

I can't think of any really good analogs here. Yes, I can record TV and radio--but only for my personal use. It's quite well established I can't put those recordings on a server for all to access. Usenet postings might be the most analogous situation; they're now archived as Google Groups and in more fragmentary form elsewhere. However, as far as I know, the legal status of Usenet and other types of online postings doesn't have much case law underpinning it. Furthermore, I think one could easily argue that such postings have a more explicit element of transmission of content out into the world--with the full knowledge that said content will be forwarded and stored for at least some interval--than Web pages, which reside on a controlled site.

Nor can I see the exemplary historical service that the Internet Archive is providing with its activities having any bearing. "Preservation of the past" may be a social good, but it's got little to do with copyright law. After all, Abandonware has the same legal status as any other warez in the absence of the copyright owner's explicit permission to release it into the wild.

From where I sit, robots.txt certainly seems like a red herring in this case--given the lack of laws compelling its observance. But there's a much larger issue of caching and archiving that seems to rest on very sandy foundations.

Monday, July 11, 2005

Podcasting Redux

Podcasting continues to be a beloved trend of the plugged-in elite. I've commented rather dismissively about it before. Since then I've spent more time checking out the various podcasting options--both software and content. Have I revised my opinion? Not really. I still think that there are some fundamental reasons why podcasting won't have the impact of text-based RSS. Which is not to say that podcasting doesn't have merits within a limited scope.

Chris Anderson at The Long Tail gives three reasons why podcasts aren't a big deal (yet).
  1. They don't have internal permalinks to section and subjects, so they don't get much link-love.

  2. They aren't searchable. How hard would it be for some service to run podcasts through a quick-n-dirty voice recognition program to autogenerate transcripts? They don't need to be exactly right; 80% accurate search is better than the 0% we've got now.

  3. They're meant to be consumed linearly, and pretty much at the (agonizingly slow and amateurish) pace they were created. Who, aside from trapped commuters, has time for that?

These comport with my impressions, but it's the linear consumption that's the real killer. David Winer, who wrote the RSS 2.0 specification, described the web as a "skimming" medium on this Steve Gillmor podcast, and you can't really skim audio feeds effectively. As a result, you end up selecting a few favorite programs that you might listen to during audio-friendly periods--which is to say, typically driving in the car. And, guess what, as professional broadcasters like the BBC start putting content on the air (e.g. "In Our Time"), many--probably most--people will largely devote their limited audio-listening minutes to professionally-produced broadcasts. Call this podcasting if you like, but it's really just on-demand radio of the sort you can already record, more crudely, with software like Replay Radio. (By the way, NPR, get with the program!)

So when are podcasts good? I can think of a few things, both based on my personal experiences and things I've read about.

Business uses--for example, a weekly "broadcast" to a sales force. This is sort of a special case of the "content to listen to while commuting/driving."
Interviews with the sort of specialists who don't make it onto mainstream broadcasts in any depth are certainly interesting. Steve Gillmor's and Dan Bricklin's podcasts are good examples of this. Talks at conferences are another good example--as at IT Conversations.

Thus, I certainly don't argue that podcasting is "bad" or useless. I like on-demand listening, and RSS syndication provides a handy mechanism to more easily (if hardly automagically) get updated audio content from favored sources to my car. But I'll continue to argue that it remains a largely peripheral trend rather than the "end of radio."
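For the curious, the RSS "mechanism" in question is just the enclosure element that RSS 2.0 allows on each item. Here's a rough sketch, using only Python's standard library, of how a podcast client might pull the audio links out of a feed. The sample feed, its example.com URLs, and the audio_enclosures function are invented for illustration; a real client would also have to fetch the feed, remember what it has already downloaded, and so on.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed: one item carries an audio enclosure, one is text only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.com/ep1.mp3" length="123456" type="audio/mpeg"/>
    </item>
    <item>
      <title>Text-only post</title>
    </item>
  </channel>
</rss>
"""


def audio_enclosures(feed_xml):
    """Return (item title, enclosure URL) pairs for items with an audio enclosure."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # ordinary text item, nothing to download
        if not enclosure.get("type", "").startswith("audio/"):
            continue  # some other kind of attachment
        results.append((item.findtext("title", default=""), enclosure.get("url")))
    return results


if __name__ == "__main__":
    for title, url in audio_enclosures(SAMPLE_FEED):
        print(title, url)
```

Which also illustrates the skimming problem: the feed tells the client where the audio is, but nothing about what's inside it.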