A common belief holds that once something gets onto the Internet, it’s there indefinitely. This is true, no doubt, for some things. Does anyone believe the Paris Hilton sex video will ever not be out there?

Not everything does last, however, including some creative works that matter a great deal. Many of the works we should want to keep around disappear all the time, for a variety of reasons.

With work that’s born digital, this need not happen. Blogs are a perfect example of digital material that need never be lost, and it would take a relatively small number of people in the right positions to jump-start this idea.

Consider, for instance, the “place blogs” that have become a valuable part of local lore, sites on which citizens talk about what’s happening in their own communities. These blogs depend, for the most part, on volunteer efforts. As anyone who works with volunteers knows, however, their ardor for the task tends to flag over time. Bloggers start and stop, and when a blogger gives up, his or her work can disappear. Links die along with what they’ve created.

At the Library of Congress this week, where members of a workshop were discussing how to preserve digital news in this networked era, there was little dispute that blogs are serving an expanding role in the news ecosystem. But as an archivist for the state of Wisconsin said, he has enough trouble keeping an archive of community newspapers without having to deal with the place blogs that have sprung up in town after town and city after city.

The blogging software I use, WordPress, has a Tools menu in the administration settings. Among other things, I can import, export and upgrade the blog. When I export, WordPress saves a full archive of the blog — everything I’ve created and uploaded online — into a package that I can move to another WordPress blog or even a site created with competing software.

Suppose we could convince the makers of all common blogging software platforms to expand this option, by giving users the ability to easily send what they’ve been doing, ideally on a regular schedule, to the Internet Archive, Library of Congress and/or other repositories willing to save these collected works.
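To make the idea a bit more concrete, here is a minimal sketch in Python of what a scheduled "send to the archive" step might prepare. Everything in it is an assumption for illustration: the function names, the manifest fields and the weekly interval are hypothetical, and no real WordPress or Internet Archive API is used. It only describes an export file (size, checksum, timestamp) so that a repository could verify what it received; the actual upload is left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_deposit_manifest(blog_url: str, export_bytes: bytes, when: datetime) -> dict:
    """Describe one export so an archive could verify and file it.

    The field names are hypothetical, not part of any existing standard.
    """
    return {
        "blog_url": blog_url,
        "exported_at": when.astimezone(timezone.utc).isoformat(),
        "size_bytes": len(export_bytes),
        "sha256": hashlib.sha256(export_bytes).hexdigest(),
    }


def is_export_due(last_export: datetime, now: datetime, interval_days: int = 7) -> bool:
    """Return True once the chosen interval has passed since the last export."""
    return (now - last_export).days >= interval_days


if __name__ == "__main__":
    # Stand-in for the file a blogging tool's export feature would produce.
    export = b"<rss><channel><title>My place blog</title></channel></rss>"
    manifest = build_deposit_manifest(
        "https://example.org/blog", export, datetime.now(timezone.utc)
    )
    print(json.dumps(manifest, indent=2))
```

A blogging platform could run something like `is_export_due` on a timer and, when it returns True, hand the export plus its manifest to whichever repository the blogger has chosen.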

The blogger should be able to select from a number of archiving options. For instance, I’d suggest a setting under which the blogger could tell the archivist that if the blog went “off the air” the archivist could restore it to the Web (albeit under a different URL hierarchy in most cases). Another useful element of this auto-save system could be as a way to have a rescue plan after a data loss. An export to the archive should also offer the possibility of an import from it. (Would we need an option to let the user remove the material from the archive if he or she decided it should not remain public? I’d guess we would, and should.)
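The options described above could be modeled as a handful of per-blog settings. The following Python sketch is only one way to express them; the class, field and function names are all hypothetical, and the rule that a removal request overrides everything else is my assumption about how such a system ought to behave, not something any archive has agreed to.

```python
from dataclasses import dataclass


@dataclass
class ArchiveOptions:
    """Hypothetical per-blog settings a blogger might choose."""
    restore_if_offline: bool = False  # archive may republish if the blog goes dark
    allow_import_back: bool = True    # blogger can pull the archive back after data loss
    removal_requested: bool = False   # blogger has asked that the material be withdrawn


def archive_may_republish(options: ArchiveOptions, blog_is_online: bool) -> bool:
    """Decide whether the archive may put the blog back on the Web."""
    if options.removal_requested:
        # Assumed rule: a removal request always wins.
        return False
    return options.restore_if_offline and not blog_is_online


def blogger_may_restore(options: ArchiveOptions) -> bool:
    """Decide whether the blogger can import the archived copy back."""
    return options.allow_import_back and not options.removal_requested
```

With these defaults, nothing is republished unless the blogger opts in, which seems like the safer starting point.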

We could make this happen if a small group of people agreed on some basics. The conversation would need to include blog software companies including WordPress, Movable Type, etc., plus potential storage services including the Internet Archive, the Library of Congress and others.

Assuming we could make this happen, the next step would be to lobby bloggers, to persuade them that saving their work to public archives would be a good idea. They could know that their work, if they chose, would be around for some time.

Any site running on a reasonably standard content-management system could be made to work this way, though the more customized the site the harder it may be. And eventually we’d want to have the big database folks — talking to you, Oracle (especially now that you’re going to own MySQL; yike) — in the conversation.

There are other ways to go at this. A workshop group led by Vijay Ravindran, chief technology officer at the Washington Post, came at the overall issue from the “pull” side of the ledger — with an ear toward the demands of the traditional media companies that will cede even a small amount of control over their content about a month after hell freezes over. They suggested a much better system of website notifications (using HTML tags) to notify crawlers that use robots.txt of what’s available in what ways. This would definitely be an improvement over what we do now, but only a partial solution. We want to preserve the entire hierarchy of the site along with everything that’s appeared on it, in full.
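For context on the "pull" side, this short Python sketch shows roughly what a crawler can learn from robots.txt as it exists today, using the standard library's `urllib.robotparser`. The sample rules, the `archive-bot` user agent and the example URLs are made up for illustration. The point is that the answer is only yes or no per path, with nothing about the terms on which content is available, which is presumably the gap the workshop group's proposal was aimed at.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for an imaginary site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/2009/some-post", "https://example.org/drafts/unfinished"):
        print(url, crawler_may_fetch(SAMPLE_ROBOTS_TXT, "archive-bot", url))
```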

Nothing I heard in Washington begins to solve a more interesting problem, which is that so much of what we do (as opposed to view) these days comes from hyper-dynamically generated pages. Look at Everyblock, for example. How can we archive the various pages that users create on the fly? (Should we? My belief is that, yes, we should know for posterity, in at least an aggregate sense, what was being created by the people who used the site.)

Right now, though, the blog idea seems like the low-hanging fruit. I’m getting in touch with Brewster Kahle at the Internet Archive to see if he’s interested, along with several of the folks who make the blogging software, and will let you know what they say.

The bottom line for all this is, I hope, obvious. If you are creating things, you should not just own them, but preserve them. This is one way to keep our work alive.


4 Responses to “Archiving the News: Auto-Preserving Blogs”
  1. I would love to have an “Export to Internet Archive” button on my blogs. The Archive was very helpful to me in reconstructing large portions of my blog after I impetuously “disappeared” it several years ago. I have just now begun the project of re-entering and posting the remaining 40% of posts that were not captured by the Archive. . .

    The implication for anthropologists, historians and other students of culture is huge. But also — wouldn’t it be amazing to be able to read your grandmother or grandfather’s blog?

  2. Maureen Pennock says:

    Hi there,
    Sounds like you’d be interested in our ArchivePress project – it’s a really simple solution to archiving blog content. More info on our website (and blog of course) at http://archivepress.ulcc.ac.uk/
    Maureen.

  3. dan says:

    You wouldnt believe how long ive been searching for something like this. Went through 5 pages of Google results couldnt find diddly squat. First page of bing. There you are!…. Really have to start using it more often!

  4. [...] last year’s digital preservation meeting I suggested that we needed better ways to do our own archiving of blogs and other social media. I still believe the Library of Congress, Internet Archive and [...]

Mediactive by Dan Gillmor is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
Permissions beyond the scope of this license may be available at http://mediactive.com/cc