Article Review #1: Why It’s So Important to Understand What’s In Our Web Archives


This is a review of the article Why It’s So Important to Understand What’s In Our Web Archives by Kalev Leetaru, published online at

The article discusses the trade-offs in trying to “…preserve an infinite stream with finite resources.”  How do you prioritize web-based content?  How much should you aim to preserve?  How often should you preserve content?  What features of the site itself should be preserved?  How should changes or corrections be distinguished from the original sources?  Or is that even important at all?

According to the author, the biggest obstacle facing those who would use archived web content is the lack of transparency about how archives are built: the heuristics of their crawlers, their handling of cookies, and their inclusion criteria.  This matters because some biases are visible only when an entire dataset is analyzed, and only analysis at big-data scale provides a wide enough lens to detect them.  In addition, the metadata can provide clues that surface only when the data is used in novel ways, outside of the original architect’s intentions.

The author likens this lack of transparency to reading a book that contains no index.  You can read the content provided to you, but you have no easy way of ascertaining what is (or isn’t) included in that content.

The author’s answer to this problem: open-source the archive’s infrastructure code and let the “community” contribute its own expertise and time to find, fix and add to the current structure and metadata, relieving a mammoth drain on the archive’s existing staff and resources.

While I think these private/public collaborations can work in some circumstances (most notably efforts by the International Astronomical Union and Zooniverse), the successful ones tend to offer user-friendly tasks that don’t require coding expertise or technical training.  In asking for help with metadata structures and analysis, I believe the author is hoping for too much.  Such efforts would be neither consistent nor reliable, and that’s only if you could generate enough public interest in your project in the first place.  I believe that institutions either have to face the fact that they will need to pay for this particular kind of expertise and HIRE people to do the work, or else put more time and effort into developing automated systems that can ease the burden on existing staff.

Outsourcing can be good for some things but should never be used as a catch-all for areas that require real professionals to do the work.

