Google’s Digitization Project

I began thinking about digitization after recently hearing a presentation about the National Archives and Records Administration’s Electronic Records Archives, their huge project to design a system to capture and retain digital records produced by the federal government. The initiative (referred to as ERA; apparently the acronym is back up for grabs in this post-feminist world) is very interesting, and it has archivists worrying about what the essential characteristics of each record type are and how the system can be both massively scalable and simple enough to be easily reverse-engineered if found in some distant future.

Of course, the flip side of retaining and preserving documents which originated in digital formats is the ongoing effort to digitize existing print material. Google is one of the leaders in this area with the project they announced in 2004 to scan the library collections of Harvard, Stanford, the University of Michigan, the University of Oxford, and The New York Public Library. I began to wonder where the project stood today, as I hadn’t heard much about it since 2004. Since the announcement, the University of Michigan has set up this webpage with plenty of information about the project, and in recent months President Coleman has been vocal in defending the project’s merits, presumably in response to the hostile reception it has received from the big publishing houses. I thought it cast the University in a familiar role: it famously and stubbornly took the Affirmative Action issue to the Supreme Court, so I can see them taking quite a stubborn line on this issue once they’ve decided it is the Right Thing. After all, how far can you really stick your neck out with a multi-billion-dollar endowment and plenty of free PR to gain?

The politics of the University’s principled positions aside, I wondered exactly what’s going on in Ann Arbor with the project. This article, published in something called the Book Standard, describes some of the machines Google might be using to scan the U-M library. The company announced they intend to scan the library’s approximately 7 million volumes in just six years. Imagining a secret warehouse of exotic scanning machines manned by dozens of workers sworn to secrecy, I set out to calculate just how many machines Google might be operating in some Ann Arbor warehouse (or, to speculate, the Buhr Shelving Facility).

Here are the numbers: if there are 7 million volumes and each contains on average 250 pages, that comes to about 1,750,000,000 pages. To finish in six years, Google would have to scan 799,087 pages each day, or 33,295 each hour around the clock. According to the Book Standard, the fastest scanning machine in the world can do 3,000 pages an hour, meaning Google would only need about 11 machines. If they ran them 12 hours a day they’d need about 22. At $225,000 each, 22 machines comes to around $5 million. Although not an insignificant number of machines, it’s certainly a far cry from the vast warehouse of my imagination.
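For anyone who wants to play with the assumptions, here is a minimal Python sketch of the same back-of-the-envelope estimate. The function and parameter names are my own invention for illustration, and the inputs are just the figures quoted above (7 million volumes, 250 pages per volume, six 365-day years, 3,000 pages per hour, $225,000 per machine), not anything Google has confirmed.

```python
import math


def estimate_scanners(volumes, pages_per_volume, years,
                      pages_per_machine_hour, hours_per_day):
    """Rough estimate of how many scanners a digitization project needs.

    All inputs are assumptions supplied by the caller; leap days,
    downtime, and page-turning hiccups are ignored.
    """
    total_pages = volumes * pages_per_volume
    pages_per_day = total_pages / (years * 365)
    pages_per_hour = pages_per_day / hours_per_day
    machines_exact = pages_per_hour / pages_per_machine_hour
    return {
        "total_pages": total_pages,
        "pages_per_day": pages_per_day,
        "pages_per_hour": pages_per_hour,
        "machines_exact": machines_exact,
        # You cannot buy a fraction of a scanner, so round up.
        "machines_whole": math.ceil(machines_exact),
    }


# Figures from the post: scanning around the clock vs. 12 hours a day.
around_the_clock = estimate_scanners(7_000_000, 250, 6, 3_000, 24)
half_days = estimate_scanners(7_000_000, 250, 6, 3_000, 12)

PRICE_PER_MACHINE = 225_000  # dollars, the per-machine price quoted in the post
cost_of_22 = 22 * PRICE_PER_MACHINE

print(f"Total pages: {around_the_clock['total_pages']:,}")
print(f"Pages per day: {around_the_clock['pages_per_day']:,.0f}")
print(f"Machines (24 h/day): {around_the_clock['machines_exact']:.2f}")
print(f"Machines (12 h/day): {half_days['machines_exact']:.2f}")
print(f"Cost of 22 machines: ${cost_of_22:,}")
```

Strictly speaking, the exact ratios come out slightly above 11 and 22, so rounding up to whole machines gives 12 and 23. That doesn't change the conclusion: the fleet is small either way.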

Here’s where you come in, fine readers: who can send me some smuggled photos (or observations) of the Google machines at work?

Author: Rob Goodspeed


  1. I just happen to know someone on the project, and they are quite secretive themselves. From casual conversation I have the following vague tidbits:
    1. Google has recently moved from their small rental space on campus to a dedicated warehouse outside of downtown.
    2. The number of machines is approaching 100 (seriously, $5 mil is a drop in the Google bucket… you think they’d do this on the cheap?).
