“The Internet Archive turns 20 years old next year, having archived nearly two decades and 23 petabytes of the evolution of the World Wide Web. Yet, surprisingly little is known about what exactly is in the Archive’s vaunted Wayback Machine. Beyond saying it has archived more than 445 billion webpages, the Archive has never published an inventory of the websites it archives or the algorithms it uses to determine what to capture and when. Given the Archive’s recent announcements of new efforts to make its web archive accessible to scholarly research, it is critically important to understand what precisely makes up this 445-billion-page archive and how that composition might affect the kinds of research scholars can perform with it.”
Kalev Leetaru discusses this in more detail here.