Over 200 terabytes of the government web archived

Posted: Sat Jul 12, 2025 5:43 am
by asimm22
In our December post, “Preserving U.S. Government Websites and Data as the Obama Term Ends,” we described our participation in the End of Term Web Archive project to preserve federal government websites and data at times of administration changes. We wanted to give a quick update on the project — we have archived a heck of a lot of data!

Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data. This includes over 100TB of public websites and over 100TB of public data from federal FTP file servers, together totaling over 350 million URLs/files. Among these are over 70 million HTML pages, over 40 million PDFs and, towards the other end of the spectrum and for semantic web aficionados, 8 files of the text/turtle MIME type. Other End of Term partners have also been vigorously preserving websites and data from the .gov/.mil web domains.

Every web page we have archived is accessible through the Wayback Machine, and we are working to add the 2016 harvest to the main End of Term portal soon. While we continue to analyze this collection, we have posted some preliminary statistics using the new Wayback Machine’s summary interface for this specific collection; these can be found on the End of Term (EOT 2016) summary stats page. Those and additional stats are served via a public EOT 2016 stats API, and the full collection is also available.