Archiving and Logging Your Use of Public Data | by Dan Valenzuela | Oct, 2020
Dealing with the impermanence of public data sets
One fear that I always have when downloading data sets off the web is their impermanence. Links die, data changes, ashes to ashes, dust to dust.
That’s why I’ve been introducing the Wayback Machine into my workflow. But even then, it’s tough to be sure whether I’m downloading data off an archived website or a live website, and it’s tough to understand what I did in the past.
What I’ve done through my work with the Survey of Consumer Finances (SCF) is implement a system of simultaneously archiving and logging the data that I use. Below is a summary of what I’ve done, but if you just want to see the code, scroll to the bottom of this post for a gist of the functions I’ve implemented with respect to the SCF.
Building Off of WaybackPy
The big thing I wanted to accomplish with this project was to make sure that I was using recent Wayback archives as much as possible when downloading data. Else, with any data that didn’t have archives, I wanted to make sure it got archived on Wayback for future use.
The best package I found for this was WaybackPy, though I needed to make a few changes to make it work for my purposes.
First, I needed to implement attributes that would allow me to see the age of the latest archive. This way I could check whether a new archive would be needed for a given age limit.
By way of illustration, if you were to get the len() of a WaybackPy Url object, passing www.google.com as an argument, you’d get 0, which is the number of days since the last archive.
    import waybackpy

    url = "https://www.google.com/"
    waybackpy_url_obj = waybackpy.Url(url)
    print(len(waybackpy_url_obj))
Makes sense given that a new archive is practically generated every day.
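The age check itself can be sketched without WaybackPy. The following is a minimal illustration, not the code from the gist: it assumes the newest archive's timestamp is available as a 14-digit Wayback string (YYYYMMDDhhmmss, treated here as UTC), and the helper names archive_age_days and needs_new_archive are hypothetical.

```python
from datetime import datetime, timezone


def archive_age_days(timestamp, now=None):
    """Return whole days elapsed since a Wayback timestamp (YYYYMMDDhhmmss)."""
    # Assumption: the timestamp is expressed in UTC.
    archived_at = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (now - archived_at).days


def needs_new_archive(timestamp, archive_age_limit, now=None):
    """True when the newest archive is older than the limit, in days."""
    return archive_age_days(timestamp, now) > archive_age_limit


# Fixed "now" so the example is reproducible.
now = datetime(2020, 10, 31, tzinfo=timezone.utc)
print(needs_new_archive("20201030120000", 30, now))  # False: archived yesterday
print(needs_new_archive("20200101000000", 30, now))  # True: 304 days old
```

Passing now explicitly keeps the check deterministic; leaving it out compares against the current time.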
Archiving and Logging a Data Set
With WaybackPy, archiving is the easy part. Just make sure to call the save() method for archives that are older than a certain age, store some key data and you’re good to go.
The harder part is adding that key data to a CSV for further examination. This is especially the case for new archives that cannot be directly accessed from the Wayback Machine.
Below is an implemented function that takes in as arguments the URL where the data you want exists and the target directory where it is locally saved. What it returns is a dictionary containing the URL and timestamp of the latest archive. To examine its behavior let’s look at how it’s used with the SCF.
Implementation with SCF
At the bottom of this post is a complete gist showing how to archive, log, and download SCF data. The process is relatively straightforward:
- Provide the year of SCF data you want, the variables you want, and the target directory where you want the data stored.
- Use the latest archive or create a new archive if the latest is older than your limit. Simultaneously, log the latest archive that’s being used or the new archive.
- Try to use the archive URL to retrieve your data, else use the live URL.
- Convert your data into a Pandas data frame.
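The "archive first, live second" step in the list above could look roughly like the sketch below. It is an assumption-laden illustration rather than the gist's implementation: the function names fetch_bytes and download_with_fallback are hypothetical, and the fetcher is passed in as a parameter so the fallback logic can be exercised without touching the network.

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download a URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(archive_url, live_url, fetch=fetch_bytes):
    """Try the archive URL first, then the live URL.

    Returns (data, source_url) so the caller knows which copy was used.
    """
    errors = []
    for url in (archive_url, live_url):
        if not url:
            continue  # e.g. no archive URL is known yet
        try:
            return fetch(url), url
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all downloads failed: " + "; ".join(errors))
```

Returning the source URL alongside the bytes makes it possible to record whether the archived or the live copy was actually used.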
The log itself behaves in the following ways:
- If the latest archive is within your age limit, the log records its details once. If you re-download that data while still within the age limit, the log will not duplicate the record.
- If the latest archive is over the age limit, the log records the details of your new save, and records another save if you re-download the data. Because Wayback may not show your new archive immediately, you will likely end up with multiple saves and records in your log, which is preferable to losing that save data when Wayback doesn’t work.
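The two logging rules above can be sketched with the standard csv module. This is a simplified stand-in, not the gist's code: the column names in LOG_FIELDS, the log_archive function and its is_new_save flag are all hypothetical choices made for illustration.

```python
import csv
import os

LOG_FIELDS = ["url", "archive_url", "timestamp"]  # hypothetical column names


def log_archive(log_path, record, is_new_save):
    """Append a record to the CSV log; return True if a row was written.

    Existing archives are logged once; fresh saves are always logged.
    """
    row = {field: str(record[field]) for field in LOG_FIELDS}
    has_rows = os.path.exists(log_path) and os.path.getsize(log_path) > 0
    if has_rows and not is_new_save:
        with open(log_path, newline="") as f:
            if any(existing == row for existing in csv.DictReader(f)):
                return False  # this archive is already in the log
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if not has_rows:
            writer.writeheader()  # first write creates the header
        writer.writerow(row)
    return True
```

Under these assumptions, logging the same existing archive twice writes one row, while two fresh saves of the same URL write two rows.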
So if you were to download the SCF data from 2019, 2016 and 2013 today, the last one being downloaded twice, you would expect the log to show 4 records with archive_age_limit = 30, since the 2019 data set is the only one that has an archive within the last 30 days. Below is an example of what you would see.
The log I’ve implemented is definitely useful for my purposes, but I’ve yet to explore fully how this code can break. It’s especially difficult to figure out how to best deal with recently-archived sites that aren’t updated on Wayback immediately, but the “over-logging” behavior described above is better than none.
The other potential issues might be with different file types but the basic archive behavior should stay the same. At the very least this code will work for the SCF’s zip files if you feel so inclined to take a look at it.