Painless Data Augmentation with BigQuery | by Austin Poor | Jan, 2021


Quickly Augmenting Your Datasets with BigQuery Public Data

Photo by Lukas Blazek on Unsplash

Google Cloud’s BigQuery is a great tool for data scientists to quickly and easily augment their datasets with external data. Specifically, BigQuery has a listing of public datasets from a variety of different sources. All you need is a Google Cloud account and some basic SQL knowledge.

Here are just a few useful public datasets:

I think one of the most useful among the BigQuery public datasets is the US Census ACS data, which gives multi-year data broken down geographically (by state, zip code, county, etc.).

It has a lot of great demographic information like population (broken down by age, race, gender, marital status, etc.), education levels, employment, income, and much more.

For example, say I wanted to query the total population and median household income for three zip codes in the NYC area. There’s a table called that gives a 5-year estimate of census data for the year 2018, broken down by zip code.

Here’s what my query will look like:

geo_id, -- Where geo_id is the zip code
geo_id in ("11377","11101","10708");

And I can run it in the BigQuery UI…

Screenshot of the BigQuery UI

And get the following results…

Viewing query results in the BigQuery UI

Great! I got my answer in 0.4 seconds and now I can go back and expand my query to get this data for multiple years. Or, I can export the results to a CSV or JSON file to join it up with my data.

Screenshot showing export options for BigQuery results

Finally, as a bonus, you can connect to BigQuery through Python with the following package:

Read More …


Write a comment