Generating Mock Data with Python! (NumPy, Pandas, & Datetime Libraries)
In this video we write a Python script to automatically generate a sales dataset. To do this we use the NumPy, Pandas, Calendar, & Datetime libraries. This is ultimately the data that we used in my last video “Solving real world data science problems with python pandas”.
Link to the last video:
Link to finished code on GitHub:
Detailed video description!
We start by creating a simple dataframe and programmatically adding rows of product purchases to it. We use the random library to select these products.
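As a rough sketch of that first step (not the exact code from the video), rows can be collected in a list and turned into a dataframe at the end. The product names, prices, weights, and column names below are made-up placeholders, and `generate_orders` is a hypothetical helper name.

```python
import random

import pandas as pd

# Hypothetical catalog: product name -> (price, relative weight).
# A higher weight means the product gets picked more often.
PRODUCTS = {
    "USB-C Charging Cable": (11.95, 30),
    "Wired Headphones": (11.99, 20),
    "27in Monitor": (389.99, 5),
}


def generate_orders(n_orders, seed=None):
    """Build a dataframe with one randomly chosen product per order."""
    rng = random.Random(seed)
    names = list(PRODUCTS)
    weights = [PRODUCTS[name][1] for name in names]

    rows = []
    for order_id in range(1, n_orders + 1):
        # random.choices does weighted sampling; k=1 returns a one-item list.
        product = rng.choices(names, weights=weights, k=1)[0]
        rows.append(
            {
                "Order ID": order_id,
                "Product": product,
                "Price Each": PRODUCTS[product][0],
            }
        )
    return pd.DataFrame(rows, columns=["Order ID", "Product", "Price Each"])


orders = generate_orders(1000, seed=0)
print(orders.head())
```

Calling `orders.to_csv("test_data.csv", index=False)` afterwards would save the result as a csv (the filename is just an example).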
We make our data more realistic by using NumPy's normal and geometric distributions to vary the number of purchases we generate and the quantity of each item purchased.
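Here is a minimal sketch of how those two distributions could be used. The mean, standard deviation, and `p` values are assumptions picked for illustration, and the helper names `orders_in_month` and `quantity_ordered` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)


def orders_in_month(mean=12000, std=2000):
    """Draw a purchase count from a normal distribution (assumed mean/std)."""
    # rng.normal returns a float, so truncate it and guard against negatives.
    return max(0, int(rng.normal(loc=mean, scale=std)))


def quantity_ordered(p=0.9):
    """Draw a quantity from a geometric distribution.

    The geometric distribution only produces 1, 2, 3, ... and with p close
    to 1 the vast majority of draws are exactly 1.
    """
    return int(rng.geometric(p))


monthly_totals = [orders_in_month() for _ in range(12)]
quantities = [quantity_ordered() for _ in range(1000)]
print(monthly_totals)
print(max(quantities))
```

A lower `p` makes larger quantities more common, which could suit cheap items that people tend to buy several of at once.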
We use the datetime library to allow us to generate thousands of different times for each purchase with the most common times peaking around 12pm and 8pm.
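One way to get the two peaks, sketched under some assumptions: pick one of the two peak hours at random, then spread the time around it with a normal distribution. The 2-hour standard deviation, the even split between peaks, and the output format are guesses for illustration, and `random_order_time` is a hypothetical helper.

```python
import calendar
import random
from datetime import datetime, timedelta

PEAK_HOURS = [12, 20]  # 12pm and 8pm
SPREAD_MINUTES = 120   # assumed standard deviation around each peak


def random_order_time(year, month, rng):
    """Return a random datetime in the given month, clustered near the peaks."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    day = rng.randint(1, days_in_month)

    peak_hour = rng.choice(PEAK_HOURS)
    # Redraw until the time falls inside the chosen day, so an order never
    # spills into the previous or next day (or month).
    while True:
        minutes = rng.gauss(peak_hour * 60, SPREAD_MINUTES)
        if 0 <= minutes < 24 * 60:
            break

    # Start at midnight and add the offset as a timedelta.
    return datetime(year, month, day) + timedelta(minutes=int(minutes))


sample = random_order_time(2019, 2, random.Random(0))
print(sample.strftime("%m/%d/%y %H:%M"))
```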
We take a list of the most common US street names to help us randomly generate an address for each purchase.
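A small sketch of the address idea: combine a random house number with a street name and a city. The street and city lists here are short placeholders rather than the actual list from the video, and the address layout is an assumption.

```python
import random

# Placeholder lists for illustration only.
STREET_NAMES = ["Main", "2nd", "1st", "Park", "Oak", "Maple", "Elm"]
CITIES = [
    ("San Francisco", "CA", "94016"),
    ("Boston", "MA", "02215"),
    ("Austin", "TX", "73301"),
]


def random_address(rng):
    """Build an address string like '123 Main St, Boston, MA 02215'."""
    number = rng.randint(1, 999)
    street = rng.choice(STREET_NAMES)
    city, state, zip_code = rng.choice(CITIES)
    return f"{number} {street} St, {city}, {state} {zip_code}"


address = random_address(random.Random(1))
print(address)
```

Passing `weights` to `rng.choices` instead of using `rng.choice` would let some cities show up more often than others.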
Hope you guys enjoy! Make sure to subscribe if you haven’t already 🙂
⭐ Kite is a free AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I’ve been using Kite for 6 months and I love it! https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=keithgalli&utm_content=description-only
Creator: @Chris Chann
0:00 – Intro & Background Info
1:15 – What we’re creating in this video!
2:03 – Start writing code (generating a simple dataframe & csv)
8:26 – Task: Making our data more realistic, selecting some products with higher probability than others
14:15 – Task: Generate 12 months' worth of data in 12 csvs (calendar library, f-strings)
18:12 – Make some months have more purchases than others
19:28 – Normal distributions in NumPy
23:43 – Improving speed of our code (making testing easier)
26:41 – Task: Generate random addresses for our data
35:03 – Task: Generate order times for purchases (datetime library overview)
40:02 – Using timedelta objects to add & subtract time from dates
45:09 – Generate a realistic quantity ordered for each product (using numpy geometric distribution)
49:38 – Make certain items more likely to be sold together & clean up the code a bit
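For the "12 csvs" task in the outline above, here is a hedged sketch of combining the calendar library with f-strings to name one file per month. The `Sales_{month}_{year}.csv` naming pattern, the column names, and the `write_monthly_csvs` helper are assumptions, and the standard csv module stands in for a dataframe's `to_csv` to keep the example dependency-free.

```python
import calendar
import csv
import tempfile
from pathlib import Path


def write_monthly_csvs(rows_by_month, year, out_dir):
    """Write one csv per month and return the list of paths written.

    rows_by_month maps a month number (1-12) to a list of row tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for month in range(1, 13):
        # calendar.month_name[1] is "January" (in the default locale).
        month_name = calendar.month_name[month]
        path = out_dir / f"Sales_{month_name}_{year}.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Order ID", "Product"])
            writer.writerows(rows_by_month.get(month, []))
        paths.append(path)
    return paths


# Demo: write into a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    written = write_monthly_csvs({1: [(1, "USB-C Charging Cable")]}, 2019, tmp)
    names = [p.name for p in written]

print(names[0])
```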
*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.