Solving real world data science tasks with Python Beautiful Soup! (movie dataset creation)





Data is everywhere! Enhance your career and acquire new skills by taking a course on DataCamp! Click here to take the first chapter of any course for FREE: https://bit.ly/36lKg44 (you’ll be supporting my channel too!)

In this video we scrape Wikipedia pages to create a dataset on Disney movies.

The video is formatted with tasks for you to try to solve on your own throughout. For the best learning experience, at each task you should pause the video, try the task on your own, and then resume when you want to see how I would solve it.

We cover a wide range of Python & data science topics in this video. They include:
– Web scraping with BeautifulSoup
– Cleaning data
– Testing code with pytest
– Pattern matching with regular expressions (re library)
– Working with dates (datetime library)
– Saving & loading data with the pickle library
– Accessing data from an API using the Requests library

Link to code & datasets: https://github.com/KeithGalli/disney-data-science-tasks
Previous tutorial on Beautiful Soup: https://youtu.be/GjKQ6V_ViQE

If you enjoyed this video, make sure to like & subscribe 🙂

This video was sponsored by DataCamp

———————
Video timeline!
0:00 – Video overview
1:58 – Check out DataCamp! (sponsored)
3:12 – Setup

Task #1: Scrape the infobox from the Toy Story 3 wiki page (save in a Python dictionary) (4:24)
Link: https://en.wikipedia.org/wiki/Toy_Story_3
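
A minimal sketch of what Task #1 asks for, assuming the infobox is a table with class "infobox" whose rows pair a th label with a td value. The HTML below is a small made-up stand-in so the example runs offline, and the function name infobox_to_dict is hypothetical rather than taken from the video's code.

```python
from bs4 import BeautifulSoup

# Made-up, simplified stand-in for a Wikipedia infobox (assumption: the real
# page uses a table with class "infobox" and one th/td pair per row).
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Toy Story 3</th></tr>
  <tr><th>Directed by</th><td>Lee Unkrich</td></tr>
  <tr><th>Running time</th><td>103 minutes</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Return the infobox rows as a {label: value} dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    info_box = soup.find("table", class_="infobox")
    if info_box is None:
        return {}

    movie_info = {}
    for index, row in enumerate(info_box.find_all("tr")):
        header = row.find("th")
        value = row.find("td")
        if index == 0 and header is not None:
            # Assumption: the first row holds the film title.
            movie_info["title"] = header.get_text(" ", strip=True)
        elif header is not None and value is not None:
            movie_info[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return movie_info


movie = infobox_to_dict(SAMPLE_HTML)
```

The method is spelled find_all. On a Beautiful Soup tag, an unknown attribute such as findall is treated as a lookup for a child tag of that name and returns None, so calling it raises "TypeError: 'NoneType' object is not callable".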

Task #2: Scrape infobox for all movies in List of Disney Films (save as list of dictionaries) (28:52)
Link: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films
30:30 – Robots.txt (Are you allowed to scrape a site?)
32:52 – Task #2: Scrape infobox for all movies in List of Disney Films (save as list of dictionaries)
57:27 – Save & Load dataset checkpoint (JSON file)
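
Two supporting steps from Task #2 can be sketched with the standard library: checking robots.txt rules before scraping, and saving the scraped list of dictionaries as a JSON checkpoint. The robots.txt rules, the URLs, and the function names below are made up for illustration; a real check would read the target site's own robots.txt.

```python
import json
import os
import tempfile
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration only. A real check would download
# the site's own file, for example with RobotFileParser.set_url() and .read().
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
""".strip().splitlines()


def allowed_to_fetch(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def save_data(path, data):
    """Write the scraped data to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_data(path):
    """Read the scraped data back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


can_scrape = allowed_to_fetch(SAMPLE_ROBOTS, "https://example.org/wiki/Toy_Story_3")
blocked = allowed_to_fetch(SAMPLE_ROBOTS, "https://example.org/private/page")

# Checkpoint: save once after the slow scrape, then reload instead of re-scraping.
movies = [{"title": "Toy Story 3", "Running time": "103 minutes"}]
checkpoint = os.path.join(tempfile.mkdtemp(), "disney_data.json")
save_data(checkpoint, movies)
reloaded = load_data(checkpoint)
```

Note that load_data needs a return statement because the caller wants the loaded object back, while save_data only writes a file and has nothing to hand back.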

Task #3: Clean our data! (1:02:04)
1:09:28 – Task #3.1: Strip out all references ([1], [2], etc.) from HTML
1:16:39 – Task #3.2: Split up the long strings
1:25:02 – Task #3.3: Examine errors we are getting
1:30:27 – Task #3.4: Convert “Running time” field to an integer
1:44:57 – Task #3.5: Convert “Budget” & “Box office” fields to floats
2:33:53 – Task #3.6: Convert dates into datetime objects
2:47:36 – Saving our data again (using Pickle)
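
The cleaning steps in Task #3 can be sketched as small helper functions. This is one possible approach under stated assumptions, not the video's implementation: it strips reference markers from text with a regular expression rather than from the HTML, it takes the lower figure when a money string is a range, and all function names and sample values are hypothetical.

```python
import pickle
import re
from datetime import datetime

NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def strip_references(text):
    """Remove citation markers such as [1] or [23] from a string."""
    return re.sub(r"\[\d+\]", "", text).strip()


def minutes_to_int(running_time):
    """Convert a string like '103 minutes' to 103; return None if no number is found."""
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None


def money_to_float(money):
    """Convert strings like '$200 million' or '$2 to $2.5 million' to a float.

    For a range, the first (lower) figure is used; that is a choice made here.
    """
    match = re.search(rf"\$?\s*({NUMBER})", money)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    word = re.search(r"(thousand|million|billion)", money, flags=re.IGNORECASE)
    return value * MULTIPLIERS[word.group(1).lower()] if word else value


def parse_date(text, formats=("%B %d, %Y", "%d %B %Y")):
    """Try each format in turn; return None when none of them match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


# Hand-written sample record in the shape the scraper is assumed to produce.
raw = {
    "Running time": "103 minutes[1]",
    "Budget": "$200 million",
    "Release date": "June 18, 2010[2]",
}
cleaned = {
    "Running time": minutes_to_int(strip_references(raw["Running time"])),
    "Budget": money_to_float(strip_references(raw["Budget"])),
    "Release date": parse_date(strip_references(raw["Release date"])),
}

# datetime objects are not JSON-serializable by default, so pickle is used for
# this checkpoint. With files, use pickle.dump / pickle.load on a file opened
# in "wb" / "rb" mode instead of dumps / loads.
restored = pickle.loads(pickle.dumps(cleaned))
```

A string with several dollar signs, such as "$2 to $2.5 million", is handled by reading only the first number and then applying the multiplier word found anywhere in the string.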

Task #4: Attach IMDb, Metascore, and Rotten Tomatoes scores to dataset (working with APIs) (2:53:18)
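
A hedged sketch of the score-attaching step, assuming an OMDb-style JSON API. The endpoint, the parameter names, the response fields, the environment variable name OMDB_API_KEY, and the sample values are all assumptions for illustration and should be checked against the documentation of the API actually used. No network call is made here; in practice the response dictionary would come from something like requests.get(url).json().

```python
import os
from urllib.parse import urlencode

# Assumption: an OMDb-style endpoint that takes an API key and a title.
API_URL = "http://www.omdbapi.com/"


def build_request_url(title, api_key):
    """Build the query URL for a film title (parameter names are assumed)."""
    return f"{API_URL}?{urlencode({'apikey': api_key, 't': title})}"


def extract_scores(api_response):
    """Pull the three scores out of a response dict in the assumed shape."""
    ratings = {r["Source"]: r["Value"] for r in api_response.get("Ratings", [])}
    return {
        "imdb": api_response.get("imdbRating"),
        "metascore": api_response.get("Metascore"),
        "rotten_tomatoes": ratings.get("Rotten Tomatoes"),
    }


# Hypothetical environment variable name; the placeholder lets the sketch run
# without a real key. Reading the key from the environment keeps it out of code.
api_key = os.environ.get("OMDB_API_KEY", "demo-key")
url = build_request_url("Toy Story 3", api_key)

# Hand-written sample in the assumed response shape (illustrative values,
# not real API output).
sample_response = {
    "Title": "Toy Story 3",
    "imdbRating": "8.3",
    "Metascore": "92",
    "Ratings": [{"Source": "Rotten Tomatoes", "Value": "98%"}],
}

movie = {"title": "Toy Story 3"}
movie.update(extract_scores(sample_response))
```

Using .get() for each field means a film missing from the API simply gets None for that score instead of raising a KeyError partway through the dataset.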

Task #5: Save final dataset as a JSON file and as a CSV file (3:13:48)
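
One way to approach Task #5 with only the standard library is sketched below. It assumes the cleaned records contain datetime values, which have to be converted to strings before JSON export, and that not every film has every field. The key names and sample records are hypothetical, and the output is built as strings here; writing to files works the same way with json.dump and a csv.DictWriter on an open file.

```python
import csv
import io
import json
from datetime import datetime

# Hand-written sample records; key names are illustrative.
movies = [
    {"title": "Toy Story 3", "Running time (int)": 103,
     "Release date (datetime)": datetime(2010, 6, 18)},
    {"title": "Example Film", "Budget (float)": 1_000_000.0},
]


def to_serializable(movie):
    """Copy a movie dict, turning datetime values into ISO-format strings."""
    return {k: v.isoformat() if isinstance(v, datetime) else v for k, v in movie.items()}


def dumps_json(records):
    """Return the dataset as a JSON string."""
    return json.dumps([to_serializable(m) for m in records], ensure_ascii=False, indent=2)


def dumps_csv(records):
    """Return the dataset as CSV text, with one column per key seen in any record."""
    rows = [to_serializable(m) for m in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills in columns that a given film does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = dumps_json(movies)
csv_text = dumps_csv(movies)
```

When writing the CSV to a real file, open it with newline="" so the csv module controls line endings.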

———————
Extra resources!
Setup Jupyter notebook: https://jupyter.readthedocs.io/en/latest/install/notebook-classic.html
Google Colab (cloud-based notebook): https://colab.research.google.com/
Learn regular expressions: https://youtu.be/K8L6KVGG-7o

⭐ Kite is a free AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I’ve been using Kite for 6 months and I love it! https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=keithgalli&utm_content=description-only

———————
Follow me on social media!
Instagram | https://www.instagram.com/keithgalli/
Twitter | https://twitter.com/keithgalli

If you are curious to learn how I make my tutorials, check out this video: https://youtu.be/LEO4igyXbLs

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.


Comment List

  • Keith Galli
    November 27, 2020

    Hey everyone! Been a while, but happy to be back 😊. Spent a while putting this video together so hope you enjoy it!

    For more great learning resources (and to support my channel in the process), be sure to check out DataCamp. Click here to take the first chapter of any course for FREE: https://bit.ly/36lKg44

    As always if you have any questions or suggestions for future videos, feel free to let me know here in the comments!

  • Keith Galli
    November 27, 2020

    How can I make my parser of 1000 pages faster? This is the code: https://github.com/laotiq/Kolesa-Data-Parser/blob/main/Kolesa_Parser.ipynb

  • Keith Galli
    November 27, 2020

    Can you please post tutorials on KivEnt🙏

  • Keith Galli
    November 27, 2020

    for months i have been struggling in how to structure my learning journey through projects and ive finally found you, THANKS MAN

  • Keith Galli
    November 27, 2020

    i cant believe this 30 dollar material is free on youtube. i am going to promote your channel.

  • Keith Galli
    November 27, 2020

    Phew!! That was one incredible tutorial!! A BIG THUMBS UP!!

  • Keith Galli
    November 27, 2020

    The problem is how do we use this data set?
    This video looks like a soup tutorial.

  • Keith Galli
    November 27, 2020

    Big fan my bro…

  • Keith Galli
    November 27, 2020

    Great video! but why don't you just use pd.read_html('url')?

  • Keith Galli
    November 27, 2020

    Great video, excellent explanation. Could you make more video in solving real world data science projects? I would like to learn more from you. And could you make another video in cleaning data in Python? Thank you so much. Wish you all the best.

  • Keith Galli
    November 27, 2020

    Hi Keith, I just stepped into the world of data scraping, thought of something and now wondering how someone would tackle such a situation.. You go a website and you find that some data is randomly hidden behind dropdowns and not in any unique form. Once you click the dropdown icon is the only time the data will load.. More problems, all the links with data have a dropdown icon so the icons aren't placed in unique places, on inspecting data is not seen but on clicking the dropdown is when data will display on the inspect tab

  • Keith Galli
    November 27, 2020

    Hi Keith,
    Thanks for making this video, made it much easy to understand the flow and practical use of bs4 for gathering data.

    Great work.

  • Keith Galli
    November 27, 2020

    Hey Keith, I like your courses and would like to make your acquaintance. Can we talk more over WhatsApp?

  • Keith Galli
    November 27, 2020

    Suppose there is a foundation that is dedicated to promote technological education in Honduras, El Salvador and Guatemala, massive intention and work, in few years illegal immigration would stop.

  • Keith Galli
    November 27, 2020

    Good project…Keith. Loved it very much. More expected.

  • Keith Galli
    November 27, 2020

    info_rows = info_box.findall("tr")
    ^
    TypeError: 'NoneType' object is not callable

  • Keith Galli
    November 27, 2020

    Thank you so much! Now I have the confidence to do projects on my own, you changed my life. It would be great if you could do videos on Tableau! 🙂

  • Keith Galli
    November 27, 2020

    It would be very exciting if you did an analysis report on Marketing Analytics. Thank you for making this video

  • Keith Galli
    November 27, 2020

    Brilliant content and presentation style, Keith. I got everything working except extracting the API key from the Environment variable (ended up hard coding which worked). Thanks again!

  • Keith Galli
    November 27, 2020

    You young people are hacking my system I'm calling the federal rangers of doom.

  • Keith Galli
    November 27, 2020

    I'm a 4th-year college student here in the Philippines and I want to be a data scientist, but it's too hard to learn,

    but I hope I can fulfill my dream im trying my best. Grinddddddddddddddddd and thank you Keith Galli

  • Keith Galli
    November 27, 2020

    We are infinitely indebted to you.
    Thanks for sharing this wonderful content 🙏🏽🔥.

  • Keith Galli
    November 27, 2020

    wonderful work in putting together this very easy to understand tutorial. a big thank you for me.

    would be great for a follow up is to put this on a schedule and load this onto a Heroku server or in a docker file for a home NAS to run. this would be great to have a periodic scraping of news or Covid update data to be sent to our own phones via Telegram or Slack or Discord.

  • Keith Galli
    November 27, 2020

    Keith, could you please explain why 'return' key word is mandatory when we load the pickle but can be skipped when we save data as pickle?

  • Keith Galli
    November 27, 2020

    Thank you for sharing! You are amazing. It would be great if you make videos about docker, and spark.

  • Keith Galli
    November 27, 2020

    Why am I getting a separate release date key when converting the datetime to a string? Now I have two release dates with the same key. And how am I supposed to remove either one when both of them have the same name… please help me out

  • Keith Galli
    November 27, 2020

    Hey! I love the videos they are super helpful in giving me the info to start my own projects. What would you think about doing regression videos using sklearn library (or a better library)? I can't find anything good on the internet that actually helps me learn how use it for myself later on.

    Edit: I finally figured it out and it was surprisingly simple to do linear regression with just a few lines of code. Some regression analysis videos would still be awesome though.

  • Keith Galli
    November 27, 2020

    Hair style is good one too!

  • Keith Galli
    November 27, 2020

    That is great thanks!

  • Keith Galli
    November 27, 2020

    Hi Keith! I work for a company that is building an online learning platform. Your videos would be a perfect fit! Would it be possible to get an email ID where I can contact you for this?

  • Keith Galli
    November 27, 2020

    Hi sir, I'm from Malaysia, just wondering, does a data scientist need a degree in order to land a job? Does this apply anywhere in the world or just in the USA?

  • Keith Galli
    November 27, 2020

    Awesome <3 . Waiting for the analysis of these data

  • Keith Galli
    November 27, 2020

    I'm not sure if the money conversion function will be able to convert a string with multiple dollar signs properly,
    like this – "$2 to $2.5 million"
    I just removed all dollar signs except the first one

  • Keith Galli
    November 27, 2020

    You make a great teacher but I suppose you already know that 😊 Let us know how we can support you! (other than DataCamp) 🧡

  • Keith Galli
    November 27, 2020

    Maybe you can make a "searching algorithms" challenges series

  • Keith Galli
    November 27, 2020

    Best instructor ever.
    Dude your lecturing skills are priceless.
    Amazing content ❤
