Python tutorial: HTTP requests to import data from the web
Learn how to perform HTTP requests to import web data with Python: https://www.datacamp.com/courses/importing-data-in-python-part-2
Congrats on importing your first web data! To import files from the web, we used the urlretrieve function from urllib.request. Let’s now unpack this a bit and, in the process, understand a few things about how the internet works.
URL stands for Uniform (or Universal) Resource Locator, and a URL is really just a reference to a web resource. The vast majority of URLs are web addresses, but they can refer to a few other things, such as File Transfer Protocol (FTP) servers and database access. For now, we’ll focus on URLs that are web addresses, that is, the locations of websites.
Such a URL consists of two parts:
A protocol identifier, http or https, and
A resource name, such as datacamp.com
The combination of protocol identifier and resource name uniquely specifies the web address!
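If you want to see the two parts for yourself, here is a minimal sketch using urlsplit from the standard library’s urllib.parse module. The URL is simply the course link from the top of this tutorial. Note that urlsplit also reports a path, the location of a page on the host, which the two-part description above folds into the resource name.

```python
from urllib.parse import urlsplit

# URL taken from the course link at the top of this tutorial
url = "https://www.datacamp.com/courses/importing-data-in-python-part-2"

parts = urlsplit(url)

print(parts.scheme)  # the protocol identifier
print(parts.netloc)  # the resource name (the host)
print(parts.path)    # the location of the page on that host
```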
To explain URLs, I have introduced yet another acronym, HTTP, which stands for HyperText Transfer Protocol. Wikipedia provides a great description of HTTP:
The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web.
Note that HTTPS is a more secure form of HTTP. Each time you go to a website, you are actually sending an HTTP request to a server. This request is known as a GET request, by far the most common type of HTTP request.
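As a small illustration, the sketch below builds a request with Request from urllib.request and asks it which HTTP method it would use. Nothing is sent over the network here. The second request carries a made-up payload purely to show that GET is the default rather than the only option; that payload is an assumption of this sketch and not something used elsewhere in this tutorial.

```python
from urllib.request import Request

url = "https://www.datacamp.com"

# Building a Request object does not send anything over the network
request = Request(url)
print(request.get_method())

# Hypothetical payload, only here to show how the method changes
request_with_data = Request(url, data=b"name=value")
print(request_with_data.get_method())
```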
We are actually performing a GET request when using the function urlretrieve. The ingenuity of urlretrieve also lies in the fact that it not only makes a GET request but also saves the relevant data locally.
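Here is a rough sketch of that idea wrapped in a small helper. The name download_file is made up for this example; urlretrieve itself does the GET request and the saving.

```python
from urllib.request import urlretrieve


def download_file(url, filename):
    """Send a GET request to url and save the response body as filename."""
    # urlretrieve returns the local path and the response headers
    local_path, headers = urlretrieve(url, filename)
    return local_path
```

Calling download_file('https://example.com/data.csv', 'data.csv') would save the file in your working directory; the URL here is only a placeholder, so swap in the address of a real file.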
In the following, you’ll learn how to make more GET requests to store web data in your environment. In particular, you’ll figure out how to get the HTML data from a webpage. HTML stands for Hypertext Markup Language and is the standard markup language for the web.
To extract the HTML from the Wikipedia home page, you:
Import the necessary functions;
Specify the URL;
Package the GET request using the function Request;
Send the request and catch the response using the function urlopen;
This returns an HTTPResponse object, which has an associated read method;
You then apply this read method to the response, which returns the HTML as bytes in Python 3 (call decode on the result if you need a string), and you store it in the variable html.
You remember to be polite and close the response!
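The steps above can be sketched as a small function. The name get_html is made up for this example, and the try/finally is just one way of making sure the response gets closed even if reading fails.

```python
from urllib.request import Request, urlopen


def get_html(url):
    """Return the raw HTML of url, following the steps listed above."""
    request = Request(url)        # package the GET request
    response = urlopen(request)   # send it and catch the response
    try:
        html = response.read()    # bytes in Python 3
    finally:
        response.close()          # be polite and close the response
    return html
```

For the Wikipedia home page, you would call get_html('https://www.wikipedia.org/'), assuming that is the address you want, and then get_html(...).decode('utf-8') if you would rather work with a string than with bytes.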
Now we are going to do the same thing, but this time we’ll use the requests package, which provides a wonderful API for making requests. According to the requests package website:
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
and the following organizations claim to use requests internally:
Her Majesty’s Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed
Requests is one of the most downloaded Python packages of all time, pulling in over 7,000,000 downloads every month. All the cool kids are doing it!
Let’s now see requests at work:
Import the package requests;
Specify the URL;
Package the request, send the request and catch the response with a single function, requests.get();
Read the text attribute of the response, which returns the HTML as a string.
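Put together, these steps might look something like the sketch below. It assumes the requests package is installed. The function name get_html_with_requests is made up for this example, and the timeout is an extra precaution added here rather than one of the steps above.

```python
import requests


def get_html_with_requests(url):
    """Return the HTML of url as a string using the requests package."""
    # Package the request, send it and catch the response in one call;
    # the timeout is an addition of this sketch, not one of the steps above
    response = requests.get(url, timeout=10)
    # text is an attribute of the response and holds the decoded body
    return response.text
```

You could try it with get_html_with_requests('https://www.wikipedia.org/'), again assuming that is the page you are after.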
That’s enough out of me for the time being: let’s get you hacking away at pulling down some HTML from the web using GET requests!