Python NumPy: Importing flat files
Learn how to import flat files using NumPy: https://www.datacamp.com/courses/importing-data-in-python-part-1
Okay, so you now know how to use Python’s built-in open function to open text files. What if you now want to import a flat file and assign it to a variable? If all the data are numerical, you can use the package NumPy to import the data as a NumPy array. Why would we want to do this? First off, NumPy arrays are the Python standard for storing numerical data. They are efficient, fast and clean. Secondly, NumPy arrays are often essential for other packages, such as scikit-learn, a popular machine learning package for Python.
NumPy itself has a number of built-in functions that make it far easier and more efficient for us to import data as arrays. Enter the NumPy functions loadtxt and genfromtxt. To use either of these we first need to import NumPy. We then call loadtxt and pass it the filename as the first argument, along with the delimiter via the delimiter keyword argument. Note that the default delimiter is any white space, so for comma- or tab-separated files we’ll need to specify it explicitly.
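As a minimal sketch of the call just described: the comma-separated text below is made-up sample data, and it is wrapped in io.StringIO only so the example runs without a file on disk. With a real file you would pass the filename as the first argument instead.

```python
import io
import numpy as np

# Hypothetical comma-separated numeric data standing in for a flat file
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# loadtxt accepts a filename or a file-like object as its first argument;
# the delimiter is given explicitly because the default is white space
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(type(data))   # a NumPy ndarray
print(data.shape)   # (2, 3)
```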
There are a number of additional arguments you may wish to specify. If, for example, your data consists of numerics and your header has strings in it, such as in the MNIST digits data, you will want to skip the first row by calling loadtxt with the argument skiprows=1; if you want only the 1st and 3rd columns of the data, you’ll want to set usecols=[0, 2], the list containing the ints 0 and 2. You can also import different datatypes into NumPy arrays: for example, setting the argument dtype=str will ensure that all entries are imported as strings. loadtxt is great for basic cases, but tends to break down when we have mixed datatypes, for example, columns consisting of floats AND columns consisting of strings, such as we saw in the Titanic dataset.
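Here is a short sketch of those three arguments in action. The small table with a string header is invented for illustration, and io.StringIO again stands in for a real file.

```python
import io
import numpy as np

# Hypothetical file contents: a string header row followed by numeric rows
text = "a,b,c\n1,2,3\n4,5,6\n"

# Skip the header row and keep only the 1st and 3rd columns (indices 0 and 2)
subset = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=[0, 2])
print(subset.shape)  # (2, 2)

# Import every entry as a string; the header row can now be read too
as_str = np.loadtxt(io.StringIO(text), delimiter=",", dtype=str)
print(as_str.shape)  # (3, 3)
```

Note that without skiprows=1 and with the default float dtype, the header row would raise an error, because the strings in it cannot be converted to floats.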
Now it’s your turn to have fun with loadtxt. You’ll also gain hands-on experience with other functions that can handle mixed datatypes. In the next video we’ll see that, although NumPy arrays can handle data of mixed types, the natural place for such data really is the dataframe.
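As a taste of one such function, here is a hedged sketch using genfromtxt, which was mentioned earlier. The passenger rows and column names are invented for illustration and are not the actual Titanic data. Passing dtype=None asks genfromtxt to work out a type for each column, and names=True takes the field names from the header row.

```python
import io
import numpy as np

# Hypothetical mixed-type data: one string column, one int column, one float column
text = "name,age,fare\nAlice,29,7.25\nBob,35,71.28\n"

# dtype=None lets genfromtxt infer a type per column;
# names=True reads the column names from the first row
passengers = np.genfromtxt(
    io.StringIO(text), delimiter=",", names=True, dtype=None, encoding="utf-8"
)

print(passengers.dtype.names)  # the column names taken from the header
print(passengers["age"])       # one column, accessed by name
```

The result is a one-dimensional structured array in which each element is a row, which is part of why a dataframe tends to be a more natural home for this kind of data.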