(Tutorial) Python Select Columns – DataCamp
[ad_1]
If you have a DataFrame and would like to access or select a specific few rows/columns from that DataFrame, you can use square brackets or other advanced methods such as loc
and iloc
.
Selecting Columns Using Square Brackets
Now suppose that you want to select the country column from the brics
DataFrame. To achieve this, you will type brics
and then the column label inside the square brackets.
Selecting a Column
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Dehli 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics["country"]
BR Brazil
RU Russia
IN India
CH China
SA South Africa
Name: country, dtype: object
Checking the Type of the Object
Let’s check the type of the object that gets returned with the type function.
type(brics["country"])
pandas.core.series.Series
As we can see from the above output, we are dealing with a pandas
series here! Series could be thought of as a one-dimensional array that could be labeled just like a DataFrame.
If you want to select data and keep it in a DataFrame, you will need to use double square brackets:
brics[["country"]]
BR Brazil
RU Russia
IN India
CH China
SA South Africa
If we check the type of this output, it’s a DataFrame! With only one column, though.
type(brics[["country"]])
pandas.core.frame.DataFrame
Selecting Multiple Columns
You can extend this call to select two columns. Let’s try to select country
and capital
.
brics[["country", "capital"]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Dehli
CH China Beijing
SA South Africa Pretoria
If you look at this closely, you are actually putting a list with column labels inside another set of square brackets and end up with a sub DataFrame containing only the country
and capital
columns.
Selecting Rows Using Square Brackets
Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame.
Example
You can only select rows using square brackets if you specify a slice, like 0:4
. Also, you’re using the integer indexes of the rows here, not the row labels!
To get the second, third, and fourth rows of brics
DataFrame, we use the slice 1 through 4. Remember that end the of the slice is exclusive, and the index starts at zero.
brics[1:4]
country capital area population
RU Russia Moscow 17.100 143.50
IN India New Dehli 3.286 1252.00
CH China Beijing 9.597 1357.00
These square brackets work, but they only offer limited functionality. Ideally, we would want something similar to 2D Numpy
arrays, where you also use square brackets. The index, or slice, before the comma refers to the rows, and the slice after the comma refers to the columns.
Example of 2D Numpy
array:
my_array[rows, columns]
If you want to do something similar with pandas
, you need to look at using the loc
and iloc
functions.
loc
: label-basediloc
: integer position-based
loc
Function
loc
is a technique to select parts of your data based on labels. Let’s look at the brics DataFrame and get the rows for Russia.
To achieve this, you will put the label of interest in square brackets after loc
.
Selecting Rows
brics.loc["RU"]
country Russia
capital Moscow
area 17.1
population 143.5
Name: RU, dtype: object
We get a pandas
series containing all of the rows information; inconveniently, though, it is shown on different lines. To get a DataFrame, we have to put the RU
sting in another pair of brackets. We can also select multiple rows at the same time. Suppose you want to also include India and China. Simply add those row labels to the list.
brics.loc[["RU", "IN", "CH"]]
country capital area population
RU Russia Moscow 17.100 143.50
IN India New Dehli 3.286 1252.00
CH China Beijing 9.597 1357.00
The difference between using a loc
and basic square brackets is that you can extend the selection with a comma and a specification of the columns of interest.
Selecting Rows and Columns
Let’s extend the previous call to only include the country
and capital
columns. We add a comma and list the column labels we want to keep. The intersection gets returned.
brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
country capital
RU Russia Moscow
IN India New Dehli
CH China Beijing
You can also use loc
to select all rows but only a specific number of columns. Simply replace the first list that specifies the row labels with a colon. A slice going from beginning to end. This time, we get back all of the rows but only two columns.
Selecting All Rows and Specific Columns
brics.loc[:, ["country", "capital"]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Dehli
CH China Beijing
SA South Africa Pretoria
iloc
Function
The iloc
function allows you to subset pandas
DataFrames based on their position or index.
Selecting Rows
Let’s use the same data and similar examples as we did for loc
. Let’s start by getting the row for Russia.
brics.iloc[[1]]
country capital area population
RU Russia Moscow 17.100 143.50
To get the rows for Russia, India, and China. You can now use a list of index 1, 2, 3
.
brics.iloc[[1, 2, 3]]
country capital area population
RU Russia Moscow 17.100 143.50
IN India New Dehli 3.286 1252.00
CH China Beijing 9.597 1357.00
Selecting Rows and Columns
Similar to loc
, we can also select both rows and columns using iloc
. Here, we will select rows for Russia, India, and China and columns country
and capital
.
brics.iloc[[1, 2, 3], [0, 1]]
country capital
RU Russia Moscow
IN India New Dehli
CH China Beijing
Selecting All Rows and Specific Columns
Finally, if you wanted to select all rows but just keep the country
and capital
columns, you can:
brics.loc[:, [0, 1]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Dehli
CH China Beijing
SA South Africa Pretoria
loc
and iloc
functions are pretty similar. The only difference is how you refer to columns and rows.
Interactive Example on Selecting a Subset of Data
In the following example, the cars data is imported from a CSV files as a Pandas DataFrame. To select only the cars_per_cap column
from cars, you can use:
cars['cars_per_cap']
cars[['cars_per_cap']]
The single bracket version gives a Pandas Series; the double bracket version gives a Pandas DataFrame.
- You will use single square brackets to print out the
country
column ofcars
as a Pandas Series. - Then use double square brackets to print out the
country
column ofcars
as a Pandas DataFrame. - Finally, use the double square brackets to print out a DataFrame with both the
country
anddrives_right
columns ofcars
, in this order.
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out country column as Pandas Series
print(cars['country'])
# Print out country column as Pandas DataFrame
print(cars[['country']])
# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])
When we run the above code, it produces the following result:
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
country
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
country drives_right
US United States True
AUS Australia False
JPN Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
To learn more about pandas, please see this video from our course Intermediate Python.
This content is taken from DataCamp’s Intermediate Python course by Hugo Bowne-Anderson.
[ad_2]
Source link