Add a Column in a Pandas DataFrame Based on an If-Else Condition
When we’re doing data analysis with Python, we often want to add a column to a pandas DataFrame based on the values in other columns of that DataFrame.
Although this sounds straightforward, it can get a bit complicated if we try to do it using an if-else conditional. Thankfully, there’s a simple way to do this using numpy!
To learn how to use it, let’s look at a specific data analysis question. We’ve got a dataset of more than 4,000 Dataquest tweets. Do tweets with attached images get more likes and retweets? Let’s do some analysis to find out!
We’ll start by importing pandas and numpy, and loading up our dataset to see what it looks like. (If you’re not already familiar with using pandas and numpy for data analysis, check out our interactive numpy and pandas course).
import pandas as pd
import numpy as np

# Load the tweets dataset and preview the first five rows
df = pd.read_csv('dataquest_tweets_csv.csv')
df.head()
We can see that our dataset contains a bit of information about each tweet, including:
date: the date the tweet was posted
time: the time of day the tweet was posted
tweet: the actual text of the tweet
mentions: any other Twitter users mentioned in the tweet
photos: the URL of any images included in the tweet
replies_count: the number of replies on the tweet
retweets_count: the number of retweets of the tweet
likes_count: the number of likes on the tweet
We can also see that the photos data is formatted a bit oddly.
Adding a Pandas Column with a True/False Condition Using np.where()
For our analysis, we just want to see whether tweets with images get more interactions, so we don’t actually need the image URLs. Let’s try to create a new column called hasimage that will contain Boolean values: True if the tweet included an image and False if it didn’t.
To accomplish this, we’ll use numpy’s built-in where() function. This function takes three arguments in sequence: the condition we’re testing for, the value to assign to our new column if that condition is true, and the value to assign if it is false. It looks like this:
np.where(condition, value if condition is true, value if condition is false)
In our data, we can see that tweets without images always have the same empty value in the photos column. We can use this information and np.where() to create our new column, hasimage, like so:
# True where the photos value differs from the empty placeholder, False otherwise
df['hasimage'] = np.where(df['photos'] != '', True, False)
df.head()
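To make the behavior of np.where() concrete, here is a minimal sketch on a tiny made-up DataFrame. The name toy, its rows, and the example URL are hypothetical stand-ins rather than the real tweet data, and the empty string is assumed as the "no image" marker only because that is what the comparison above checks for.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tweets dataset; '' marks a tweet without photos,
# mirroring the comparison used in the article's code.
toy = pd.DataFrame({
    'tweet': ['first', 'second', 'third'],
    'photos': ['', 'https://example.com/a.png', ''],
})

# np.where(condition, value_if_true, value_if_false) works element by element.
toy['hasimage'] = np.where(toy['photos'] != '', True, False)

# For a plain True/False column, the comparison on its own gives the same values.
same_result = toy['photos'] != ''

print(toy['hasimage'].tolist())
print(same_result.tolist())
```

np.where() becomes more useful when the two outcomes are not simply True and False, for example labels such as 'image' and 'no image'.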
Above, we can see that our new column has been appended to our data set, and it has correctly marked tweets that included images as True and others as False.
Now that we’ve got our hasimage column, let’s quickly make a couple of new DataFrames, one for all of the image tweets and one for all of the no-image tweets. We’ll do this using a Boolean filter:
# Keep only the rows where hasimage is True (or False)
image_tweets = df[df['hasimage'] == True]
no_image_tweets = df[df['hasimage'] == False]
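The same Boolean filter can be written a little more compactly, because a Boolean column already works as a mask. The sketch below uses a small made-up DataFrame (toy and its values are hypothetical) to show the idea.

```python
import pandas as pd

# Hypothetical data standing in for the tweets DataFrame.
toy = pd.DataFrame({
    'likes_count': [12, 3, 20, 1],
    'hasimage': [True, False, True, False],
})

# A Boolean column can be used directly as a mask; ~ inverts it.
image_tweets = toy[toy['hasimage']]
no_image_tweets = toy[~toy['hasimage']]

print(len(image_tweets), len(no_image_tweets))
```

Both spellings select the same rows; comparing against True and False explicitly, as above, is simply more verbose.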
Now that we’ve created those, we can use built-in pandas math functions like .mean() to quickly compare the tweets in each DataFrame. We’ll use print() statements to make the results a little easier to read. We’ll also need to remember to use str() to convert the result of our .mean() calculation into a string so that we can use it in our print statement:
# Average likes for all tweets, image tweets, and no-image tweets
print('Average likes, all tweets: ' + str(df['likes_count'].mean()))
print('Average likes, image tweets: ' + str(image_tweets['likes_count'].mean()))
print('Average likes, no image tweets: ' + str(no_image_tweets['likes_count'].mean()))
print('\n')
# Average retweets for the same three groups
print('Average RTs, all tweets: ' + str(df['retweets_count'].mean()))
print('Average RTs, image tweets: ' + str(image_tweets['retweets_count'].mean()))
print('Average RTs, no image tweets: ' + str(no_image_tweets['retweets_count'].mean()))
Average likes, all tweets: 6.209759328770148
Average likes, image tweets: 14.21042471042471
Average likes, no image tweets: 5.176514584891549

Average RTs, all tweets: 1.5553102230072864
Average RTs, image tweets: 3.5386100386100385
Average RTs, no image tweets: 1.2991772625280478
Based on these results, it seems like including images may promote more Twitter interaction for Dataquest. Tweets with images averaged nearly three times as many likes and retweets as tweets that had no images.
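The comparison above can also be done without building separate DataFrames. The following is a rough sketch, not the approach used in this post: it groups a small made-up DataFrame (toy and its numbers are hypothetical) by hasimage and uses f-strings, which format numbers directly and so avoid the str() conversion.

```python
import pandas as pd

# Hypothetical data; the real dataset has thousands of rows.
toy = pd.DataFrame({
    'hasimage': [True, False, True, False],
    'likes_count': [10, 2, 20, 4],
    'retweets_count': [4, 1, 6, 1],
})

# One row per hasimage value, one column per metric.
means = toy.groupby('hasimage')[['likes_count', 'retweets_count']].mean()

# f-strings accept numbers as they are, so no str() call is needed.
for has_image, row in means.iterrows():
    label = 'image tweets' if has_image else 'no image tweets'
    print(f"Average likes, {label}: {row['likes_count']:.2f}")
    print(f"Average RTs, {label}: {row['retweets_count']:.2f}")
```

Grouping scales more gracefully if the column later takes more than two values, since no extra filtered DataFrames are needed.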
Adding a Pandas Column with More Complicated Conditions
That approach worked well, but what if we wanted to add a new column with more complex conditions, one that goes beyond True and False?
For example, to dig deeper into this question, we might want to create a few interactivity “tiers” and assess what percentage of tweets that reached each tier contained images. For simplicity’s sake, let’s use likes to measure interactivity, and separate tweets into four tiers:
tier_4: 2 or fewer likes
tier_3: 3-9 likes
tier_2: 10-15 likes
tier_1: 16+ likes
To accomplish this, we can use a function called np.select(). We’ll give it two arguments: a list of our conditions, and a corresponding list of the values we’d like to assign to each row in our new column.
This means that the order matters: if the first condition in our conditions list is met, the first value in our values list will be assigned to our new column for that row. If the second condition is met, the second value will be assigned, et cetera.
Let’s take a look at how this looks in Python code:
# One condition per tier, in the same order as the values list below
conditions = [
    (df['likes_count'] <= 2),
    (df['likes_count'] > 2) & (df['likes_count'] <= 9),
    (df['likes_count'] > 9) & (df['likes_count'] <= 15),
    (df['likes_count'] > 15)
]

# The value assigned when the condition at the same position is met
values = ['tier_4', 'tier_3', 'tier_2', 'tier_1']

df['tier'] = np.select(conditions, values)
df.head()
Awesome! We’ve created another new column that categorizes each tweet based on our (admittedly somewhat arbitrary) tier ranking system.
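To see why the order of conditions matters, here is a small sketch on made-up like counts (toy is hypothetical). Because np.select() uses the first condition that matches, each later condition only needs an upper bound. The default argument covers rows that match nothing; the label 'unclassified' is an assumption for this example, and passing a string default may also avoid a dtype error that some recent NumPy versions appear to raise when string values are combined with the numeric default of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts that sit on both sides of every tier boundary.
toy = pd.DataFrame({'likes_count': [0, 2, 3, 9, 10, 15, 16, 40]})

# Conditions are checked in order; the first match wins for each row.
conditions = [
    toy['likes_count'] <= 2,
    toy['likes_count'] <= 9,
    toy['likes_count'] <= 15,
    toy['likes_count'] > 15,
]
values = ['tier_4', 'tier_3', 'tier_2', 'tier_1']

# default is used for rows that match no condition (for example, missing counts).
# 'unclassified' is a made-up label chosen for this sketch.
toy['tier'] = np.select(conditions, values, default='unclassified')

print(toy['tier'].tolist())
```

Reordering the conditions in this sketch would change the result, since a tweet with 1 like satisfies the first three conditions at once.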
Now, we can use this to answer more questions about our data set. For example: what percentage of tier 1 and tier 4 tweets have images?
df[(df['tier'] == 'tier_4')]['hasimage'].value_counts(normalize=True)
False    0.948784
True     0.051216
Name: hasimage, dtype: float64
df[(df['tier'] == 'tier_1')]['hasimage'].value_counts(normalize=True)
False    0.836842
True     0.163158
Name: hasimage, dtype: float64
Here, we can see that while images seem to help, they don’t seem to be necessary for success. More than 83% of Dataquest’s “tier 1” tweets, the tweets with 16+ likes, had no image attached.
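The per-tier shares can also be computed for every tier at once. As a hedged sketch on made-up rows (toy is hypothetical), pd.crosstab() with normalize='index' returns one row per tier whose values sum to 1:

```python
import pandas as pd

# Hypothetical tiers and image flags, not the real tweet data.
toy = pd.DataFrame({
    'tier': ['tier_1', 'tier_1', 'tier_4', 'tier_4', 'tier_4', 'tier_4'],
    'hasimage': [True, False, False, False, False, True],
})

# normalize='index' turns the counts in each row (tier) into shares.
shares = pd.crosstab(toy['tier'], toy['hasimage'], normalize='index')

print(shares)
```

Each row of the result corresponds to one value_counts(normalize=True) call like the two shown above.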
While this is a very superficial analysis, we’ve accomplished our true goal here: adding columns to pandas DataFrames based on conditional statements about values in our existing columns.
Of course, this is a task that can be accomplished in a wide variety of ways. np.where() and np.select() are just two of many potential approaches. If you’d like to learn more of this sort of thing, check out Dataquest’s interactive Numpy and Pandas course, and the other courses in the Data Scientist in Python career path.
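As one illustration of those other approaches, numeric ranges like the like-count tiers can be expressed with pandas’ own pd.cut(). This is a sketch under assumptions, not a method used in this post: toy and its values are made up, and the bin edges are chosen to reproduce the four tiers described earlier.

```python
import pandas as pd

# Hypothetical like counts that sit on both sides of every tier boundary.
toy = pd.DataFrame({'likes_count': [0, 2, 3, 9, 10, 15, 16, 40]})

# Bins are right-inclusive by default: (-inf, 2], (2, 9], (9, 15], (15, inf]
bins = [float('-inf'), 2, 9, 15, float('inf')]
labels = ['tier_4', 'tier_3', 'tier_2', 'tier_1']

# pd.cut returns a categorical column with one label per bin.
toy['tier'] = pd.cut(toy['likes_count'], bins=bins, labels=labels)

print(toy['tier'].astype(str).tolist())
```

pd.cut() only handles a single numeric column split into ranges, so np.select() remains the more general choice when conditions involve several columns.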