Add a Column in a Pandas DataFrame Based on an If-Else Condition



When we’re doing data analysis with Python, we sometimes want to add a column to a pandas DataFrame based on the values in other columns of the DataFrame.

Although this sounds straightforward, it can get a bit complicated if we try to do it using an if-else conditional. Thankfully, there’s a simple, great way to do this using numpy!

To learn how to use it, let’s look at a specific data analysis question. We’ve got a dataset of more than 4,000 Dataquest tweets. Do tweets with attached images get more likes and retweets? Let’s do some analysis to find out!

We’ll start by importing pandas and numpy, and loading up our dataset to see what it looks like. (If you’re not already familiar with using pandas and numpy for data analysis, check out our interactive numpy and pandas course.)

import pandas as pd
import numpy as np

df = pd.read_csv('dataquest_tweets_csv.csv')
[Image: the baseline DataFrame, before any columns are added]

We can see that our dataset contains a bit of information about each tweet, including:

  • date: the date the tweet was posted
  • time: the time of day the tweet was posted
  • tweet: the actual text of the tweet
  • mentions: any other Twitter users mentioned in the tweet
  • photos: the URL of any images included in the tweet
  • replies_count: the number of replies on the tweet
  • retweets_count: the number of retweets of the tweet
  • likes_count: the number of likes on the tweet

We can also see that the photos data is formatted a bit oddly.

Adding a Pandas Column with a True/False Condition Using np.where()

For our analysis, we just want to see whether tweets with images get more interactions, so we don’t actually need the image URLs. Let’s try to create a new column called hasimage that will contain Boolean values: True if the tweet included an image and False if it didn’t.

To accomplish this, we’ll use numpy’s built-in where() function. This function takes three arguments in sequence: the condition we’re testing for, the value to assign to our new column if that condition is true, and the value to assign if it is false. It looks like this:

np.where(condition, value if condition is true, value if condition is false)
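
As a quick illustration of that three-argument form, here is a minimal sketch on a small array. The like counts, the threshold of 10, and the 'popular'/'quiet' labels are made up for this example and do not come from the tweet dataset.

```python
import numpy as np

# Hypothetical like counts, not taken from the tweet dataset
likes = np.array([0, 12, 3, 25])

# For each element: 'popular' where the condition holds, 'quiet' where it doesn't
labels = np.where(likes >= 10, 'popular', 'quiet')
print(labels)
```

The condition is evaluated element by element, so the result has the same length as the input.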

In our data, we can see that tweets without images always have the value [] in the photos column. We can use this information and np.where() to create our new column, hasimage, like so:

df['hasimage'] = np.where(df['photos'] != '[]', True, False)
[Image: the DataFrame with the new hasimage column added]

Above, we can see that our new column has been appended to our data set, and it has correctly marked tweets that included images as True and others as False.
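
One side note, offered as a sketch rather than as part of the original walkthrough: when the two outcomes are simply True and False, the comparison itself already produces a Boolean result, so np.where() is mainly useful when the outcomes are other values. The tiny DataFrame below is a made-up stand-in for the real photos column, and the URL is a placeholder.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the photos column; '[]' marks a tweet without images
df = pd.DataFrame({'photos': ['[]', "['https://example.com/a.jpg']", '[]']})

via_where = np.where(df['photos'] != '[]', True, False)
via_comparison = df['photos'] != '[]'  # the comparison is already Boolean

# Both approaches mark the same rows
print((via_where == via_comparison).all())
```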

Now that we’ve got our hasimage column, let’s quickly make a couple of new DataFrames, one for all of the image tweets and one for all of the no-image tweets. We’ll do this using a Boolean filter:

image_tweets = df[df['hasimage'] == True]
no_image_tweets = df[df['hasimage'] == False]

Now that we’ve created these, we can use built-in pandas math functions like .mean() to quickly compare the tweets in each DataFrame.

We’ll use print() statements to make the results a little easier to read. We’ll also need to remember to use str() to convert the result of our .mean() calculation into a string so that we can use it in our print statement:

print('Average likes, all tweets: ' + str(df['likes_count'].mean()))
print('Average likes, image tweets: ' + str(image_tweets['likes_count'].mean()))
print('Average likes, no image tweets: ' + str(no_image_tweets['likes_count'].mean()))

print('Average RTs, all tweets: ' + str(df['retweets_count'].mean()))
print('Average RTs, image tweets: ' + str(image_tweets['retweets_count'].mean()))
print('Average RTs, no image tweets: ' + str(no_image_tweets['retweets_count'].mean()))

Average likes, all tweets: 6.209759328770148
Average likes, image tweets: 14.21042471042471
Average likes, no image tweets: 5.176514584891549

Average RTs, all tweets: 1.5553102230072864
Average RTs, image tweets: 3.5386100386100385
Average RTs, no image tweets: 1.2991772625280478

Based on these results, it seems like including images may promote more Twitter interaction for Dataquest. Tweets with images averaged nearly three times as many likes and retweets as tweets that had no images.
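
The same comparison can also be made without building separate DataFrames. The sketch below is one possible alternative, not the approach used above: it groups by hasimage and averages both metrics at once, and it uses an f-string so that no str() call is needed. The numbers are invented for illustration and are not the tweet data.

```python
import pandas as pd

# Made-up numbers for illustration only
df = pd.DataFrame({
    'hasimage': [True, False, False, True],
    'likes_count': [10, 2, 4, 20],
    'retweets_count': [3, 1, 1, 5],
})

# One row per hasimage value (False, then True), one column per metric
summary = df.groupby('hasimage')[['likes_count', 'retweets_count']].mean()
print(summary)

# A Boolean mask selects the image tweets; the f-string formats the number
image_mean = df.loc[df['hasimage'], 'likes_count'].mean()
print(f'Average likes, image tweets: {image_mean:.2f}')
```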

Adding a Pandas Column with More Complicated Conditions

That approach worked well, but what if we wanted to add a new column with more complex conditions, one that goes beyond True and False?

For example, to dig deeper into this question, we might want to create a few interactivity “tiers” and assess what percentage of tweets that reached each tier contained images. For simplicity’s sake, let’s use likes to measure interactivity, and separate tweets into four tiers:

  • tier_4 — 2 or fewer likes
  • tier_3 — 3-9 likes
  • tier_2 — 10-15 likes
  • tier_1 — 16+ likes

To accomplish this, we can use a function called np.select(). We’ll give it two arguments: a list of our conditions, and a corresponding list of the values we’d like to assign to each row in our new column.

This means that the order matters: if the first condition in our conditions list is met, the first value in our values list will be assigned to our new column for that row. If the second condition is met, the second value will be assigned, et cetera.

Let’s take a look at how this looks in Python code:

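# the conditions to test, in order
conditions = [
    (df['likes_count'] <= 2),
    (df['likes_count'] > 2) & (df['likes_count'] <= 9),
    (df['likes_count'] > 9) & (df['likes_count'] <= 15),
    (df['likes_count'] > 15)
]

# the value to assign for each condition, in the same order
values = ['tier_4', 'tier_3', 'tier_2', 'tier_1']

# default='' is an addition here: without it np.select falls back to 0,
# which recent NumPy versions may refuse to combine with string values
df['tier'] = np.select(conditions, values, default='')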


Awesome! We’ve created another new column that categorizes each tweet based on our (admittedly somewhat arbitrary) tier ranking system.
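
To see why the order of the conditions matters, here is a minimal sketch with deliberately overlapping conditions. The like counts and the 'high'/'medium'/'low' labels are hypothetical and chosen only to show the behavior.

```python
import numpy as np

likes = np.array([1, 5, 12, 40])  # hypothetical like counts

# Overlapping conditions: 12 and 40 satisfy both, but the first match is used
conditions = [likes > 9, likes > 2]
values = ['high', 'medium']

# default is used for elements that match no condition at all
tiers = np.select(conditions, values, default='low')
print(tiers)
```

If the two conditions were swapped, every count above 2 would be labeled 'medium' and 'high' would never be assigned.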

Now, we can use this to answer more questions about our data set. For example: what percentage of tier 1 and tier 4 tweets have images?

df[(df['tier'] == 'tier_4')]['hasimage'].value_counts(normalize=True)
False    0.948784
True     0.051216
Name: hasimage, dtype: float64

df[(df['tier'] == 'tier_1')]['hasimage'].value_counts(normalize=True)
False    0.836842
True     0.163158
Name: hasimage, dtype: float64

Here, we can see that while images seem to help, they don’t seem to be necessary for success. More than 83% of Dataquest’s “tier 1” tweets (the tweets with 16+ likes) had no image attached.
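
The per-tier shares can also be viewed side by side. As one possible sketch (not the method used above), pd.crosstab() with normalize='index' gives the proportion of image and no-image tweets within each tier in a single table. The rows below are made up and do not reproduce the figures above.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'tier': ['tier_1', 'tier_1', 'tier_4', 'tier_4', 'tier_4', 'tier_4'],
    'hasimage': [True, False, False, False, False, True],
})

# normalize='index' turns each row (tier) into proportions that sum to 1
shares = pd.crosstab(df['tier'], df['hasimage'], normalize='index')
print(shares)
```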

While this is a very superficial analysis, we’ve accomplished our true goal here: adding columns to pandas DataFrames based on conditional statements about values in our existing columns.

Of course, this is a task that can be accomplished in a wide variety of ways. np.where() and np.select() are just two of many potential approaches. If you’d like to learn more of this sort of thing, check out Dataquest’s interactive Numpy and Pandas course, and the other courses in the Data Scientist in Python career path.
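
As an example of one such alternative, the sketch below builds the same four tiers with pd.cut(), which assigns labels from numeric bin edges instead of a list of conditions. The like counts are invented, and the lower edge of -1 is an assumption chosen so that tweets with 0 likes fall into the first bin.

```python
import pandas as pd

# Invented like counts that sit on and around the tier boundaries
df = pd.DataFrame({'likes_count': [0, 2, 3, 9, 10, 15, 16, 100]})

# Bins include their right edge by default: (-1, 2], (2, 9], (9, 15], (15, inf]
df['tier'] = pd.cut(
    df['likes_count'],
    bins=[-1, 2, 9, 15, float('inf')],
    labels=['tier_4', 'tier_3', 'tier_2', 'tier_1'],
)
print(df)
```

Note that pd.cut() returns a categorical column rather than plain strings, which may or may not suit later steps.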
