Loading and Cleaning Data with R and the tidyverse


Messy datasets are in all places. If you wish to analyze information, it’s inevitable that you’ll want to wash information. On this tutorial, we’re going to check out how to do this utilizing R and a few nifty tidyverse instruments.

load clean and analyze brooklyn real estate data with the tidyverse

We’ll load, clear, and prep some Brooklyn actual property information for evaluation utilizing R and the tidyverse!

The tidyverse tools present highly effective strategies to diagnose and clear messy datasets in R. Whereas there’s way more we are able to do with the tidyverse, on this tutorial we’ll concentrate on studying the best way to:

  • Import comma-separated values (CSV) and Microsoft Excel flat information into R
  • Mix information frames
  • Clear up column names
  • And extra!

The tidyverse is a collection of R packages designed for working with information. The tidyverse packages share a standard design philosophy, grammar, and information constructions. Tidyverse packages “play nicely collectively”. The tidyverse lets you spend much less time cleansing information so to focus extra on analyzing, visualizing, and modeling information.

1. Traits of Clear Knowledge and Messy Knowledge

What precisely is clear information? Clear information is correct, full, and in a format that is able to analyze. Traits of unpolluted information embody information which can be:

  • Freed from duplicate rows/values
  • Error-free (e.g. freed from misspellings)
  • Related (e.g. freed from particular characters)
  • The suitable information sort for evaluation
  • Freed from outliers (or solely include outliers have been recognized/understood), and
  • Follows a “tidy information” construction

Widespread signs of messy information embody information that include:

  • Particular characters (e.g. commas in numeric values)
  • Numeric values saved as textual content/character information sorts
  • Duplicate rows
  • Misspellings
  • Inaccuracies
  • White area
  • Lacking information
  • Zeros as a substitute of null values

2. Motivation

On this weblog publish, we’ll work with 5 property-sales datasets which can be publicly out there on the New York City Department of Finance Rolling Sales Data web site. We encourage you to obtain the datasets and observe alongside! Every file incorporates one 12 months of actual property gross sales information for considered one of New York Metropolis’s 5 boroughs. We are going to work with the next Microsoft Excel information:

  • rollingsales_bronx.xls
  • rollingsales_brooklyn.xls
  • rollingsales_manhattan.xls
  • rollingsales_queens.xls
  • rollingsales_statenisland.xls

As we work via this weblog publish, think about that you’re serving to a good friend launch their home-inspection enterprise in New York Metropolis. You provide to assist them by analyzing the info to raised perceive the real-estate market. However you understand that earlier than you’ll be able to analyze the info in R, you’ll need to diagnose and clear it first. And earlier than you’ll be able to diagnose the info, you’ll need to load it into R!

3. Load Knowledge into R with readxl

Advantages of utilizing tidyverse instruments are sometimes evident within the data-loading course of. In lots of circumstances, the tidyverse package deal readxl will clear some information for you as Microsoft Excel information is loaded into R. In case you are working with CSV information, the tidyverse readr package deal perform read_csv() is the perform to make use of (we’ll cowl that later).

Let’s take a look at an instance. Right here’s how the Excel file for the Brooklyn borough seems: