Data ingestion is (almost) a solved problem | by Gianluca Gindro | Dec, 2020

[ad_1]


A review of two out of box tools to solve a traditionally tedious problem

Gianluca Gindro
Photo by Heye Jensen on Unsplash

Ask anyone who has been involved in a data related job over the past 10–15 years what is the most boring task they would rather avoid, and chances are many would answer ‘data ingestion’.

Everyone would like to have data ready to be analysed. Sure, data cleaning and pre-processing is also another good candidate to avoid, but there is something uniquely annoying in the feeling of spending time purely on replicating data, without supposedly adding any value to it. In a sense at least data cleaning gives that sense of purpose of ‘putting your house in order’.

Yet, data ingestion is a fundamental task, and, until some time ago, you had to work on a combination of API calls, CSV get requests, web-hooks, incremental loads rules, streaming services and ODBC connections just to replicate external data into your data warehouse or local system.

I’ve faced lots of these challenges myself over the past years, which is why I was particularly excited when in my last project I had the chance to explore in detail how data ingestion tools have transformed this space.

There are two data ingestion tools that I am going to review here:

When a data ingestion tool is suitable?

The ideal scenario of using a data ingestion tool is when:

  • You have a data warehouse and want to enrich it with new data sources
  • You do not need real-time synchronisation
  • Your sources are within a relatively large set of standard pre-defined databases or SaaS tools (lists here and here)
  • Your source data is already structured in a relational format or in simple JSON structures
  • Your data pipeline requirements fit an ELT approach (Extract-Load-Transform), rather than an ETL one

The ELT requirement is a crucial one here. While, I would argue, ELT vs ETL is often down to individual preference, there are scenarios where it is preferable to transform data before ingestion, for example if lots of data cleaning, transformation or filtering is required before ingestion, to avoid un-necessary complicated processing after the data has already been loaded into the DB.

When this approach is NOT suitable?

While the above use cases cover a vast range of use cases, as exclusion of points above, they won’t cover cases where:

  • You need to have a real-time replication
  • You have unstructured or heavily nested JSON source data
  • You have unsupported technologies as data source systems or data warehouse
  • You want to transform your source data before ingestion
Photo by Roman Kraft on Unsplash

Now let’s compare the two tools on a number of key dimensions.

Replication methods

An obvious advantage of using a data ingestion tool is that it will handle incremental uploads for you. If you have small enough data sources that can be replicated in full every day, of course you don’t need to worry too much about this point.

Generally speaking, both Fivetran and Stitch Data perform log-based replications, that is, they access source systems change-logs to consume the incremental changes in the data, and replicate them downstream. This is something that happens under the hood, so you don’t have to worry about that, apart from making sure your source systems have change-logs enabled.

The other point to consider is whether your data has a primary key. You obviously want to have a primary key, because this is the only way the tools can recognize that an existing record has been updated. If not, they might insert a new copy of the record each time it is modified, which is obviously an undesired behaviour.

Fivetran here is a bit more sophisticated in that it generates a synthetic primary key even when source tables have none.

Stitch Data on the other hand resorts to an append-only replication method when no primary key is present at source, as it explains in its documentation.

From my experience, in most practical cases you will deal with sources that have a primary key (and if you have control on the source databases, you need to make sure the source tables have one).

However, there are situations where this is not the case, a notable example I have encountered is the Stitch Data Mixpanel integration, which lacks a primary key. In this case, the same record can sometimes be inserted multiple times, which means you have to dedupe it at destination, for example by taking a unique combination of some fields (such as event_type, user_distinct_id, page_url and timestamp).

This is far from ideal, but should not generate too much trouble as long as you are careful with your queries to consume the data.

Handling JSON data sources

JSON data sources can in general be a pain to handle, but how much of a pain would depend on the actual data model and the tools used.

This scenario is commonly found if you have a NoSql data source such as MongoDB, but would also apply if you have a relational data source with some JSON fields, such as in a Postgres database.

The most critical case here is if you have source tables (ie collections), that are truly unstructured, that is, they contain a high number of different fields each. This could be the case, for example, if each object represents a book and its fields represent the chapter names. Chances are that the chapter names would be different for each book. Cases like this would not match well the JSON-to-relational mapping, and, both with Fivetran and Stitch Data, you would find your destination tables having as many columns as there are unique fields in the source collection. This is not only impractical, but it would also throw an error above a certain threshold, as for example Postgres has a limit of 1600 columns per table.

If instead you have a limited number of unique fields in each collection (say, up to a few hundreds at the very most), the behaviour would change depending on the tools:

  • Fivetran will map each JSON field at source to a JSON field at destination, as long as the destination DB handles dictionary-type structures
  • Stitch Data will do the same, but only for a limited set of destinations, such as Snowflake, but for example not with Postgres. This is quite annoying as Postgres natively handles JSON fields. More details on this are explained here, and you also need to bear in mind that this would also affect your row count for pricing (more on that later).

In summary, there is not any magic that a tool can do if you have truly unstructured source data, however, if you do have some structure, under certain conditions these tools will allow to preserve the dictionary-like format in the destination database.

Tables and field selection

Both Fivetran and Stitch Data allow to only import certain tables from a source schema, but Stitch Data here also offers the advantage of being able to selectively import only chosen fields, which can be a useful advantage if you have tables with lots of fields, and particularly if some of these fields contain nested JSON that could complicate the target structure.

Row count and charging methods

On this point, Fivetran and Stitch Data have two different approaches:

  • Fivetran charges on Unique monthly rows updated
  • Stitch Data on overall monthly rows updated (that is, the same row can be counted multiple times)

How does this matter? If your source stream is mainly a sequence of new records appended, this won’t make it much difference, but if your source systems tend to frequently update the same records over and over, Fivetran here can offer an advantage of more clear and predictable pricing.

Another point to be very careful here is about historical backfills and table schema changes. If you have for example a large table and you add one additional column to it and populate it for all the history, both systems will spot these changes and will update the full records history, even if you might not be genuinely interested in that additional field. This could generate quite high unexpected charges for both tools.

Pricing

Based on the charging methods described above, it’s not exactly possible to compare like for like the two tools, however, in a common scenario where the vast majority of source changes are actually new inserts, Stitch Data proves to be significantly cheaper than Fivetran.

As en example, as of today:

  • Stitch Data has a pricing plan that starts at $100/month for 5M modified rows a month, with a month-by-month commitment and 10 integrations sources. The next level is $180 for 10M rows.
  • Fivetran pricing is a bit more obscure, and you need to navigate through their documentation pages a bit to properly understand it. In summary, it practically starts at $1000–1500 a month for 1M unique rows updated a month, but it then increases more gently for higher consumption volumes.

Replication frequency

The replication frequency that you can set-up will depend both on the tool used and the source system. Fivetran claims a lower limit of 5 minutes, Stitch Data for most sources has a lower limit of 30 minutes, however some sources allow a lower value, but you need to be careful in general not to overlap between two subsequent runs.

Your replication frequency will also affect the number of rows modified, which would particularly affect Stitch Data pricing.

Transformation capabilities

Let’s remember we are talking about ingestion, not data transformation, so it’s not surprising that neither of these tools have a focus on data transformation capabilities.

However, Fivetran also offers integration with dbt (data build tool) and Stitch Data is part of the Talend family, traditionally a big player in the ETL domain.

However, there are no obvious advantages in using their sister data transformation tools, so that part can be genuinely decoupled from ingestion.

Outcome

Both Fivetran and Stitch Data are great tools for data ingestion. Which one fits the bill would depend on the actual use cases and needs.

Fivetran has in some way more complex replication capabilities, but this also comes with an increased cost.

Stitch Data is a great entry-level tool, which can also handle replication at scale for a number of standard scenarios.

Fivetran pricing is quite steep at the beginning, but can become more justifiable if you have very high ingestion volumes (eg billions unique rows a month), and can also offer more predictability due to its unique rows pricing model.

Read More …

[ad_2]


Write a comment