Initial Load VS Incremental Load

There are two main strategies you can choose from; Incremental (sometimes referred to as delta) load and Initial (sometimes referred to as full) load. Both have their advantages and disadvantages. I will now discuss both views.

Initial/full load

This is mostly used, as the name suggests, for an initial load. It is used to load all data into an empty structure. I have seen it many times that users just stick to this type of load because it’s pretty comfortable for them.

Advantages

The advantages of this approach are that it is very easy to implement since you just truncate the whole table and reload the data again. You don’t have to bother about keys and whether there are some data not updated. Every time you reload the table, all data will be updated no matter what. In some cases, it might be a better approach than the delta load but mostly it is not because most of the data are left unchanged. Imagine a table with 5 years of data where only last month can change. You have to admit that reload the whole table every day is not a good idea.

Another positive about this is, that you don’t have to take care of the database design and keep it proper (like dtime_updated, dtime_inserted which are the most commonly used keys in the delta load). Sometimes you don’t have this luxury and you just have to improvise.

Disadvantages

As I said a while ago, reload all unchanged data is not only unnecessary but also very inconvenient (server-wise, thus performance-wise). If you need to change one record only, why would you drop and insert millions of records? It is unsustainable in the future due to a bigger and bigger load every day.

Another drawback is that you cannot keep any historical data here because you drop and reload them. Sometimes it’s wanted to keep track of all changes in a database. The initial load cannot do that!

Incremental/delta load

The incremental load is (should be) the most commonly used approach in terms of data load. Incremental load means that you only load new data into the target table and update those that were updated. This means that you work with only those data that are new or were updated after the last “sync”.

Advantages

The load is way faster compared to the initial load and consumes (significantly) fewer resources. This makes it predictable and the number of processed records will be more or less the same. You can retain historical data in the database since you only modify those that are modified (you can still preserve those old once) and insert new ones.

Disadvantages

There is a big disadvantage on a developer’s side because there is a big need to develop a complex logic and the design must have some features (such as dtime_inserted, dtime_updated to recognize which rows were updated and/or inserted during the monitored period). On the other hand, if you invest some time and develop a correct logic and design, it will pay off.