Everything starts with the design. With an incorrect design, you will not be able to do practically anything and everything will be nothing but a struggle. I will provide you with some tips here to make your life easier 😉
Define the Purpose of Your Database
Many neophytes do not realize how huge is the difference between OLTP and OLAP systems.
OLTP (OnLine Transactional Processing) is, as its name suggests, a transactional system. This system has to cater hundreds of thousands, millions, or sometimes even billions of transactions within a short period of time. Performance (speed) is the number one priority here; therefore the design has to be aligned; everything should be a matter of milliseconds only. Each transaction processes only a small amount of data (mostly INSERT, UPDATE, DELETE, and a few simple SELECTs over a single table). OLTP’s design should be highly normalized (3rd NF at least) and must have clearly defined relationships between objects. There are only data needed for production use; strictly not for reporting! All historical data (not needed by the application) should be archived in OLAP.
OLAP (OnLine Analytical Processing) is, as its name suggests, an analytical system; frequently referred to as Data Warehouse. This system has to cater fewer transactions but processes an enormous amount of data. Performance is not really the key here; it doesn’t matter whether users will wait for their result 1 or 5 mins (which is totally unacceptable in OLTP’s case). Queries in OLAP are usually complex involving multiple tables with sophisticated conditions. It’s not rare to see de-normalized tables in OLAP which help users to simplify their queries and avoid unnecessary joins in their queries. You can see partitions quite often here as well, for we are talking about terabytes of data to analyze trends, make decisions, monitor business, etc.
Think Ahead – Scalability/Flexibility
When designing any database, you have to make it as scalable as possible. Do not design it to the current needs or requirements only. Always try to anticipate what could business need in the future and how it would impact your design.
Would you have to completely redo the design? Would you have to redesign some parts of it only? Or would you just add some more attributes or objects and you will be ready to roll?
Some might find this funny, but I saw several times where a table was structured more in a columnar way rather than row-oriented. Believe me, it’s a nightmare for everyone. Every single change required changes in all related applications and processes and that’s definitely not the way to go 🙁 So storing your products in columns is not really a good idea 🙂
Try to be as abstract as possible and keep it open for other possible scenarios which might come in the future.
Organize Your Data
Always start with paper. Try to create “areas” (dimensions) in your design where you will “group” objects together. Create entities and their attributes and think about the relationship between all entities. Avoid duplications and redundancy; that’s your biggest foe.
“Do NOT overdo it!”
The business will generally tell you they need everything. That’s, of course, not technically possible. Actually it is, but not worth the cost 🙂 Your task is to challenge them and come up with a suitable design that will track and store only data vital for business decisions and operational purposes. You don’t probably need to track every single change of “profile.description” during the day. You might wanna see only the last one during the day.
Storing everything might have a significant impact on your performance and of course resource costs. The ultimate question is: “Is this worth the information value?”.
Normalization vs Denormalization
As I mentioned above, there are cases where denormalization is actually welcome 🙂
It’s crucial to design this properly and I can tell it’s one of the most disregarded activities. Neglecting this can lead to severe data inconsistencies. On the other hand, properly designing not null constraints, primary and foreign keys, their relationship, default values, unique constraints, and so on, will help you maintain the quality over your database. It takes time, of course, but pays off a thousand times.
Timestamps and other “stamps”
The database is not just a bag full of data, because it’s not about having information but being able to use it. Without proper “stamps” you won’t be able to tell what is happening in your application (=database). Some of the key stamps are:
- when the information was created/inserted
- when the information was updated
- whether the information is still valid/active
- who inserted the information
- who updated the information
Without these, you won’t be able to tell how old that information is and what was the evolution of this piece of data (changes over time). Moreover, it will be hard to do any loads to other systems. Read more about Initial load vs Incremental Load.
There are several ways to archive data in order to see the whole picture. You want to see what statuses the order went through, how long did each of them take, when did it start, when did it end, and so on and so forth. It would clearly not be sufficient if you had only one status (the actual one).
In Oracle, we call them SCD (Slowly Changing Dimensions). I am not going to describe them in detail now, check the official documentation if you are interested. In short, there are three basic SCDs:
- SCD 1 – overwriting ( might sound silly 🙂 but is very useful in some cases )
- SCD 2 – new row for each change (keeping all the previous values)
- SCD 3 – new column for each change (keeping in another column previous value to track the changes)
Each company has its own 🙂 There is nothing wrong about that as long as you stick to it in all cases. All objects should have its own unique identifier to clearly tell what it is at a first glance.
When speaking about naming conventions, people usually think about tables and views. But it goes way further than that. Don’t forget the rest of the objects:
- Keys (all – primary, foreign, …)
- Stored procedures
- Scheduler Jobs
Here are some rules you should follow:
- all objects should have clear and self-explanatory names
- trigger1, table_data, scheduled_job_sales – these make no sense at all
- only letters, numbers, and underscore signs should be allowed
- all names should be in UPPERCASE and contain no space
- the first character must be a letter
- keep the names meaningful and readable at first glance but do not use too long names
- if you need to describe a comparison use abbreviation instead
|lower than or equal to||<=||lte||avg_age_lte_50|
|greater than or equal to||>=||gte||avg_age_gte_50|
- all names should have a singular form
- this might be a bit controversial .. but that’s what I recommend
- always use meaningful table aliases in your queries
- I know I might sound geeky, but seeing aliases: a,b,c,d,…. drives me nuts
- always prepend schema names in front of your table names – always!
- all column names should be consistent across all objects
- having product.id and order.id_product or order.product_id makes no sense
- agree on one option and stick to it (I personally prefer “id_product” esp with the leading id_ prefix .. because when you are looking for all IDs in the database, you can only filter all columns LIKE ‘ID%’
- same applies to other column names
Here is a little example:
|Schema/User||apl_(.*) - for application accounts
usr_(.*) - for user accounts
|Role||role_(.*)_rw - read write
role_(.*)_ro - read only
role_(.*)_own - owner
Description / Comments
To comment or not to comment – eternal question 🙂 I believe that reasonable and meaningful commenting is fine and brings a certain value. Sometimes you cannot make the column name or table name clear enough so some tips and hints might be very useful.
Data Insulation / Protection
One of the best practices you can do for data protection, integrity, and assurance for your applications or users to have a seamless experience in the database – is the insulation, where only views are accessible to users.
- customer_dct – table name (not accessible for users)
- customer_dcv – corresponding view name (accessible for users).
--example CREATE VIEW customer_dcv AS SELECT * FROM customer_dct;
That means, clients/users should not be accessing tables but views only. This has several advantages such as
- when you need to change the data source, you will only change the table and remap the view, without impacting the users, applications, and processes
- you can mask/hide some data on the views if you want to protect them
And this is totally awesome – I highly recommend this approach 🙂 With this, you can completely control the access and data flow out of your database.
This article barely describes all the necessary features and things to do, it was rather meant to highlight some critical parts to consider.