SELECT statement (decomposed)

The most commonly used and basic type of SQL (stands for Structured Query Language) is a SELECT statement.

SELECT

With the SELECT statement, we can retrieve data from a table. There are several ways of how to retrieve the data and I will try to highlight not only the best practices but also the worst practices with a demonstration why this particular approach is wrong and the other one is correct.

SELECT statement consists of 7even sections (2 are mandatory, 5 are optional)

MANDATORY

  • SELECT
  • FROM

OPTIONAL

  • JOIN -> left, right, full / outer, inner, cross
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY

As you will go through these notes, I will explain each and every section in detail.

Let’s start with the mandatory section. To extract data from a table, the simplest way how to do it is to use a generic query as below:

SELECT * | {[DISTINCT] column | expression [alias], ...}
FROM table

* – asterisk represents all column in a given table

DISTINCT

  • What does it do?
    • DISTINCT clause allows us to get only unique set of selected columns/expressions
  • When to use it?
    • use distinct only if you want to see unique values for some manual checking
    • you can achieve the same result by using count() function which is recommended over using DISTINCT because the cost is the same for both queries but with count() you get additional information
  • When to NOT use it?
    • Please avoid using DISTINCT clause in your queries for reporting or even for data loading
    • It has a serious (negative) impact on the performance of your scripts not to mention that it is a sign that developers/analysts do not know what they are doing. Instead of analyzing their script (most often an incorrect JOIN condition) to eliminate duplicates, they simply add DISTINCT at the beginning of their query.
    • Tip: To debug the query (identify the culprit / JOIN) – I always start with the base_table (the one that is after FROM clause) and run count over the records. Then I gradually add more and more joins until I find the one causing duplicates.
  • Why to NOT use it?
    • As it was said, it has a negative impact on the performance of your script because Oracle has to select all values first, sort them and remove all duplicates
    • You can picture it as if someone will give you millions of papers with different words and you need to extract only those that are unique … dreadful, right? How about if you get already unique words? 🙂
  • How to avoid using it?
    • Always debug your query and find the root cause which causes the duplicates. In 99% cases, it is an incorrect JOIN condition which you can fix easily in the code. The rest could be data duplicity in the source tables which needs a little more effort to cleanse (identify how the table is populated, check the code that is putting data into the data, fix it and reload the table or cleanse data inside).
  • GROUP BY vs DISTINCT
    • DISTINCT and GROUP BY have the same effect on the result. Both will eliminate duplicates and the cost for both is the same.
-- DISTINCT vs GROUP BY demonstration

set serveroutput on
set linesize 255 
set echo on 

set autotrace traceonly explain

    SELECT DISTINCT owner
                   ,object_name
                   ,object_type
    FROM all_objects
    WHERE 1 = 1
    AND owner = 'OWNER_DWH'
    ;
    
    SELECT owner
          ,object_name
          ,object_type
    FROM all_objects
    WHERE 1 = 1
    AND owner = 'OWNER_DWH'
    GROUP BY owner
            ,object_name
            ,object_type
    ;

set autotrace off 

-- manual v$sql check

SELECT sq.sql_id
, sq.sql_text
, sq.command_type ct
, sq.executions exs
, sq.rows_processed rws
--
, sq.buffer_gets bg
, sq.disk_reads dr
--              
, sq.elapsed_time et
, sq.cpu_time cput
, sq.user_io_wait_time uiot 
--
FROM v$sql sq
--
WHERE 1=1
AND sql_text LIKE 'SELECT owner%'
OR sql_text LIKE 'SELECT DISTINCT owner%'

FROM

In FROM clause you have to list at least one table: this table is usually called “base table” (sometimes referred to as “driving table“). In case you want to list more than one table you have to separate them by a comma and define their join conditions in WHERE clause (see below). This is an old syntax how to join queries and I highly recommend to use the new one; using JOIN statements.

JOIN

There are 5 types of how you can join tables together. We use JOIN statement to connect multiple objects and to be able to retrieve data from all of them.

(INNER) JOIN

It is up to you whether you decide to you the full syntax INNER JOIN or just JOIN. I prefer (and therefore recommend) using JOIN only; it is faster and more readable 🙂 With INNER JOIN only rows where the join condition is met are returned.

-- old syntax
SELECT *
FROM table_a t_a, table_b t_b
WHERE 1=1
AND t_a.id = t_b.id;

-- new syntax (preferred)
SELECT *
FROM table_a t_a
JOIN table_b t_b
ON t_a.id = t_b.id
WHERE 1=1;

Example:

Table A:

ID VALUE
1 A
2 B

Table B:

ID VALUE
2 X
3 Y

Result:

TABLE_A.id TABLE_A.value TABLE_B.id TABLE_B.value
2 B 2 X
LEFT (OUTER) JOIN

The same applies for OUTER joins (right, left, full) – whether you decide to use the full syntax LEFT OUTER JOIN or just LEFT JOIN; both are the same. Again, I stick with the latter for it is shorter version 🙂 Left outer join will return all data with from the left ( a.k.a driving/base) table (usually the one mentioned sooner in the query) and only rows where join condition was met from the joined table. For the non-matched rows, the query will return NULL values.

-- old syntax
SELECT *
FROM table_a t_a, table_b t_b
WHERE 1=1
AND t_a.id = t_b.id(+);

-- new syntax (preferred)
SELECT *
FROM table_a t_a
LEFT JOIN table_b t_b
ON t_a.id = t_b.id
WHERE 1=1;

Example:

Table A:

ID VALUE
1 A
2 B

Table B:

ID VALUE
2 X
3 Y

Result:

TABLE_A.id TABLE_A.value TABLE_B.id TABLE_B.value
1 A NULL NULL
2 B 2 X
RIGHT (OUTER) JOIN

This join does the exact opposite of LEFT JOIN. Instead of returning matching rows from the driving table, it will return data from the joined table. I don’t like this type of join and prefer to determine this by a proper ordering of joined tables – thus I use LEFT or INNER join only (very rarely FULL outer join).

-- old syntax
SELECT *
FROM table_a t_a, table_b t_b
WHERE 1=1
AND t_a.id(+) = t_b.id;

-- new syntax 
SELECT *
FROM table_a t_a
RIGHT JOIN table_b t_b
ON t_a.id = t_b.id
WHERE 1=1;

-- could be written as LEFT join as well by swapping the tables; this will return the exact same data (preferred)
SELECT * 
FROM table_b t_b 
LEFT JOIN table_a t_a 
ON t_a.id = t_b.id 
WHERE 1=1;

Example:

Table A:

ID VALUE
1 A
2 B

Table B:

ID VALUE
2 X
3 Y

Result:

TABLE_A.id TABLE_A.value TABLE_B.id TABLE_B.value
2 B 2 X
NULL NULL 3 Y
FULL (OUTER) JOIN

FULL join will return all data from all tables. Those not matching (from either side) will have NULL values.

-- old syntax
-- FULL OUTER JOIN cannot be written in the old way


-- new syntax (preferred)
SELECT *
FROM table_a t_a
FULL JOIN table_b t_b
ON t_a.id = t_b.id
WHERE 1=1;

Example:

Table A:

ID VALUE
1 A
2 B

Table B:

ID VALUE
2 X
3 Y

Result:

TABLE_A.id TABLE_A.value TABLE_B.id TABLE_B.value
1 A NULL NULL
2 B 2 X
NULL NULL 3 Y
CROSS JOIN

CROSS join produces a cartesian product (creates a combination of all rows from the first table with all rows in the second table) and unlike the other joins it does not require a join condition to be specified. This type of join creates a high load for CPU when used with big tables! Be very careful and try to avoid this join if not necessary.

-- old syntax 
-- CROSS JOIN cannot be written in the old way

-- new syntax (preferred)
SELECT *
FROM table_a t_a
CROSS JOIN table_b t_b
WHERE 1=1;

Example:

Table A:

ID VALUE
1 A
2 B

Table B:

ID VALUE
2 X
3 Y

Result:

 
TABLE_A.id TABLE_A.value TABLE_B.id TABLE_B.value
1 A 2 X
1 A 3 Y
2 B 2 X
2 B 3 Y

WHERE

We use WHERE clause to restrict data (sometimes referred to as “SELECTION”; see below). In where clause you can use one or multiple conditions from one or all tables stated in the FROM or/and JOIN clauses. You can combine multiple operators to get the exact result you need. Always keep the proper formatting and indentation of your code to increase the readability. Don’t forget to use parenthesis to enforce the correct precedence; in other words to group multiple conditions together. You cannot filter aggregates in WHERE clause but you can do that in HAVING clause – see below.

SELECT *
FROM employee emp
--
LEFT JOIN department dpt -- left outer join
ON emp.id_department = dpt.id_department
--
LEFT JOIN position pos -- left outer join
ON pos.id_position = emp.id_position
--
JOIN contract_status cnt_stat -- inner join
ON cnt_stat.id_contract_status = emp.id_contract_status
AND cnt_stat.name_contract_status = 'Active'         -- only active employees (with an active contract)
--
WHERE 1=1
AND (
     dpt.name_department IN ('IT','Sales','Finance')   -- department must be in the one from the list OR
      OR
     lower(pos.name_position) LIKE '%analyst%'         -- or any other position name with a keyword analyst (case insensitive)
    )
AND emp.name_last = 'Smith'      -- last name must be Smith
AND emp.name_first NOT LIKE 'J%' -- first name does not start with a capital J

GROUP BY

This clause allows you to actually create groups of dimensions for data aggregation. You don’t have to necessarily use an aggregation function you can use GROUP BY as well for getting only unique records (or set of records, or rows if you will) from a database.

You can find out more details about GROUP BY and data aggregation in Data aggregation using GROUP BY.

HAVING

HAVING clause comes always together with GROUP BY clause and has a very similar logic to WHERE clause but with one difference; it is used on aggregates only! Because you cannot use aggregates in WHERE clause there is a special “where clause” named HAVING to filter them 🙂 Easy, isn’t it?

Example:
You want to find out how many product types have been ordered more than once only. You cannot filter this count(*) function inside your WHERE clause and thus you have to place it into HAVING clause (which will do the same for you).

SELECT prod_type.type_product
, count(*) as ordered_nbr
FROM order ord
--
LEFT JOIN product prod
ON prod.id_product = ord.id_product
--
LEFT JOIN product_type prod_type
ON prod_type.id_type_product = prod.id_type_product
--
WHERE 1=1
--
GROUP BY name_product
--
HAVING count(*) > 1      -- show only products with more than one record in the ORDER table (= having more than one order)

You can find out more details about HAVING clause also in Data aggregation using GROUP BY.

ORDER BY

Since the nature of each table in a database is that it is unsorted and unorganized, Oracle provides users with a way how to do it; ORDER BY. With this clause, you can easily sort selected columns (only those selected!!) by their names, functions or aliases. As a matter of fact, ORDER BY clause is the only part where you can use aliases as a reference.

On the top of that, you are also allowed to use column position id as a reference too. This might be a little tricky, if you change the projection (see below) columns the sort will automatically change as well. This means if you will sort by 1st column (referred as ORDER BY 1) – no matter what column is selected first it will always sort by it 🙂

SELECT name_product
, type_product
FROM product
ORDER BY 1     -- will sort by NAME

SELECT type_product
, namee_product
FROM product
ORDER BY 1     -- will sort by TYPE

You can specify for each column in ORDER BY clause whether you want to have an ascending (ASC) or descending (DESC) order; if not specified ASC is used as a default order. There is another one option that allows you to handle NULL values in a specified column. You can put them at the beginning (NULLS FIRST) or at the end (NULLS LAST) of your selection, depends on your needs.

SELECT name_first
, name_last
, id_department as id_dpt
, id_position as id_pos     -- fourth column .. referred as 4
, date_birth
, date_hired
, date_separated
FROM employee emp
--
ORDER BY id_dpt DESC
, name_first ASC
, name_last DESC
, 4                               -- 4th is ID_POS ... if not specified ASC or DESC.. ASC is selected by default
, date_hired DESC NULLS LAST      -- NULL values will be placed at the end
, date_separated ASC NULLS FIRT   -- NULL values will be placed at the beginning

You can also sort aggregation functions

SELECT id_client
, count(*) as orders_total
FROM order
GROUP BY id_client
ORDER BY orders_total DESC -- order by number of total orders per client in a descending order

PROJECTION vs SELECTION

You can find these terms very often. It might be confusing and tricky in the beginning but don’t fall into it 🙂 I will help you with that. Simply said, the projection is just a set of columns used in a SELECT part and SELECTION a is restriction of rows in WHERE / HAVING clause.

Considering this query:

SELECT name_first
, name_last
FROM employee
WHERE 1=1
AND id_department = 1;

Projection = name_first, name_last
Selection = id_department = 1

Rules of Thumb (RoT) for writing a SELECT query

  • Indentation / Structuring
  • Readability
    • “ bigger ain‘t better “
  • Good practice
    • Schema name – always use a schema name .. it is handy when you need to run that query from a different schema ( = under a different user)
    • Table aliases – always use descriptive but short table aliases. I have seen many times that 80 – 90 % people (especially juniors) use aliases such as a,b,c,d, … or ever worse a, a1, a2, … Image a big query and you are totally lost with this naming convention. With correctly chosen aliases it will be a piece of cake to tell from which object is that particular column / function extracted.
    • Comments – these helps to recall you (or your colleagues) why did you use that particular conditions or joins or why did you chose the logic you implemented in your code. I recommend using comments as much as possible to make every single piece of code clear.
    • Change log – it is a very good practice to use change long in a header of your procedures, functions or packages. This will help you to track what, when and by whom was changed in the script – especially if you do not use any version software like GIT, SVN, etc. I will describe this in a PL/SQL section.

Quiz:  What operation returns all data from all tables ?

Answer: