The most commonly used and basic type of SQL (stands for Structured Query Language) is a SELECT statement.
SELECT
With the SELECT statement, we can retrieve data from a table. There are several ways of how to retrieve the data and I will try to highlight not only the best practices but also the worst practices with a demonstration why this particular approach is wrong and the other one is correct.
SELECT statement consists of 7even sections (2 are mandatory, 5 are optional)
MANDATORY
- SELECT
- FROM
OPTIONAL
- JOIN -> left, right, full / outer, inner, cross
- WHERE
- GROUP BY
- HAVING
- ORDER BY
As you will go through these notes, I will explain each and every section in detail.
Let’s start with the mandatory section. To extract data from a table, the simplest way how to do it is to use a generic query as below:
SELECT * | {[DISTINCT] column | expression [alias], ...} FROM table
* – asterisk represents all column in a given table
DISTINCT
- What does it do?
- DISTINCT clause allows us to get only unique set of selected columns/expressions
- When to use it?
- use distinct only if you want to see unique values for some manual checking
- you can achieve the same result by using count() function which is recommended over using DISTINCT because the cost is the same for both queries but with count() you get additional information
- When to NOT use it?
- Please avoid using DISTINCT clause in your queries for reporting or even for data loading
- It has a serious (negative) impact on the performance of your scripts not to mention that it is a sign that developers/analysts do not know what they are doing. Instead of analyzing their script (most often an incorrect JOIN condition) to eliminate duplicates, they simply add DISTINCT at the beginning of their query.
- Tip: To debug the query (identify the culprit / JOIN) – I always start with the base_table (the one that is after FROM clause) and run count over the records. Then I gradually add more and more joins until I find the one causing duplicates.
- Why to NOT use it?
- As it was said, it has a negative impact on the performance of your script because Oracle has to select all values first, sort them and remove all duplicates
- You can picture it as if someone will give you millions of papers with different words and you need to extract only those that are unique … dreadful, right? How about if you get already unique words? 🙂
- How to avoid using it?
- Always debug your query and find the root cause which causes the duplicates. In 99% cases, it is an incorrect JOIN condition which you can fix easily in the code. The rest could be data duplicity in the source tables which needs a little more effort to cleanse (identify how the table is populated, check the code that is putting data into the data, fix it and reload the table or cleanse data inside).
- GROUP BY vs DISTINCT
- DISTINCT and GROUP BY have the same effect on the result. Both will eliminate duplicates and the cost for both is the same.
-- DISTINCT vs GROUP BY demonstration set serveroutput on set linesize 255 set echo on set autotrace traceonly explain SELECT DISTINCT owner ,object_name ,object_type FROM all_objects WHERE 1 = 1 AND owner = 'OWNER_DWH' ; SELECT owner ,object_name ,object_type FROM all_objects WHERE 1 = 1 AND owner = 'OWNER_DWH' GROUP BY owner ,object_name ,object_type ; set autotrace off -- manual v$sql check SELECT sq.sql_id , sq.sql_text , sq.command_type ct , sq.executions exs , sq.rows_processed rws -- , sq.buffer_gets bg , sq.disk_reads dr -- , sq.elapsed_time et , sq.cpu_time cput , sq.user_io_wait_time uiot -- FROM v$sql sq -- WHERE 1=1 AND sql_text LIKE 'SELECT owner%' OR sql_text LIKE 'SELECT DISTINCT owner%'
FROM
In FROM clause you have to list at least one table: this table is usually called “base table” (sometimes referred to as “driving table“). In case you want to list more than one table you have to separate them by a comma and define their join conditions in WHERE clause (see below). This is an old syntax how to join queries and I highly recommend to use the new one; using JOIN statements.
JOIN
There are 5 types of how you can join tables together. We use JOIN statement to connect multiple objects and to be able to retrieve data from all of them.
(INNER) JOIN
It is up to you whether you decide to you the full syntax INNER JOIN or just JOIN. I prefer (and therefore recommend) using JOIN only; it is faster and more readable 🙂 With INNER JOIN only rows where the join condition is met are returned.
-- old syntax SELECT * FROM table_a t_a, table_b t_b WHERE 1=1 AND t_a.id = t_b.id; -- new syntax (preferred) SELECT * FROM table_a t_a JOIN table_b t_b ON t_a.id = t_b.id WHERE 1=1;
Example:
Table A:
ID | VALUE |
1 | A |
2 | B |
Table B:
ID | VALUE |
2 | X |
3 | Y |
Result:
TABLE_A.id | TABLE_A.value | TABLE_B.id | TABLE_B.value |
2 | B | 2 | X |
LEFT (OUTER) JOIN
The same applies for OUTER joins (right, left, full) – whether you decide to use the full syntax LEFT OUTER JOIN or just LEFT JOIN; both are the same. Again, I stick with the latter for it is shorter version 🙂 Left outer join will return all data with from the left ( a.k.a driving/base) table (usually the one mentioned sooner in the query) and only rows where join condition was met from the joined table. For the non-matched rows, the query will return NULL values.
-- old syntax SELECT * FROM table_a t_a, table_b t_b WHERE 1=1 AND t_a.id = t_b.id(+); -- new syntax (preferred) SELECT * FROM table_a t_a LEFT JOIN table_b t_b ON t_a.id = t_b.id WHERE 1=1;
Example:
Table A:
ID | VALUE |
1 | A |
2 | B |
Table B:
ID | VALUE |
2 | X |
3 | Y |
Result:
TABLE_A.id | TABLE_A.value | TABLE_B.id | TABLE_B.value |
1 | A | NULL | NULL |
2 | B | 2 | X |
RIGHT (OUTER) JOIN
This join does the exact opposite of LEFT JOIN. Instead of returning matching rows from the driving table, it will return data from the joined table. I don’t like this type of join and prefer to determine this by a proper ordering of joined tables – thus I use LEFT or INNER join only (very rarely FULL outer join).
-- old syntax SELECT * FROM table_a t_a, table_b t_b WHERE 1=1 AND t_a.id(+) = t_b.id; -- new syntax SELECT * FROM table_a t_a RIGHT JOIN table_b t_b ON t_a.id = t_b.id WHERE 1=1; -- could be written as LEFT join as well by swapping the tables; this will return the exact same data (preferred) SELECT * FROM table_b t_b LEFT JOIN table_a t_a ON t_a.id = t_b.id WHERE 1=1;
Example:
Table A:
ID | VALUE |
1 | A |
2 | B |
Table B:
ID | VALUE |
2 | X |
3 | Y |
Result:
TABLE_A.id | TABLE_A.value | TABLE_B.id | TABLE_B.value |
2 | B | 2 | X |
NULL | NULL | 3 | Y |
FULL (OUTER) JOIN
FULL join will return all data from all tables. Those not matching (from either side) will have NULL values.
-- old syntax -- FULL OUTER JOIN cannot be written in the old way -- new syntax (preferred) SELECT * FROM table_a t_a FULL JOIN table_b t_b ON t_a.id = t_b.id WHERE 1=1;
Example:
Table A:
ID | VALUE |
1 | A |
2 | B |
Table B:
ID | VALUE |
2 | X |
3 | Y |
Result:
TABLE_A.id | TABLE_A.value | TABLE_B.id | TABLE_B.value |
1 | A | NULL | NULL |
2 | B | 2 | X |
NULL | NULL | 3 | Y |
CROSS JOIN
CROSS join produces a cartesian product (creates a combination of all rows from the first table with all rows in the second table) and unlike the other joins it does not require a join condition to be specified. This type of join creates a high load for CPU when used with big tables! Be very careful and try to avoid this join if not necessary.
-- old syntax -- CROSS JOIN cannot be written in the old way -- new syntax (preferred) SELECT * FROM table_a t_a CROSS JOIN table_b t_b WHERE 1=1;
Example:
Table A:
ID | VALUE |
1 | A |
2 | B |
Table B:
ID | VALUE |
2 | X |
3 | Y |
Result:
TABLE_A.id | TABLE_A.value | TABLE_B.id | TABLE_B.value |
1 | A | 2 | X |
1 | A | 3 | Y |
2 | B | 2 | X |
2 | B | 3 | Y |
WHERE
We use WHERE clause to restrict data (sometimes referred to as “SELECTION”; see below). In where clause you can use one or multiple conditions from one or all tables stated in the FROM or/and JOIN clauses. You can combine multiple operators to get the exact result you need. Always keep the proper formatting and indentation of your code to increase the readability. Don’t forget to use parenthesis to enforce the correct precedence; in other words to group multiple conditions together. You cannot filter aggregates in WHERE clause but you can do that in HAVING clause – see below.
SELECT * FROM employee emp -- LEFT JOIN department dpt -- left outer join ON emp.id_department = dpt.id_department -- LEFT JOIN position pos -- left outer join ON pos.id_position = emp.id_position -- JOIN contract_status cnt_stat -- inner join ON cnt_stat.id_contract_status = emp.id_contract_status AND cnt_stat.name_contract_status = 'Active' -- only active employees (with an active contract) -- WHERE 1=1 AND ( dpt.name_department IN ('IT','Sales','Finance') -- department must be in the one from the list OR OR lower(pos.name_position) LIKE '%analyst%' -- or any other position name with a keyword analyst (case insensitive) ) AND emp.name_last = 'Smith' -- last name must be Smith AND emp.name_first NOT LIKE 'J%' -- first name does not start with a capital J
GROUP BY
This clause allows you to actually create groups of dimensions for data aggregation. You don’t have to necessarily use an aggregation function you can use GROUP BY as well for getting only unique records (or set of records, or rows if you will) from a database.
You can find out more details about GROUP BY and data aggregation in Data aggregation using GROUP BY.
HAVING
HAVING clause comes always together with GROUP BY clause and has a very similar logic to WHERE clause but with one difference; it is used on aggregates only! Because you cannot use aggregates in WHERE clause there is a special “where clause” named HAVING to filter them 🙂 Easy, isn’t it?
Example:
You want to find out how many product types have been ordered more than once only. You cannot filter this count(*) function inside your WHERE clause and thus you have to place it into HAVING clause (which will do the same for you).
SELECT prod_type.type_product , count(*) as ordered_nbr FROM order ord -- LEFT JOIN product prod ON prod.id_product = ord.id_product -- LEFT JOIN product_type prod_type ON prod_type.id_type_product = prod.id_type_product -- WHERE 1=1 -- GROUP BY name_product -- HAVING count(*) > 1 -- show only products with more than one record in the ORDER table (= having more than one order)
You can find out more details about HAVING clause also in Data aggregation using GROUP BY.
ORDER BY
Since the nature of each table in a database is that it is unsorted and unorganized, Oracle provides users with a way how to do it; ORDER BY. With this clause, you can easily sort selected columns (only those selected!!) by their names, functions or aliases. As a matter of fact, ORDER BY clause is the only part where you can use aliases as a reference.
On the top of that, you are also allowed to use column position id as a reference too. This might be a little tricky, if you change the projection (see below) columns the sort will automatically change as well. This means if you will sort by 1st column (referred as ORDER BY 1) – no matter what column is selected first it will always sort by it 🙂
SELECT name_product , type_product FROM product ORDER BY 1 -- will sort by NAME SELECT type_product , namee_product FROM product ORDER BY 1 -- will sort by TYPE
You can specify for each column in ORDER BY clause whether you want to have an ascending (ASC) or descending (DESC) order; if not specified ASC is used as a default order. There is another one option that allows you to handle NULL values in a specified column. You can put them at the beginning (NULLS FIRST) or at the end (NULLS LAST) of your selection, depends on your needs.
SELECT name_first , name_last , id_department as id_dpt , id_position as id_pos -- fourth column .. referred as 4 , date_birth , date_hired , date_separated FROM employee emp -- ORDER BY id_dpt DESC , name_first ASC , name_last DESC , 4 -- 4th is ID_POS ... if not specified ASC or DESC.. ASC is selected by default , date_hired DESC NULLS LAST -- NULL values will be placed at the end , date_separated ASC NULLS FIRT -- NULL values will be placed at the beginning
You can also sort aggregation functions
SELECT id_client , count(*) as orders_total FROM order GROUP BY id_client ORDER BY orders_total DESC -- order by number of total orders per client in a descending order
PROJECTION vs SELECTION
You can find these terms very often. It might be confusing and tricky in the beginning but don’t fall into it 🙂 I will help you with that. Simply said, the projection is just a set of columns used in a SELECT part and SELECTION a is restriction of rows in WHERE / HAVING clause.
Considering this query:
SELECT name_first , name_last FROM employee WHERE 1=1 AND id_department = 1;
Projection = name_first, name_last
Selection = id_department = 1
Rules of Thumb (RoT) for writing a SELECT query
- Indentation / Structuring
- Readability
- “ bigger ain‘t better “
- Good practice
- Schema name – always use a schema name .. it is handy when you need to run that query from a different schema ( = under a different user)
- Table aliases – always use descriptive but short table aliases. I have seen many times that 80 – 90 % people (especially juniors) use aliases such as a,b,c,d, … or ever worse a, a1, a2, … Image a big query and you are totally lost with this naming convention. With correctly chosen aliases it will be a piece of cake to tell from which object is that particular column / function extracted.
- Comments – these helps to recall you (or your colleagues) why did you use that particular conditions or joins or why did you chose the logic you implemented in your code. I recommend using comments as much as possible to make every single piece of code clear.
- Change log – it is a very good practice to use change long in a header of your procedures, functions or packages. This will help you to track what, when and by whom was changed in the script – especially if you do not use any version software like GIT, SVN, etc. I will describe this in a PL/SQL section.
Quiz: What operation returns all data from all tables ?
Answer: