---
layout: handbook-page-toc
title: "SQL Style Guide"
description: "GitLab Data Team Handbook"
---
## On this page
{:.no_toc .hidden-md .hidden-lg}
- TOC
{:toc .toc-list-icons .hidden-md .hidden-lg}
----
## SQL Style Guide
**Since we don't have a linter, it is *our collective responsibility* to enforce this Style Guide.**
#### Field Naming and Reference Conventions
* Field names should all be lowercased.
* An `id`, `name`, or generally ambiguous value such as `type` should always be prefixed by what it is identifying or naming
```sql
-- Good
SELECT
id AS account_id,
name AS account_name,
type AS account_type,
...
-- Bad
SELECT
id,
name,
type,
...
```
* When joining to any data from a different source, a field should be prefixed with the data source, e.g. `sfdc_account_id`, to avoid ambiguity
```sql
-- Good
SELECT
sfdc_account.account_id AS sfdc_account_id,
zuora_account.account_id AS zuora_account_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...
-- Bad
SELECT
sfdc_account.account_id,
zuora_account.account_id AS zuora_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...
```
* When joining tables and referencing columns from both, strongly prefer to reference the full table name instead of an alias. When the table name is long (~20), try to rename the CTE if possible, and lastly consider aliasing to something descriptive.
```sql
-- Good
SELECT
budget_forecast_cogs_opex.account_id,
-- 15 more columns
date_details.fiscal_year,
date_details.fiscal_quarter,
date_details.fiscal_quarter_name,
cost_category.cost_category_level_1,
cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex
LEFT JOIN date_details
ON date_details.first_day_of_month = budget_forecast_cogs_opex.accounting_period
LEFT JOIN cost_category
ON budget_forecast_cogs_opex.unique_account_name = cost_category.unique_account_name
-- Ok, but not preferred. Consider renaming the CTE in lieu of aliasing
SELECT
bfcopex.account_id,
-- 15 more columns
date_details.fiscal_year,
date_details.fiscal_quarter,
date_details.fiscal_quarter_name,
cost_category.cost_category_level_1,
cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex bfcopex
LEFT JOIN date_details
ON date_details.first_day_of_month = bfcopex.accounting_period
LEFT JOIN cost_category
ON bfcopex.unique_account_name = cost_category.unique_account_name
-- Bad
SELECT
a.*,
-- 15 more columns
b.fiscal_year,
b.fiscal_quarter,
b.fiscal_quarter_name,
c.cost_category_level_1,
c.cost_category_level_2
FROM budget_forecast_cogs_opex a
LEFT JOIN date_details b
ON b.first_day_of_month = a.accounting_period
LEFT JOIN cost_category c
ON b.unique_account_name = c.unique_account_name
```
* All field names should be [snake-cased](https://en.wikipedia.org/wiki/Snake_case)
```sql
-- Good
SELECT
dvcecreatedtstamp AS device_created_timestamp
FROM table
-- Bad
SELECT
dvcecreatedtstamp AS DeviceCreatedTimestamp
FROM table
```
* Boolean field names should start with `has_`, `is_`, or `does_`
```sql
-- Good
SELECT
deleted AS is_deleted,
sla AS has_sla
FROM table
-- Bad
SELECT
deleted,
sla
FROM table
```
**Dates**
- Timestamps should end with `_at`, e.g. `deal_closed_at`, and should always be in UTC
- Dates should end with `_date`, e.g. `deal_closed_date`
- Months should be indicated as such and should always be truncated to a date format, e.g. `deal_closed_month`
- Always avoid key words like `date` or `month` as a column name
- Prefer the explicit date function over `date_part`, but prefer `date_part` over `extract`, e.g. `DAYOFWEEK(created_at)` > `DATE_PART(dayofweek, 'created_at')` > `EXTRACT(dow FROM created_at)`
- Note that selecting a date's part is different from truncating the date. `date_trunc('month', created_at)` will produce the calendar month ('2019-01-01' for '2019-01-25') while `SELECT date_part('month', '2019-01-25'::date)` will produce the number 1
- Be careful using [DATEDIFF](https://docs.snowflake.net/manuals/sql-reference/functions/datediff.html), as the results are often non-intuitive.
- For example, `SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` returns `1` even though the timestamps are different by one millisecond.
- Similarly, `SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999')` return `0` even though the timestamps are nearly an entire day apart.
- Using the appropriate interval with the `DATEDIFF` function will ensure you are getting the right results. For example, `DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` will provide a `1 day interval` and `DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` will provide a `1 millisecond interval`.
#### Use CTEs (Common Table Expressions), not subqueries
- [CTEs make SQL more readable and are more performant](https://www.alisa-in.tech/post/2019-10-02-ctes/)
- Use CTEs to reference other tables. Think of these as import statements
- CTEs should be placed at the top of the query
- Where performance permits, CTEs should perform a single, logical unit of work
- CTE names should be as concise as possible while still being clear
- Avoid long names like `replace_sfdc_account_id_with_master_record_id` and prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins
- CTEs with confusing or noteable logic should be commented in file and documented in dbt docs
- CTEs that are duplicated across models should be pulled out into their own models
- Leave an empty row above and below the query statement
- CTEs should be formatted as follows:
``` sql
WITH events AS ( -- think of these select statements as your import statements.
...
), filtered_events AS ( -- CTE comments go here
...
)
SELECT * -- you should always aim to "select * from final" for your last model
FROM filtered_events
```
#### General
- Indents for the main part of a query should be 4 spaces (i.e. when in a CTE)
- Further indents should be two spaces (except for predicates, which should line up with the `WHERE` keyword)
- No tabs should be used - only spaces. Your editor should be setup to convert tabs to spaces - see our [onboarding template](https://gitlab.com/gitlab-data/analytics/-/blob/master/.gitlab/issue_templates/Data%20Onboarding.md#data-grip-configuration) for more details
- Lines of SQL should be no longer than 80 characters
- Commas should be at the end-of-line (EOL) as a right comma, with the exception of temporary filters in the `WHERE` clause for specific values.
```sql
-- Good
SELECT
deleted AS is_deleted, -- EOL right comma
accountId AS account_id
FROM table
WHERE is_deleted = false
AND account_id NOT IN (
'232'
, '234' -- left comma
, '425'
)
-- Bad
SELECT
deleted AS is_deleted, -- EOL right comma
accountId AS account_id
FROM table
WHERE is_deleted = false
AND account_id NOT IN ('232', '234', '425')
```
- When `SELECT`ing, always give each column its own row, with the exception of `SELECT *` which can be on a single row
- `DISTINCT` should be included on the same row as `SELECT`
- The `AS` keyword should be used when projecting a field or table name
- When aliasing use `AS`, strive to align the original column names on a single vertical line and the `AS` keyword on a separate vertical line
- Fields should be stated before aggregates / window functions
- Ordering and grouping by a number (eg. group by 1, 2) is preferred
- Prefer `WHERE` to `HAVING` when either would suffice
- Prefer accessing JSON using the bracket syntax, e.g. `data_by_row['id']::bigint as id_value`
- **Never** use `USING` in joins. On Snowflake, it will produce inaccurate results
- Prefer `UNION ALL` to `UNION`. This is because a `UNION` could indicate upstream data integrity issue that are better solved elsewhere.
- Prefer `!=` to `<>`. This is because `!=` is more common in other programming languages and reads like "not equal" which is how we're more likely to speak
- Consider performance. Understand the difference between `LIKE` vs `ILIKE`, `IS` vs `=`, and `NOT` vs `!` vs `<>`. Use appropriately
- Prefer `lower(column) LIKE '%match%'` to `column ILIKE '%Match%'`. This lowers the chance of stray capital letters leading to an unexpected result
- Prefer using `VARCHAR` when casting this data type and not specifying a length. A column only consumes storage for the amount of actual data stored.
- Familiarize yourself with [the DRY Principal](https://docs.getdbt.com/docs/design-patterns). Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places
- **DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP. [BRAIN TIME IS EXPENSIVE.](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/)**
#### Functions
- Function names and keywords should all be capitalized
- Prefer `IFNULL` TO `NVL`
- Prefer `IFF` to a single line `CASE WHEN` statement
- Prefer `IFF` to selecting a boolean statement `(amount < 10) AS is_less_than_ten`
#### JOINs
- Be explicit when joining, e.g. use `LEFT JOIN` instead of `JOIN`. (Default joins are `INNER`)
- Prefix the table name to a column when joining, otherwise omit
- Specify the order of a join with the FROM table first and JOIN table second:
```sql
-- Good
FROM source
LEFT JOIN other_source
ON source.id = other_source.id
WHERE ...
-- Bad
FROM source
LEFT JOIN other_source
ON other_source.id = source.id
WHERE ...
```
#### Example Code
* Putting it all together:
```sql
WITH my_data AS (
SELECT *
FROM analytics.my_data
WHERE filter = 'my_filter'
), some_cte AS (
SELECT DISTINCT
id,
other_field_1,
other_field_2
FROM analytics.my_other_data
), final AS (
SELECT
data_by_row['id']::bigint AS id_field,
field_1 AS detailed_field_1,
field_2 AS detailed_field_2,
detailed_field_3,
CASE
WHEN cancellation_date IS NULL AND expiration_date IS NOT NULL
THEN expiration_date
WHEN cancellation_date IS NULL
THEN start_date+7
ELSE cancellation_date
END AS cancellation_date,
LAG(detailed_field_3) OVER (
PARTITION BY
id_field,
detailed_field_1
ORDER BY cancellation_date
) AS previous_detailed_field_3,
SUM(field_4) AS field_4_sum,
MAX(field_5) AS field_5_max
FROM my_data
LEFT JOIN some_cte
ON my_data.id = some_cte.id
WHERE field_1 = 'abc'
AND (field_2 = 'def' OR field_2 = 'ghi')
GROUP BY 1, 2, 3, 4, 5
HAVING COUNT(*) > 1
ORDER BY 4 DESC
)
SELECT *
FROM final
```
#### Commenting
* When making single line comments in a model use the `--` syntax
* When making multi-line comments in a model use the `/* */` syntax
* Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long
* dbt model comments should live in the model documentation
* Calculations made in SQL should have a brief description of what's going on and a link to the handbook defining the metric (and how it's calculated)
* Instead of leaving `TODO` comments, create new issues for improvement
### Other SQL Style Guides
- [Fishtown Analytics](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#sql-style-guide)
- [Matt Mazur](https://github.com/mattm/sql-style-guide)
- [Kickstarter](https://gist.github.com/fredbenenson/7bb92718e19138c20591)