How To Remove Duplicates In SQL?

Table of Contents

Using DISTINCT to Select Unique Rows
Using ROW_NUMBER() for More Complex Duplicates Removal
Using GROUP BY to Aggregate Duplicates
Using DELETE with JOIN
Notes:

You can remove duplicates in SQL in several ways. The right one depends on whether you only need de-duplicated query results or want to delete rows from the table, and on the SQL dialect you're using. Below are common approaches.

Using `DISTINCT` to Select Unique Rows

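To select unique rows from a table, you can use the DISTINCT keyword. DISTINCT applies to the combination of all listed columns, and it only filters the query result; the rows stored in the table are not changed: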

SELECT DISTINCT column1, column2, ...
FROM table_name;
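
To see this behavior end to end, here is a minimal sketch that runs the query through Python's built-in sqlite3 module against an in-memory database. The employees table and its sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical employees table (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (name, email) VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Ann", "ann@example.com"),  # same name and email as the first row
        ("Bob", "bob@example.com"),
    ],
)

# DISTINCT de-duplicates on the combination of the selected columns.
unique_rows = conn.execute(
    "SELECT DISTINCT name, email FROM employees ORDER BY name"
).fetchall()

# The table itself still holds every row, duplicates included.
total_rows = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(unique_rows)  # two distinct (name, email) pairs
print(total_rows)   # 3
```

Note that adding id to the SELECT list would make every row distinct again, because each id is unique.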

Using `ROW_NUMBER()` for More Complex Duplicates Removal

If you need to delete duplicate rows based on specific columns, you can use the ROW_NUMBER() window function. The process has two steps:

Identify Duplicates: Create a CTE (Common Table Expression) or subquery to identify duplicates using ROW_NUMBER().

Delete Duplicates: Delete the rows that are identified as duplicates.

Example:

Assume you have a table employees with columns id, name, and email, and you want to remove rows where name and email are duplicated, keeping the row with the lowest id in each group.

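-- Step 1: number the rows within each (name, email) group, lowest id first
WITH CTE AS (
    SELECT
        id,
        name,
        email,
        ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS row_num
    FROM employees
)
-- Step 2: delete every row except the first one in its group
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE row_num > 1
);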
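
The statement above can be tried out with a small, self-contained sketch. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first release with window functions), and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample data: ids 2 and 4 repeat the name/email of id 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Ann", "ann@example.com"),
    ],
)

# Rank rows inside each (name, email) group, then delete everything ranked 2 or higher.
conn.execute(
    """
    WITH ranked AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS row_num
        FROM employees
    )
    DELETE FROM employees
    WHERE id IN (SELECT id FROM ranked WHERE row_num > 1)
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT id, name, email FROM employees ORDER BY id"
).fetchall()
print(remaining)  # only ids 1 and 3 are left
```

Changing ORDER BY id to ORDER BY id DESC inside the window would keep the newest row of each group instead of the oldest.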

Using `GROUP BY` to Aggregate Duplicates

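Another approach is to use GROUP BY to collapse each set of duplicates into a single row and write the result to a new table or temporary table. Keep in mind that a table created with CREATE TABLE ... AS SELECT typically does not inherit the original table's primary key, indexes, or other constraints, so recreate them afterwards.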

Example:

-- Create a new table without duplicates
CREATE TABLE employees_unique AS
SELECT 
    MIN(id) AS id, 
    name, 
    email
FROM employees
GROUP BY name, email;

-- Drop the original table (required before the rename below; make sure you have a backup)
DROP TABLE employees;

-- Rename the new table to the original table name
ALTER TABLE employees_unique RENAME TO employees;
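
A minimal sketch of this rebuild, run through Python's built-in sqlite3 module on an in-memory database, might look like the following. The sample rows are assumptions, and the final check relies on SQLite's PRAGMA table_info to show that the rebuilt table no longer has a primary key.

```python
import sqlite3

# Hypothetical sample data with one duplicated (name, email) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Bob", "bob@example.com"),
        (3, "Ann", "ann@example.com"),
    ],
)
conn.commit()

# Build the de-duplicated copy, then swap it in under the original name.
conn.executescript(
    """
    CREATE TABLE employees_unique AS
    SELECT MIN(id) AS id, name, email
    FROM employees
    GROUP BY name, email;

    DROP TABLE employees;

    ALTER TABLE employees_unique RENAME TO employees;
    """
)

rows = conn.execute("SELECT id, name, email FROM employees ORDER BY id").fetchall()

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
pk_columns = [col[1] for col in conn.execute("PRAGMA table_info(employees)") if col[5]]

print(rows)        # one row per (name, email) pair
print(pk_columns)  # empty: the primary key was not carried over
```

Because the key is lost in the copy, a real migration would re-add the primary key and any indexes after the rename.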

Using `DELETE` with `JOIN`

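If you prefer to delete duplicates directly without creating a new table, you can use a self-join. The multi-table DELETE syntax shown here works in MySQL and SQL Server; PostgreSQL expresses the same idea with DELETE ... USING instead: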

Example:

DELETE e1
FROM employees e1
JOIN employees e2
ON e1.name = e2.name 
AND e1.email = e2.email 
AND e1.id > e2.id;
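
Not every engine accepts a join inside DELETE. SQLite, used here only as a convenient test bed through Python's built-in sqlite3 module, does not, so the sketch below swaps in a correlated EXISTS subquery that expresses the same condition: delete a row when another row with the same name and email has a lower id. The sample rows are made up.

```python
import sqlite3

# Hypothetical sample data: ids 2 and 4 duplicate id 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Ann", "ann@example.com"),
    ],
)

# Substitute for the self-join: a row is deleted if an older twin (lower id) exists.
cursor = conn.execute(
    """
    DELETE FROM employees
    WHERE EXISTS (
        SELECT 1
        FROM employees AS e2
        WHERE e2.name = employees.name
          AND e2.email = employees.email
          AND e2.id < employees.id
    )
    """
)
deleted_count = cursor.rowcount
conn.commit()

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(deleted_count)  # 2
print(remaining_ids)  # [1, 3]
```

The row with the lowest id in each group never matches the condition, so it always survives, which mirrors the e1.id > e2.id comparison in the join version.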

Notes:

Backup Your Data: Always ensure you have a backup of your data before performing delete operations.
Test Your Queries: Test your queries on a small subset of data to ensure they work as expected.
Database-Specific Syntax: SQL syntax varies between databases (MySQL, PostgreSQL, SQL Server, etc.). Be sure to adapt each statement to your specific SQL dialect.
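
One way to act on the testing advice above is a dry run: list the duplicated groups first, then run the delete inside a transaction and roll it back after inspecting the counts. The sketch below assumes Python's built-in sqlite3 module with an in-memory database and made-up sample rows; the delete statement is just one possible variant.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction boundaries below are fully explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Ann", "ann@example.com"),
    ],
)

# Step 1: preview which (name, email) pairs are duplicated and how many copies exist.
duplicate_groups = conn.execute(
    """
    SELECT name, email, COUNT(*) AS copies
    FROM employees
    GROUP BY name, email
    HAVING COUNT(*) > 1
    """
).fetchall()

# Step 2: run the delete inside a transaction and inspect the effect.
conn.execute("BEGIN")
deleted = conn.execute(
    """
    DELETE FROM employees
    WHERE id NOT IN (SELECT MIN(id) FROM employees GROUP BY name, email)
    """
).rowcount
rows_after_delete = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# Step 3: undo the change; replace ROLLBACK with COMMIT once the numbers look right.
conn.execute("ROLLBACK")
rows_after_rollback = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(duplicate_groups)
print(deleted, rows_after_delete, rows_after_rollback)
```

Here the number of deleted rows should equal the sum of (copies - 1) over the previewed groups; a mismatch is a sign that the delete condition and the duplicate definition disagree.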

By using these methods, you can effectively remove duplicate rows from your SQL tables and maintain data integrity.