Removing duplicates in SQL can be done using various methods depending on the specific requirements and SQL dialect you're using. Below are common approaches for removing duplicates from a table.
Using DISTINCT to Select Unique Rows
To select unique rows from a table, use the DISTINCT keyword. It removes duplicates from the query result only; the rows stored in the table are left unchanged:
SELECT DISTINCT column1, column2, ...
FROM table_name;
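As a minimal sketch of this behavior, the snippet below runs SELECT DISTINCT against an in-memory SQLite database through Python's standard sqlite3 module. The employees table and its sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def distinct_rows(rows):
    """Load rows into a scratch table and return the DISTINCT (name, email) pairs."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO employees (id, name, email) VALUES (?, ?, ?)", rows)
    unique = conn.execute(
        "SELECT DISTINCT name, email FROM employees ORDER BY name, email"
    ).fetchall()
    # DISTINCT filters the result set only; the stored rows are untouched.
    stored = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
    conn.close()
    return unique, stored


# Hypothetical sample data: ids 1 and 2 share the same name and email.
sample = [
    (1, "Ann", "ann@example.com"),
    (2, "Ann", "ann@example.com"),
    (3, "Bob", "bob@example.com"),
]
unique, stored = distinct_rows(sample)
print(unique)  # two distinct (name, email) pairs
print(stored)  # the table still holds all three rows
```

Note that id is left out of the SELECT list: because every id is different, including it would make every row distinct.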
Using ROW_NUMBER() for More Complex Duplicate Removal
If you need to delete duplicate rows based on specific columns, you can use the ROW_NUMBER() window function. The approach has two steps:
1. Identify Duplicates: Create a CTE (Common Table Expression) or subquery that numbers the rows within each group of duplicates using ROW_NUMBER().
2. Delete Duplicates: Delete every row numbered above 1, which keeps exactly one row per group.
Example:
Assume you have a table employees with columns id, name, and email, and you want to remove rows where name and email are duplicated.
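-- Number the rows within each (name, email) group; the lowest id gets row_num = 1
WITH CTE AS (
    SELECT
        id,
        name,
        email,
        ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS row_num
    FROM employees
)
-- Keep row 1 of each group and delete the rest
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE row_num > 1
);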
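To try the ROW_NUMBER() approach end to end, here is a small sketch using Python's standard sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the sample rows and the function name are made up for illustration.

```python
import sqlite3


def delete_duplicates_with_row_number(conn):
    """Delete every row whose (name, email) pair already appears with a smaller id."""
    conn.execute(
        """
        WITH ranked AS (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS row_num
            FROM employees
        )
        DELETE FROM employees
        WHERE id IN (SELECT id FROM ranked WHERE row_num > 1)
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# Hypothetical sample data: ids 2 and 4 duplicate id 1 on (name, email).
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Ann", "ann@example.com"),
    ],
)

delete_duplicates_with_row_number(conn)
remaining = conn.execute("SELECT id, name, email FROM employees ORDER BY id").fetchall()
print(remaining)  # only ids 1 and 3 are left
```

The ORDER BY inside the OVER clause decides which row survives: ordering by id keeps the oldest row, and ORDER BY id DESC would keep the newest instead.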
Using GROUP BY to Aggregate Duplicates
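Another approach is to use GROUP BY to group the rows and then insert the distinct rows into a new table or temporary table. Note that CREATE TABLE ... AS copies only the data, so any indexes and constraints from the original table must be recreated on the new one.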
Example:
-- Create a new table without duplicates, keeping the lowest id per (name, email)
CREATE TABLE employees_unique AS
SELECT
    MIN(id) AS id,
    name,
    email
FROM employees
GROUP BY name, email;
-- Drop the original table so its name is free for the rename below
DROP TABLE employees;
-- Rename the new table to the original table name
ALTER TABLE employees_unique RENAME TO employees;
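The rebuild can be sketched against an in-memory SQLite database with Python's standard sqlite3 module. The sample rows and helper names are hypothetical; the primary-key check is included to illustrate that, in SQLite at least, a table created with CREATE TABLE ... AS carries no constraints.

```python
import sqlite3


def primary_key_columns(conn):
    """Return the names of the columns that form the employees primary key."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute("PRAGMA table_info(employees)") if row[5]]


def rebuild_without_duplicates(conn):
    """Replace employees with a copy holding one row per (name, email)."""
    conn.executescript(
        """
        CREATE TABLE employees_unique AS
            SELECT MIN(id) AS id, name, email
            FROM employees
            GROUP BY name, email;
        DROP TABLE employees;
        ALTER TABLE employees_unique RENAME TO employees;
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# Hypothetical sample data: id 2 duplicates id 1 on (name, email).
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
    ],
)

pk_before = primary_key_columns(conn)
rebuild_without_duplicates(conn)
pk_after = primary_key_columns(conn)
rows = conn.execute("SELECT id, name, email FROM employees ORDER BY id").fetchall()
print(rows)                 # one row per (name, email)
print(pk_before, pk_after)  # the primary key is gone after the rebuild
```

If the table needs its key or indexes back, one option is to create employees_unique with an explicit CREATE TABLE definition first and fill it with INSERT ... SELECT instead.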
Using DELETE with JOIN
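If you prefer to delete duplicates directly without creating a new table, you can use a self-join. The DELETE ... JOIN form shown here is accepted by MySQL and SQL Server; PostgreSQL expresses the same idea with DELETE ... USING instead: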
Example:
DELETE e1
FROM employees e1
JOIN employees e2
    ON e1.name = e2.name
    AND e1.email = e2.email
    AND e1.id > e2.id;
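SQLite does not accept the DELETE ... JOIN form, so the runnable sketch below swaps in a correlated EXISTS subquery that applies the same rule: a row is deleted when another row has the same name and email and a smaller id. This is a substitute technique rather than the statement above, and the sample rows and function name are hypothetical.

```python
import sqlite3


def delete_duplicates_with_exists(conn):
    """Delete rows for which an earlier row (smaller id) has the same name and email."""
    conn.execute(
        """
        DELETE FROM employees
        WHERE EXISTS (
            SELECT 1
            FROM employees AS e2
            WHERE e2.name = employees.name
              AND e2.email = employees.email
              AND e2.id < employees.id
        )
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# Hypothetical sample data: id 2 duplicates id 1; ids 4 and 5 match only if NULLs are equal.
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Cid", None),
        (5, "Cid", None),
    ],
)

delete_duplicates_with_exists(conn)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(remaining_ids)  # id 2 is removed; ids 4 and 5 both survive
```

The two rows with a NULL email both survive because NULL = NULL is not true in SQL, so an equality comparison never matches them. The same applies to the equality conditions in the self-join, whereas GROUP BY and PARTITION BY place NULLs in a single group.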
Notes:
- Backup Your Data: Always ensure you have a backup of your data before performing delete operations.
- Test Your Queries: Test your queries on a small subset of data to ensure they work as expected.
- Database-Specific Syntax: SQL syntax varies between databases (MySQL, PostgreSQL, SQL Server, etc.). Be sure to adapt the syntax to your specific SQL dialect.
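One way to act on the testing advice above is a dry run: execute the DELETE inside a transaction, count what it removed, and roll back. The sketch below does this with Python's standard sqlite3 module; the dry_run_delete helper, the sample rows, and the particular DELETE statement are illustrative assumptions, not part of any library.

```python
import sqlite3


def dry_run_delete(conn, delete_sql):
    """Run delete_sql in a transaction, return how many rows it removed, then roll back."""
    before = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
    conn.execute("BEGIN")
    try:
        conn.execute(delete_sql)
        after = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
    finally:
        conn.execute("ROLLBACK")  # undo the delete whether or not it succeeded
    return before - after


# isolation_level=None puts the connection in autocommit mode, so the only
# transaction is the explicit BEGIN/ROLLBACK pair inside dry_run_delete.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# Hypothetical sample data: ids 2 and 4 duplicate id 1 on (name, email).
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann", "ann@example.com"),
        (3, "Bob", "bob@example.com"),
        (4, "Ann", "ann@example.com"),
    ],
)

would_delete = dry_run_delete(
    conn,
    "DELETE FROM employees "
    "WHERE id NOT IN (SELECT MIN(id) FROM employees GROUP BY name, email)",
)
still_stored = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(would_delete)  # number of rows the statement would remove
print(still_stored)  # unchanged row count after the rollback
```

A rollback is not a substitute for a backup, but it gives a quick check that a statement touches the expected number of rows before it is run for real.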
By using these methods, you can effectively remove duplicate rows from your SQL tables and maintain data integrity.