Article

Table of Contents
Theme:
Was this article helpful?
Try Vultr Today with

$50 Free on Us!

Want to contribute?

You could earn up to $300 by adding new articles!

Find and Delete Duplicate Rows in MySQL with GROUP BY and HAVING Clauses

Author: Francis Ndungu

Last Updated: Sat, Feb 20, 2021
MySQL and MariaDB Programming

Introduction

Sometimes, you might see more than one copy of the same record appearing in a MySQL database table. This usually happens after importing data from external sources such as spreadsheet applications that don't have the features of relational databases. Having identical copies of the same record may negatively affect your application and business logic. For instance, if customers get registered multiple times in your invoicing software, this might complicate the process of allocating credit limits. To overcome this challenge, you must delete the duplicate customers' records from your database.

In this guide, you'll create a test_db database and a sample customers table. You'll then populate the table with a random list of customers' details containing some duplicates. Finally, you'll use the MySQL GROUP BY and HAVING clauses to find and DROP duplicate rows.

Prerequisites

To follow along with this guide, make sure you have the following:

  • An Ubuntu 20.04 server.
  • A sudo user.
  • A LAMP Stack. To test this guide, you only need to install the MySQL database server.

Create a test_db Database

First, SSH to your server and run the command below to log in to your MySQL database server as root.

$ sudo mysql -u root -p

When prompted, enter the root password of your MySQL server and press ENTER to continue. Then, run the command below to create a test_db database.

mysql> CREATE DATABASE test_db;

Use the test_db database.

mysql> USE test_db;

Next, set up a customers table. In this table, you'll uniquely identify the customers by referring to the customer_id column, which should be auto-populated since you'll define it with a PRIMARY KEY statement. The table will then record the first_name, last_name and phone number of the customer.

Create the customers table.

mysql> CREATE TABLE customers (
       customer_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
       first_name VARCHAR(50),
       last_name VARCHAR(50),
       phone VARCHAR(15)
       ) ENGINE = InnoDB;

Next, populate the customers table with some records. Please note that it is possible for you to INSERT details of the same customer multiple times in this table. As earlier indicated, you might be importing some records from an external data source that lacks the functionalities of a relational database.

Run the INSERT commands one by one to populate the table.

mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JOHN', 'DOE', '111111');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('MARY', 'SMITH', '222222');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JACOB', 'JAMES', '444444');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('STEVE', 'JACKES', '888888');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('MARY', 'SMITH', '222222');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('RIAN', 'WHITE', '101010');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('APPLE', 'GRABRIEL', '242424');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('RIAN', 'WHITE', '101010');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JACOB', 'JAMES', '444444');
mysql> INSERT INTO customers(first_name, last_name, phone) VALUES ('JOHN', 'DOE', '111111');

After each INSERT command, you should get the following output indicating that the record was inserted successfully.

...
Query OK, 1 row affected (0.01 sec)

Ensure the customer's details are in place by running the following SELECT statement against the customers table.

mysql> SELECT
       customer_id,
       first_name,
       last_name,
       phone
       FROM customers;

Confirm the output below.

+-------------+------------+-----------+--------+
| customer_id | first_name | last_name | phone  |
+-------------+------------+-----------+--------+
|           1 | JOHN       | DOE       | 111111 |
|           2 | MARY       | SMITH     | 222222 |
|           3 | JACOB      | JAMES     | 444444 |
|           4 | STEVE      | JACKES    | 888888 |
|           5 | MARY       | SMITH     | 222222 |
|           6 | RIAN       | WHITE     | 101010 |
|           7 | APPLE      | GRABRIEL  | 242424 |
|           8 | RIAN       | WHITE     | 101010 |
|           9 | JACOB      | JAMES     | 444444 |
|          10 | JOHN       | DOE       | 111111 |
+-------------+------------+-----------+--------+
10 rows in set (0.00 sec)

As you can see from the output above, JOHN DOE's, MARY SMITH's, RIAN WHITE and JACOB JAMES' records have been repeated. You can notice this difference by physically examining the output from the SELECT statement above because you just have a few records in your table.

Your table might contain thousands or millions of records in a production environment, and finding and eliminating duplicates manually may not be a feasible solution. In the next step, you'll learn how to do this using a single SQL statement.

Determine the Duplicate Rows

In MySQL, you can find the duplicate rows by executing a GROUP BY clause against the target column and then using the HAVING clause to check a group having more than 1 record.

For instance, to find duplicates in your sample customers table, use the MySQL CONCAT statement to concatenate the first_name, last_name and phone fields to a single derived column that you'll use to group the records. Then, use the HAVING clause to check any resulting group having multiple records.

When completed, your SQL syntax should be similar to the following statement.

mysql> SELECT
       customer_id,
       first_name,
       last_name,
       phone,
       COUNT(CONCAT(first_name, last_name, phone)) as total_count
       FROM
       customers
       GROUP BY
       CONCAT(first_name, last_name, phone)
       HAVING total_count > 1 ;

Execute the command above. Then, examine the output below. As you can see from the following list, the SQL command has retrieved all customers having more than one record.

+-------------+------------+-----------+--------+-------------+
| customer_id | first_name | last_name | phone  | total_count |
+-------------+------------+-----------+--------+-------------+
|           1 | JOHN       | DOE       | 111111 |           2 |
|           2 | MARY       | SMITH     | 222222 |           2 |
|           3 | JACOB      | JAMES     | 444444 |           2 |
|           6 | RIAN       | WHITE     | 101010 |           2 |
+-------------+------------+-----------+--------+-------------+
4 rows in set (0.01 sec)

The next step is determining the rows that should remain when the duplicates are deleted. Use the command below to get the PRIMARY KEYs of these rows.

mysql> SELECT MIN(customer_id)
       FROM CUSTOMERS
       GROUP BY CONCAT(first_name, last_name, phone);

Essentially, the MySQL MIN() function returns the first record in each group of the customer's records when grouped by the concatenated column. If a customer name is repeated several times, the MIN() function ensures that only the PRIMARY KEY of the record that appears first for each group is retrieved, as shown below.

+------------------+
| MIN(customer_id) |
+------------------+
|                1 |
|                2 |
|                3 |
|                4 |
|                6 |
|                7 |
+------------------+
6 rows in set (0.00 sec)

Drop the Duplicate Rows

Once you've determine your clean list of the customer_ids, get rid of the duplicate rows by telling MySQL to delete all records from the customers table(DELETE FROM customers....) EXCEPT those that appear in the whitelist. In other words, delete everything from the customers table apart from the whitelisted records(NOT IN).

Please note, in MySQL, you can't modify the same table when using a SELECT command against it in the same SQL statement. To avoid encountering an error, nest the ....SELECT MIN()... statement one level deeper as shown below.

mysql> DELETE FROM customers
       WHERE customer_id NOT IN
       (
           SELECT
           customer_id
           FROM
           (
               SELECT MIN(customer_id) as customer_id
               FROM CUSTOMERS
               GROUP BY CONCAT(first_name, last_name, phone)
           ) AS duplicate_customer_ids
       );

Once you execute the command above, the MySQL server should delete the 4 duplicate records as you can confirm from the output below.

Query OK, 4 rows affected (0.01 sec)

Ensure that your customers table doesn't contain any more duplicates by executing the GROUP BY and HAVING statement one more time.

mysql> SELECT
       customer_id,
       first_name,
       last_name,
       phone,
       COUNT(CONCAT(first_name, last_name, phone)) as total_count
       FROM
       customers
       GROUP BY
       CONCAT(first_name, last_name, phone)
       HAVING total_count > 1 ;

This time around, you should get an Empty set since there are no duplicates.

Empty set (0.00 sec)

Query the customers table again.

mysql> SELECT
       customer_id,
       first_name,
       last_name,
       phone
       FROM customers;

Even by physically examining the list below, you can see that the duplicate records have been removed completely.

+-------------+------------+-----------+--------+
| customer_id | first_name | last_name | phone  |
+-------------+------------+-----------+--------+
|           1 | JOHN       | DOE       | 111111 |
|           2 | MARY       | SMITH     | 222222 |
|           3 | JACOB      | JAMES     | 444444 |
|           4 | STEVE      | JACKES    | 888888 |
|           6 | RIAN       | WHITE     | 101010 |
|           7 | APPLE      | GRABRIEL  | 242424 |
+-------------+------------+-----------+--------+
6 rows in set (0.00 sec)

Conclusion

In this guide, you've learned how to use the MySQL GROUP BY and HAVING clauses to find and delete duplicate records in a database. Use the logic each time you import data from non-relational data sources to eliminate identical rows that might interfere with your business logic.

Want to contribute?

You could earn up to $300 by adding new articles