Removing duplicates is a common data cleaning task. It is also a very common interview question. Questions about removing duplicate data are common for jobs that need individuals to have Microsoft Excel skills. But even if you are not preparing for an interview, learning how to clean datasets will not hurt anyone.
This article will discuss three methods of removing duplicate data from datasets. You will understand the merits and demerits of each method. You will also see why one of the methods might give you a different result when compared to the others.
5 methods of removing duplicate entries in Microsoft Excel:
- Removing duplicates using Power Query
- Removing duplicates using the Remove Duplicates feature
- Removing duplicates using the UNIQUE function
- Removing duplicates using conditional formatting and filter option
- Removing duplicates using the COUNTIF function and filter option
Method #1: Removing Duplicates Using Power Query
Power Query is a powerful tool for removing duplicates from a dataset. The tool is more powerful than other methods because it updates automatically. If you add more data to the data source, simply refresh the table prepared by Power Query. Power Query will check the entire dataset and remove any duplicates. Here you will learn how to use Power Query in eight steps.
Step 1: Open Power Query
This step varies depending on your version of Microsoft Excel. If you use an older version of Microsoft Excel (2010 or 2013), you need to install the Power Query add-on.
If you use a newer version (2016), open Power Query by clicking on Data in the top menu tab.
New version (Excel 2016):
Old version (Excel 2013):
Step 2: Input data for processing
This step is similar for both versions of Excel. To input data, do the following:
- Select all the columns where you have your data.
- Select your data source from the top ribbon:
- For the newer versions (Excel 2016 and later), select (Get Data) From Sheet
- For the older versions (Excel 2013 and earlier), select From Table/Range
New Excel:
Older Excel:
The Power Query editor will open.
Step 3: Remove Duplicates
Now that the Power Query editor is open:
- Hold the shift button on your keyboard and click on the last column in your dataset.
- Right-click on the last column. You will get a pop-up menu.
- In the pop-up menu, click on Remove Duplicates.
Step 4: Close & Load
This is the final step within the Power Query editor.
- Click the small arrow below the Close & Load icon. You will get a pop-up menu.
- From the pop-up menu, select the Close & Load To… option.
Step 5: Select the destination
To select the location of your new table, do the following:
- Select an empty column.
- A window will appear titled ‘Load To’.
- Make sure you have indicated where the data must go to, but selecting either of the options under ‘Select where the data should be loaded.’
- Click on the Load button.
You will now get a new table free of duplicates.
However, note that the Power Query method of removing duplicates is case-sensitive. As a result, duplicate data with different cases will not be removed. For example, look at the image below.
While this may be desirable in certain circumstances, it can be problematic in other cases. To get around this, follow the next step.
Step 6: Change the steps in Power Query
Open the Power Query.
On the right-hand side of Power Query, you will see the Applied Steps option. In that option, do the following:
- Click on Change Type. The data will return to its former state; it will contain duplicates.
Step 7: Transform the data
Do this to put all the data in the same case:
- From the top ribbon, select the Transform tab.
- Select the column with the inconsistent cases.
- Click on Format in the Power Query ribbon.
- Next, click on Capitalize Each Word.
You will be asked if you are sure you want to insert a step.
Click on the Insert button.
Step 8: Select Remove Duplicates
This is the final step in the Power Query method.
- You will find the Applied Steps option on the right side of Power Query.
- Click on Remove Duplicates.
- Select Close and Load from the top-left corner of the page.
- Closing the Power Query will create a new table.
You will get a table that is free of duplicates and the casing issue.
The Power Query method is quite complex. However, the methods given below are easier to follow.
Method #2: Removing Duplicates Using the Remove Duplicates Icon
A very quick and efficient way to remove duplicates, if it is a once-off exercise. To show you how it works, we suggest you make a copy of your spreadsheet first. Doing this will allow you to compare the final output with your original dataset.
Step 1: Create a copy of your dataset
- Select your entire dataset. To do this, select the first column while holding the Shift button.
- Next, press the right direction arrow button. Keep on pressing it until the selected area covers the dataset’s columns containing duplicate data.
- Press CTRL + C on your keyboard to copy the data.
- Click on another blank area in your spreadsheet where you want the copy to be placed.
- Press CTRL + V to paste the data into these empty columns.
Step 2: Click Data
With your copied columns still selected (i.e., don’t click elsewhere on your spreadsheet) select the Data tab in the top menu.
This will reveal the Remove Duplicates icon in the top ribbon.
Next, click on the icon to get a pop-up menu.
Step 3: Select the columns that contain duplicates
- Check the boxes of all the columns that contain duplicate data.
- Check the My data has headers box.
- Finally, click on the OK button.
Step 4: Compare the new dataset with the old dataset
Excel has removed the duplicates.
Compare with your original table to see if all the duplicates were omitted.
Method #3: Removing Duplicates Using the UNIQUE Function
The UNIQUE function works with Office365. It also works with Microsoft Web. This function is a formula. To use it, follow the steps given below.
Step 1: Enter the UNIQUE formula into an empty cell
First, select an empty cell and enter this formula [=UNIQUE()] into it.
Step 2: Enter the table array
Enter the array of your data inside the formula and press Enter.
The syntax in our example below is [=UNIQUE(A1:C20)].
You will get a second dataset that has no duplicates.
Step 3: Copy the format of the dataset
The UNIQUE function doesn’t copy the format of the dataset by default. To copy the format, do the following:
- Select all the columns in the original dataset with duplicates.
- Click on the Format Painter icon which is on the main ribbon.
- Now, select all the columns in the new dataset.
The UNIQUE function updates when you insert extra rows into the dataset.
Method #4: Removing Duplicates Using Conditional Formatting and Filter Option
Step 1: Select the data
Drag your cursor across the range of cells.
Step 2: Highlight duplicate values using conditional formatting
Under the Home tab, click on the Conditional Formatting icon in the ribbon.
A dropdown menu will appear.
Move your mouse over Highlight Cells Rules. And then to the right where more options will appear.
Select Duplicate Values…
The dialog box that appears will ask you what color you want to apply to the duplicates.
Should you wish to, you can change the color using the dropdown.
Click on the OK Button.
This will highlight all the duplicate data in the selected range of cells.
Step 3: Create a filter to remove duplicates
Select the header cells only, in our example that would be ‘Sales Agent’ and ‘Sales’.
Once done, under the ‘Home’ tab, click on the Sort & Filter icon.
Select Filter.
Next to every title that you have selected, there will now be a small square with a downward pointing triangle. This is the filter icon.
Clicking on the Filter icon will give you a drop-down menu.
Click on any of the arrows and select ‘Filter by Color’.
Tap on the color that you used to highlight the duplicates, in our example that would be pink.
Now your data will be filtered and show you only the duplicates.
Step 4: Delete the duplicate rows manually
Now it is easy to delete the duplicates. Right-click on the row number on the left and select Delete Row.
Once you have deleted the duplicates, the highlights will be gone as there are no duplicates.
Step 5: Remove the conditional formatting and the filter
With the task complete, you need to remove the formatting and filter.
To remove the filter, click on the same Filter Icon.
Select Clear filter.
This would remove the filter and you can now see all the rows again.
To remove the formatting, click on Conditional Formatting in the ribbon under the Home tab.
From the dropdown that appears, move your mouse down to Clear Rules.
More options will appear, from which you will select Clear Rules from Entire Sheet.
This would remove all the formatting and restore your sheet without duplicates.
You can now see your data without the duplicates.
Method #5: Removing Duplicates Using the COUNTIF Function and Filter Option
Step 1: Add another column to find the duplicates
You need to use another column for the duplicates. In our example, we’ve used column C to identify the duplicates and have given it a heading – Duplicates.
Step 2: Apply the COUNTIF function
The syntax for the COUNTIF function is [=COUNTIF(range,criteria)].
- In our example the range is A2 to B16.
- We use our first cell as the criteria.
- Whenever there’s a duplicate, the COUNTIF function increments its count in column C.
- If the Duplicates column shows the number 1, that means it’s a unique value.
- Any number other than that is the number of occurrences of that value.
- Here, the function would be =COUNTIF(A2:B16,A2)
Press Enter and you will see the result for the first row.
You can now apply that formula to the remaining cells of the Duplicates column.
Step 3: Filter the duplicate data
Select all the rows of your table.
In the ribbon under the Home tab, click on the Sort & Filter icon.
Next to every title that you have selected, there will now be a small square with a downward pointing triangle. This is the filter icon.
Clicking on the Filter icon will give you a drop-down menu.
Select the Filter icon of the Duplicates column.
Uncheck the checkbox with the number 1 in. Leave everything else ticked.
Click on the OK button.
Step 4: Delete the duplicate data
Select all the row numbers on the left that are currently showing.
Right-click, and from the menu that appears, select Delete Row.
This would delete all the duplicates.
Step 5: Remove the filter
Tap on the Filter icon again and select Clear Filter From “Duplicates.”
This will show you your data without any duplicates.
Wrapping up
Removing duplicates is a vital step in cleaning up any dataset and can help you make more sense of your data.
In some instances, you may need duplicate data to get useful insights out of your dataset.
Knowing how to remove duplicates is a valuable skill in a data analyst’s skill set.
The UNIQUE function is the easiest method. However, it does not work on older Microsoft Excel versions.
You can use the online Excel version at office.com if you don’t have a newer version.