Mati Codes
# _  _ ____ ___ _    ____ ____ ___  ____ ____ #
# |\/| |__|  |  |    |    |  | |  \ |___ [__  #
# |  | |  |  |  |    |___ |__| |__/ |___ ___] #
#                                             #

July 8, 2020

HOW TO REMOVE DUPLICATES FROM A LIST IN PYTHON


Let's look at some Pythonic ways to remove duplicates from a Python list.

Often when you're manipulating and pipelining data, you'll want to reduce a list to just its unique values by removing duplicates. A simple and Pythonic way to do that is to use a set. A Python set is, by definition, a collection of unique items, so it can't contain duplicates.

nums = [5, 3, 1, 1, 2, 5]
# convert the list to a set and back again
unique = list(set(nums))

# Result (set ordering is arbitrary, so yours may differ):
>>> nums
[5, 3, 1, 1, 2, 5]
>>> unique
[1, 2, 3, 5]

This is my favorite way to remove duplicates because it's easy to understand and intuitive. Unfortunately, a set doesn't preserve the original order of the items, and in some cases that order matters. To remove duplicates and keep the order, you can convert the list into a dictionary instead. The items in your list become keys in a dictionary, and since dictionary keys must be unique, converting the keys back into a list gives you the same items without duplicates.

# Python 2.7 and 3.5 or earlier: use OrderedDict
from collections import OrderedDict
nums = [5, 3, 1, 1, 2, 5]
unique = list(OrderedDict.fromkeys(nums))

# Python 3.7+: dicts are guaranteed to keep insertion order
# (in CPython 3.6 this was only an implementation detail)
nums = [5, 3, 1, 1, 2, 5]
unique = list(dict.fromkeys(nums))

# Result:
>>> nums
[5, 3, 1, 1, 2, 5]
>>> unique
[5, 3, 1, 2]
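
If you need this in more than one place, it may be handy to wrap the dictionary trick in a small helper. Below is a minimal sketch, assuming Python 3.7+ so that plain dicts keep insertion order. The name unique_in_order and the sample data are made up for illustration. It also shows what happens when the items can't be used as dictionary keys.

```python
def unique_in_order(items):
    """Return a new list without duplicates, keeping first occurrences."""
    # Dict keys are unique and (on Python 3.7+) keep insertion order
    return list(dict.fromkeys(items))


words = ["pear", "apple", "pear", "fig", "apple"]
unique_words = unique_in_order(words)  # ['pear', 'apple', 'fig']

# Lists are unhashable, so they can't become dict keys
error_message = None
try:
    unique_in_order([[1, 2], [1, 2]])
except TypeError as err:
    error_message = str(err)
```

Tuples are hashable, so converting inner lists to tuples first is one possible way around the TypeError.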

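Both of these methods require hashable objects. If you're trying to remove duplicates of custom objects that you created (e.g. Student objects), then overriding the __eq__ method and doing it the old-fashioned way, with a loop, is probably the most straightforward. Keep in mind that in Python 3 a class that defines __eq__ without also defining __hash__ is unhashable, so the set and dictionary approaches won't work on it as written.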

class Student(object):
    def __init__(self, first, last, id):
        self.first = first
        self.last = last
        self.id = id
    def __eq__(self, other):
        """
        You need to override this function
        so that the "in" operator is able
        to properly compare your complex objects
        """
        return self.id == other.id

# Create your list of students
students = [
    Student("John", "Smith", 0),
    Student("Mary", "Wallace", 1),
    Student("Susie", "Ford", 2),
    Student("Mary", "Wallace", 1)
]

# Loop through the list and only add
# students you haven't seen to a new list
unique = []
for student in students: 
    if student not in unique: 
        unique.append(student)

# Result:
>>> len(unique)
3
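
If your objects can sensibly be made hashable, another option is to define __hash__ alongside __eq__ so that the dictionary approach works on them too. The sketch below is a variation on the Student class, not a drop-in replacement. It assumes Python 3.7+ and that a student's id never changes after the object is created. The isinstance check is an addition of mine for safety.

```python
class Student:
    def __init__(self, first, last, id):
        self.first = first
        self.last = last
        self.id = id

    def __eq__(self, other):
        # Let Python fall back to its default for non-Student objects
        if not isinstance(other, Student):
            return NotImplemented
        return self.id == other.id

    def __hash__(self):
        # Must agree with __eq__: equal students get the same hash
        return hash(self.id)


students = [
    Student("John", "Smith", 0),
    Student("Mary", "Wallace", 1),
    Student("Susie", "Ford", 2),
    Student("Mary", "Wallace", 1),
]

# Students are now hashable, so they can be used as dict keys
unique = list(dict.fromkeys(students))
unique_ids = [s.id for s in unique]  # [0, 1, 2]
```

The loop with "in" rescans the new list for every student, so this version may be noticeably faster on long lists. The tradeoff is that changing a student's id while it sits in a set or dict will break lookups.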

Tags

python, data pipeline, lists