Python: find a duplicate in a container efficiently
I have a container cont. If I want to find out whether it has any duplicates, I'll check len(cont) == len(set(cont)).

Suppose I want to find a duplicate element if one exists (just an arbitrary duplicate element). Is there a neat, efficient way to write that?
[Python 3]
OK, my first answer has gotten quite a bit of flak, so I thought I'd try a few different ways of doing it and report the differences. Here's the code.
import sys
import itertools

def getfirstdup(c, totest):
    # original idea using list slicing => 5.014 s
    if totest == '1':
        for i in range(0, len(c)):
            if c[i] in c[:i]:
                return c[i]
    # using 2 sets => 4.305 s
    elif totest == '2':
        s = set()
        for i in c:
            s2 = s.copy()
            s.add(i)
            if len(s) == len(s2):
                return i
    # using a dictionary LUT => 0.763 s
    elif totest == '3':
        d = {}
        for i in c:
            if i in d:
                return i
            else:
                d[i] = 1
    # using set operations => 0.772 s
    elif totest == '4':
        s = set()
        for i in c:
            if i in s:
                return i
            else:
                s.add(i)
    # sorting and walking => 5.130 s
    elif totest == '5':
        c = sorted(c)
        for i in range(1, len(c)):
            if c[i] == c[i - 1]:
                return c[i]
    # sorting and groupby-ing => 5.086 s
    else:
        c = sorted(c)
        for k, g in itertools.groupby(c):
            if len(list(g)) > 1:
                return k
    return None

c = list(range(0, 10000000))
c[5000] = 0
for i in range(0, 10):
    print(getfirstdup(c, sys.argv[1]))
Basically, I try it in six different ways, as listed in the source file. I used the Linux time command and collected the real-time runtimes, running the commands like so:

time python ./test.py 1

with 1 being the algorithm I wanted to try. Each algorithm looks for the first duplicate in 10,000,000 integers, and runs ten times. There is one duplicate in the list, which is "mostly sorted", though I did try reverse-sorted lists without noticing a proportional difference between the algorithms.
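The same kind of timing can also be collected in-process with the standard timeit module instead of the shell time command. A minimal sketch, using the set-based variant (algorithm 4 above) under a hypothetical name and a smaller list so it runs quickly; the numbers will not match the benchmark above:

```python
import timeit

def firstdup_set(c):
    # set-based variant: return the first duplicate seen, or None
    seen = set()
    for x in c:
        if x in seen:
            return x
        seen.add(x)
    return None

c = list(range(1_000_000))  # smaller list than the original benchmark
c[5000] = 0                 # plant a single duplicate, as in the test above

elapsed = timeit.timeit(lambda: firstdup_set(c), number=10)
print(f"10 runs: {elapsed:.3f} s")
```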
My original suggestion did poorly at 5.014 s. My understanding of icyrock.com's solution did poorly at 4.305 s. Next I tried using a dictionary to create a LUT, which gave the best runtime at 0.763 s. I tried using the in operator on sets, and got 0.772 s, nearly as good as the dictionary LUT. I tried sorting and walking the list, which gave a pitiful time of 5.130 s. Finally, I tried John Machin's suggestion of itertools, which gave a poor time of 5.086 s.
In summary, the dictionary LUT seems to be the way to go, with set operations (which may use LUTs in their implementation) being a close second.
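Note that the hash-based variants also answer the original question more directly than the len(cont) == len(set(cont)) check: they work on any iterable of hashable items (including generators) and stop at the first duplicate instead of consuming everything. A sketch, with the helper name being mine rather than from the code above:

```python
def first_duplicate(items):
    """Return an arbitrary (first-encountered) duplicate, or None."""
    seen = set()
    for item in items:
        if item in seen:
            return item
        seen.add(item)
    return None

print(first_duplicate([1, 2, 3, 2, 1]))       # → 2
print(first_duplicate(iter(range(5))))        # → None (works on iterators too)
```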
Update: I tried razpeitia's suggestion, and apart from the fact that you need to know precisely which duplicate key you're looking for, the actual algorithm did the worst by far (66.366 s).
Update 2: I'm sure somebody is going to say this test is biased because the duplicate's location is near one end of the list. Try running the code with the duplicate at a different location before downvoting, and report your results!
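To check that location bias directly, one can plant the duplicate at a few different positions and time each run. A rough sketch using the set-based variant under a hypothetical helper name and a smaller list than the original benchmark:

```python
import time

def firstdup_set(c):
    # set-based variant: return the first duplicate seen, or None
    seen = set()
    for x in c:
        if x in seen:
            return x
        seen.add(x)
    return None

n = 1_000_000
for pos in (5000, n // 2, n - 1):
    c = list(range(n))
    c[pos] = 0  # plant the single duplicate at a different spot each time
    t0 = time.perf_counter()
    result = firstdup_set(c)
    print(f"duplicate at {pos}: {time.perf_counter() - t0:.4f} s (found {result})")
```

The hash-based approaches should be fastest when the duplicate is early, since they short-circuit; the sorting-based ones pay the full sort cost regardless of where the duplicate sits.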