Python: Get unique tuples such that elements across all tuples are disjoint
There's a list of tuples:
all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]
What I wish to do: get tuples such that if an element has appeared in an accepted tuple, any other tuple containing that element is discarded (regardless of the position of the element in the tuple).
So there are several possible desired outputs:
[(1, 2), (3, 4), (6, 8), (7, 9)] or [(2, 1), (4, 3), (6, 8), (7, 9)]
and so on.
Originally, the first element of each tuple comes from one column of a pandas DataFrame, and the second element of each tuple comes from another column of the same DataFrame:
   c1  c2
0   1   2
1   2   1
2   3   4
3   3   5
4   4   3
5   6   8
6   6   7
7   7   9
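For reference, a minimal sketch of how I build the list of tuples from the two columns (assuming the DataFrame above is named df; the variable names are just for illustration):

import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 3, 4, 6, 6, 7],
                   'c2': [2, 1, 4, 5, 3, 8, 7, 9]})
all_tups = list(zip(df['c1'], df['c2']))  # [(1, 2), (2, 1), ..., (7, 9)]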
In the actual problem there are millions of rows in the DataFrame, so I am not looking for a for-loop based solution. Any approach that works on the DataFrame or on millions of tuples is fine, as long as it is not for-loop based.
What I have tried so far: I have been able to obtain the set of unique tuples using frozensets:
uniq_tups = {frozenset(k) for k in all_tups}
(admittedly using a comprehension, which I would ideally like to avoid). This gives me:
{frozenset({1, 2}), frozenset({6, 7}), frozenset({3, 5}), frozenset({3, 4}), frozenset({6, 8}), frozenset({7, 9})}
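For example (a quick check, just to illustrate why this alone is not enough): the frozensets remove reversed duplicates like (1, 2)/(2, 1), but pairs that merely share an element, such as {3, 4} and {3, 5}, both survive and still need to be filtered:

uniq_tups = {frozenset(k) for k in all_tups}
print(frozenset({1, 2}) in uniq_tups)                        # True: covers (1, 2) and (2, 1)
print({frozenset({3, 4}), frozenset({3, 5})} <= uniq_tups)   # True: the overlap remains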
I can't seem to find a non-for-loop way of making further progress towards the solution, or another approach that avoids looping.
I'm using Python 3.5, but have no issue with a Python 2.7 solution as well.
Thanks in advance for your inputs.
Here's a reasonably efficient way to do this in plain Python. We use the function not_seen to test whether a tuple contains elements that have already been seen in an accepted tuple.
all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]

def not_seen(t, seen=set()):
    # Reject the tuple if either element has already been accepted
    if t[0] in seen or t[1] in seen:
        return False
    seen.update(t)
    return True

unique = list(filter(not_seen, all_tups))
print(unique)
Output:
[(1, 2), (3, 4), (6, 8), (7, 9)]
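As a side note, since the data originally lives in a DataFrame, the same predicate can be used to build a boolean mask for row selection. A sketch, assuming a df with columns c1 and c2 as in the question and a not_seen whose seen set is still empty:

mask = [not_seen(t) for t in zip(df['c1'], df['c2'])]
filtered_df = df[mask]  # keeps rows 0, 2, 5, 7 for the sample data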
There's a slight problem with not_seen: it uses the mutable default argument seen to cache the seen elements, and that set cannot easily be cleared, so if you need to perform the operation again, seen will still hold the old elements. You could make seen a global instead, but that runs slower. Another option is to use a factory function to produce a clean version of not_seen each time you need one. E.g.:
def make_checker():
    def not_seen(t, seen=set()):
        if t[0] in seen or t[1] in seen:
            return False
        seen.update(t)
        return True
    return not_seen

not_seen = make_checker()
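For example, each call to make_checker returns a checker with its own fresh seen set, so repeated runs don't interfere with each other:

not_seen = make_checker()
first = list(filter(not_seen, all_tups))    # [(1, 2), (3, 4), (6, 8), (7, 9)]

not_seen = make_checker()                   # brand-new seen set
second = list(filter(not_seen, all_tups))   # same result, unaffected by the first run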
FWIW, here's a compact version of not_seen; it should be almost as efficient as the original, but I'd be surprised if it's faster. :)
def not_seen(t, seen=set()):
    # set.update returns None, so "seen.update(t) or True" updates the
    # cache and then evaluates to True
    return False if t[0] in seen or t[1] in seen else seen.update(t) or True
We can convert the compact version to a lambda, and then we don't have to worry about the problem of clearing the seen set: each time the statement is executed, a new lambda is created with its own fresh seen set.
all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]

unique = list(filter(lambda t, seen=set(): False if t[0] in seen or t[1] in seen else seen.update(t) or True, all_tups))
print(unique)
Here's a NumPy implementation. We first convert the data to a 2D NumPy array, then use not_seen with numpy.apply_along_axis to create a boolean array denoting which pairs should be accepted, and use that boolean array to select the desired pairs.
import numpy as np

def not_seen(t, seen=set()):
    if t[0] in seen or t[1] in seen:
        return False
    seen.update(t)
    return True

all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]
all_tups = np.array(all_tups)
print(all_tups, all_tups.dtype)
print('- ' * 20)

# Build a boolean mask of accepted rows, then select those rows
filtered = all_tups[np.apply_along_axis(not_seen, 1, all_tups)]
print(filtered)
Output:
[[1 2]
 [2 1]
 [3 4]
 [3 5]
 [4 3]
 [6 8]
 [6 7]
 [7 9]] int32
- - - - - - - - - - - - - - - - - - - -
[[1 2]
 [3 4]
 [6 8]
 [7 9]]
This should be faster than the plain Python implementation above: the looping process itself should be quicker, but the bottleneck is that we're still calling not_seen, a plain Python function, for every row. It also uses more RAM, because it has to construct the boolean array.
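As an aside, a hedged alternative sketch: the mask can also be built with numpy.fromiter over a generator instead of numpy.apply_along_axis. not_seen is still called once per row, so whether this is actually faster would need benchmarking:

# assumes all_tups is the 2D array and not_seen (with an empty seen set) from above
mask = np.fromiter((not_seen(t) for t in all_tups), dtype=bool, count=len(all_tups))
filtered = all_tups[mask]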
Update
It is possible to clear the seen set from outside the not_seen function. You can access a function's default arguments via its .__defaults__ attribute (or .func_defaults in old versions of Python 2; .__defaults__ works in Python 2.6, but not in 2.5). E.g.,
not_seen.__defaults__[0].clear()
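For example, to reset the cache between runs (using the plain-Python setup from the first snippet):

unique_1 = list(filter(not_seen, all_tups))   # first run fills the seen set
not_seen.__defaults__[0].clear()              # empty the cached set in place
unique_2 = list(filter(not_seen, all_tups))   # fresh run, same result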