list - Python: Get unique tuples such that elements across all tuples are disjoint -


there's list of tuples:

all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)] 

what wish do: tuples such if in tuple, element has appeared, other tuple element should discarded (regardless of position of element in tuple).

so, there several desired outputs possible:

[(1,2) (3,4) (6,8) (7,9)] or [(2,1) (4,3) (6,8) (7,9)]  

and on.

originally, first element of each tuple comes 1 column of pandas dataframe , second element of each tuple comes column of same dataframe.

c1 c2 0 1 2 1 2 1 2 3 4 3 3 5 4 4 3 5 6 8 6 6 7 7 7 9

in actual problem, there millions of rows in dataframe. therefore, not looking for-loop based solution. approach works on dataframe or millions of tuples fine except for-loop based solution.

what have tried far: have been able obtain list of unique tuples using frozen sets:

uniq_tups = {frozenset(k) k in all_tups} 

(admittedly using list comprehensions ideally avoid). gives me:

{frozenset({1, 2}),  frozenset({6, 7}),  frozenset({3, 5}),  frozenset({3, 4}),  frozenset({6, 8}),  frozenset({7, 9})} 

i can't seem non-for loop way of making progress solution, or use other approach avoids looping.

i'm using python 3.5, have no issues using python 2.7 solution well.

thanks in advance inputs.

here's reasonably efficient way in plain python. use function not_seen test if tuple contains elements have been seen in accepted tuple.

all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]  def not_seen(t, seen=set()):     if t[0] in seen or t[1] in seen:         return false     seen.update(t)     return true  unique = list(filter(not_seen, all_tups)) print(unique) 

output

[(1, 2), (3, 4), (6, 8), (7, 9)] 

there's slight problem not_seen: uses default mutable argument seen cache seen elements, , set cannot cleared, if need perform operation again seen still hold old elements. could make seen global instead, run slower. option use factory function produce clean version of seen each time need one. eg:

def make_checker():     def not_seen(t, seen=set()):         if t[0] in seen or t[1] in seen:             return false         seen.update(t)         return true     return not_seen  not_seen = make_checker() 

fwiw, here's compact version of not_seen; should almost efficient original, i'd surprised if it's faster. :)

def not_seen(t, seen=set()):     return false if t[0] in seen or t[1] in seen else seen.update(t) or true 

we can convert compact version lambda, , don't have worry problem of clearing seen set.

all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]  unique = list(filter(lambda t, seen=set():     false if t[0] in seen or t[1] in seen else seen.update(t) or true, all_tups)) print(unique) 

here numpy implementation. first convert data 2d numpy array. use not_seen numpy.apply_along_axis create numpy boolean array denoting pairs should accepted, , use boolean array select desired pairs.

import numpy np  def not_seen(t, seen=set()):     if t[0] in seen or t[1] in seen:         return false     seen.update(t)     return true  all_tups = [(1, 2), (2, 1), (3, 4), (3, 5), (4, 3), (6, 8), (6, 7), (7, 9)]  all_tups = np.array(all_tups) print(all_tups, all_tups.dtype) print('- ' * 20)  filtered = all_tups[np.apply_along_axis(not_seen, 1, all_tups)] print(filtered) 

output

[[1 2]  [2 1]  [3 4]  [3 5]  [4 3]  [6 8]  [6 7]  [7 9]] int32 - - - - - - - - - - - - - - - - - - - -  [[1 2]  [3 4]  [6 8]  [7 9]] 

this should faster plain python implementation above. looping process should faster, bottleneck we're still calling not_seen plain python function. also, uses more ram because has construct boolean array.


update

it is possible clear seen set outside not_seen function. can access function's default arguments via .__default__ attribute (or .func_defaults in old versions of python 2; .__default__ works in python 2.6, not in 2.5).

eg,

not_seen.__defaults__[0].clear() 

Comments

Popular posts from this blog

asynchronous - C# WinSCP .NET assembly: How to upload multiple files asynchronously -

aws api gateway - SerializationException in posting new Records via Dynamodb Proxy Service in API -

asp.net - Problems sending emails from forum -