python - Sum amount associated with item in a column of lists -


i have dataframe follows:

country   letter             keywords                 amount    c         y     ['fruits', 'apples', "banana"]       700    c         y     ["music", "dance", "banana"]         150    c         y     ['loud', "dance", "apples"]          350 

i'd sum amount associated each keyword. note: country , letters not same, in contrived data above. also, list of keywords vary in size.

i've tried several solutions. i've attached fastest 1 below. i've tried solutions apply , defaultdicts...

keywords_list = [] in zip(*[df[c] c in df.columns]):     data = list(i[0:2])     k in i[2]:         row = [k] + data + [i[-1]]         keywords_list.append(row)  df_expanded = pd.dataframe(keywords_list) df_expanded.groupby(list(range(3)))[3].sum().reset_index() 

goal

  country letter keywords  amount 0       c      y   apples    1050 1       c      y   banana     850 2       c      y    dance     500 3       c      y   fruits     700 4       c      y     loud     350 5       c      y    music     150 

edit: corrected error in example of goal


data

country = list("ccc") letters = list("yyy") keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]] amount = [700, 150, 350]  df = pd.dataframe({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount}) df = df[['country', 'letter', 'keywords', 'amount']] 

you can use:

df1 = pd.dataframe(df.keywords.values.tolist())        .stack()        .reset_index(level=1, drop=true)        .rename('keywords') print (df1) 0    fruits 0    apples 0    banana 1     music 1     dance 1    banana 2      loud 2     dance 2    apples name: keywords, dtype: object  print (df.drop('keywords', axis=1).join(df1).reset_index(drop=true))   country letter  amount keywords 0       c      y     700   fruits 1       c      y     700   apples 2       c      y     700   banana 3       c      y     150    music 4       c      y     150    dance 5       c      y     150   banana 6       c      y     350     loud 7       c      y     350    dance 8       c      y     350   apples 

another solution:

df = df.set_index(['country','letter','amount']) df1 = pd.dataframe(df.keywords.values.tolist(), index = df.index) \         .stack() \         .reset_index(name='keywords') \         .drop('level_3',axis=1) print (df1)   country letter  amount keywords 0       c      y     700   fruits 1       c      y     700   apples 2       c      y     700   banana 3       c      y     150    music 4       c      y     150    dance 5       c      y     150   banana 6       c      y     350     loud 7       c      y     350    dance 8       c      y     350   apples 

then need groupby aggregating sum:

print (df.groupby(['country','letter','keywords'], as_index=false)['amount'].sum())   country letter keywords  amount 0       c      y   apples    1050 1       c      y   banana     850 2       c      y    dance     500 3       c      y   fruits     700 4       c      y     loud     350 5       c      y    music     150 

timings:

in [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.series).stack().reset_index().drop('level_3',1)) 1 loop, best of 3: 4.55 s per loop  in [48]: %timeit (jez1(df3)) 10 loops, best of 3: 24.8 ms per loop  in [49]: %timeit (jez2(df3)) 10 loops, best of 3: 29.7 ms per loop 

code timings:

df = pd.concat([df]*10000).reset_index(drop=true) df3 = df.copy() df4 = df.copy()                  def jez1(df):     df1 = pd.dataframe(df.keywords.values.tolist()).stack().reset_index(level=1, drop=true).rename('keywords')     return df.drop('keywords', axis=1).join(df1).reset_index(drop=true)  def jez2(df):     df = df.set_index(['country','letter','amount'])     df1 = pd.dataframe(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)     return df1 

thank maxu improvement pop - drop not necessary. unfortunately timing failed (keyerror: 'keywords'), cannot compare it.


Comments

Popular posts from this blog

asynchronous - C# WinSCP .NET assembly: How to upload multiple files asynchronously -

aws api gateway - SerializationException in posting new Records via Dynamodb Proxy Service in API -

asp.net - Problems sending emails from forum -