python - Sum amount associated with item in a column of lists -
i have dataframe follows:
country letter keywords amount c y ['fruits', 'apples', "banana"] 700 c y ["music", "dance", "banana"] 150 c y ['loud', "dance", "apples"] 350
i'd sum amount associated each keyword. note: country
, letters
not same, in contrived data above. also, list of keywords
vary in size.
i've tried several solutions. i've attached fastest 1 below. i've tried solutions apply
, defaultdicts
...
keywords_list = [] in zip(*[df[c] c in df.columns]): data = list(i[0:2]) k in i[2]: row = [k] + data + [i[-1]] keywords_list.append(row) df_expanded = pd.dataframe(keywords_list) df_expanded.groupby(list(range(3)))[3].sum().reset_index()
goal
country letter keywords amount 0 c y apples 1050 1 c y banana 850 2 c y dance 500 3 c y fruits 700 4 c y loud 350 5 c y music 150
edit: corrected error in example of goal
data
country = list("ccc") letters = list("yyy") keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]] amount = [700, 150, 350] df = pd.dataframe({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount}) df = df[['country', 'letter', 'keywords', 'amount']]
you can use:
df1 = pd.dataframe(df.keywords.values.tolist()) .stack() .reset_index(level=1, drop=true) .rename('keywords') print (df1) 0 fruits 0 apples 0 banana 1 music 1 dance 1 banana 2 loud 2 dance 2 apples name: keywords, dtype: object print (df.drop('keywords', axis=1).join(df1).reset_index(drop=true)) country letter amount keywords 0 c y 700 fruits 1 c y 700 apples 2 c y 700 banana 3 c y 150 music 4 c y 150 dance 5 c y 150 banana 6 c y 350 loud 7 c y 350 dance 8 c y 350 apples
another solution:
df = df.set_index(['country','letter','amount']) df1 = pd.dataframe(df.keywords.values.tolist(), index = df.index) \ .stack() \ .reset_index(name='keywords') \ .drop('level_3',axis=1) print (df1) country letter amount keywords 0 c y 700 fruits 1 c y 700 apples 2 c y 700 banana 3 c y 150 music 4 c y 150 dance 5 c y 150 banana 6 c y 350 loud 7 c y 350 dance 8 c y 350 apples
then need groupby
aggregating sum
:
print (df.groupby(['country','letter','keywords'], as_index=false)['amount'].sum()) country letter keywords amount 0 c y apples 1050 1 c y banana 850 2 c y dance 500 3 c y fruits 700 4 c y loud 350 5 c y music 150
timings:
in [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.series).stack().reset_index().drop('level_3',1)) 1 loop, best of 3: 4.55 s per loop in [48]: %timeit (jez1(df3)) 10 loops, best of 3: 24.8 ms per loop in [49]: %timeit (jez2(df3)) 10 loops, best of 3: 29.7 ms per loop
code timings:
df = pd.concat([df]*10000).reset_index(drop=true) df3 = df.copy() df4 = df.copy() def jez1(df): df1 = pd.dataframe(df.keywords.values.tolist()).stack().reset_index(level=1, drop=true).rename('keywords') return df.drop('keywords', axis=1).join(df1).reset_index(drop=true) def jez2(df): df = df.set_index(['country','letter','amount']) df1 = pd.dataframe(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1) return df1
thank maxu
improvement pop
- drop
not necessary. unfortunately timing
failed (keyerror: 'keywords'
), cannot compare it.
Comments
Post a Comment