python - A Faster Way of Removing Unused Categories in Pandas?
I'm running some models in Python, with the data subset on categories.
For memory usage and preprocessing, all the categorical variables are stored as the category data type.
For each level of the categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because there are not as many levels to drop).
Here is a simplified example:
    import itertools
    import pandas as pd

    # generate fake data
    alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    keywords = [''.join(i) for i in itertools.product(alphabets, repeat=2)]
    z = pd.DataFrame({'x': keywords})

    # convert to category datatype
    z.x = z.x.astype('category')

    # groupby
    z = z.groupby('x')

    # loop on groups
    for i in z.groups:
        x = z.get_group(i)
        x.x = x.x.cat.remove_unused_categories()
        # run fancy model here
On my laptop, this takes 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 lines per group.
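A minimal sketch of that str round-trip, assuming a small toy series. The helper name `shrink_categories` and the sample values are hypothetical and only there for illustration:

```python
import pandas as pd


def shrink_categories(series: pd.Series) -> pd.Series:
    """Rebuild a categorical so it only lists the values actually present.

    Hypothetical helper: it goes through str, so missing values would
    need separate handling before using it on real data.
    """
    return series.astype(str).astype("category")


s = pd.Series(["aa", "ab", "ba"], dtype="category")

# Boolean filtering keeps all three categories on the subset.
subset = s[s == "ab"]

# The round-trip leaves only the category that is still in use.
shrunk = shrink_categories(subset)
```

Whether this is faster than .cat.remove_unused_categories() depends on the group sizes, so it is worth timing on real data before relying on it.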
Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i), which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
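For reference, a small sketch of how those two alternatives behave, assuming a toy series rather than the real data. It uses `rename_categories` as a stand-in for assigning to `.cat.categories`, on the assumption that both enforce the same "one new label per existing category" rule:

```python
import pandas as pd

s = pd.Series(["aa", "ab", "ba"], dtype="category")
subset = s[s == "ab"]

# set_categories takes a list-like of the categories to keep;
# any value not listed would become NaN.
narrowed = subset.cat.set_categories(["ab"])

# Renaming needs exactly as many new labels as there are existing
# categories, so passing a single label for three categories fails.
try:
    subset.cat.rename_categories(["ab"])
    rename_failed = False
except ValueError:
    rename_failed = True
```

This matches the symptom described above: set_categories does the job, while replacing the category labels is a rename and not a way to drop levels.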
Your problem is in assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change:
    for i in z.groups:
        x = z.get_group(i).copy()  # no longer tied to z
        x.x = x.x.cat.remove_unused_categories()
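A self-contained sketch of that change on a smaller alphabet, so it runs quickly. The `observed=True` argument, the `grouped` and `remaining` names, and the use of `x["x"]` instead of `x.x` are additions for illustration, not part of the original code:

```python
import itertools
import pandas as pd

# Smaller fake data set than in the question (assumption: 3 letters).
letters = ["a", "b", "c"]
keywords = ["".join(p) for p in itertools.product(letters, repeat=2)]
z = pd.DataFrame({"x": keywords})
z["x"] = z["x"].astype("category")

# observed=True only iterates over categories that occur in the data.
grouped = z.groupby("x", observed=True)

remaining = {}
for i in grouped.groups:
    x = grouped.get_group(i).copy()  # independent copy of the group
    x["x"] = x["x"].cat.remove_unused_categories()
    remaining[i] = list(x["x"].cat.categories)
    # run fancy model here
```

After the loop, each group's column should list only its own level. How much time the copy saves will depend on the pandas version and the data, so it is worth profiling both variants.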