python - A Faster Way of Removing Unused Categories in Pandas?


I'm running some models in Python, with my data subset on categories.

For memory usage and preprocessing, all of the categorical variables are stored as the category data type.

For each level of the categorical variable in my 'group by' column, I am running a regression, and I need to reset all my categorical variables to the levels that are actually present in that subset.

I am doing this using .cat.remove_unused_categories(), which is taking 50% of my total runtime. At the moment, the worst offender is my grouping column; the others are not taking as much time (I guess because there are not as many levels to drop).
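To make the behaviour in question concrete, here is a minimal sketch of what .cat.remove_unused_categories() does on a subset. The three-row Series and the variable names are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical toy data: three category levels, one row each.
s = pd.Series(["aa", "ab", "ac"], dtype="category")

# Subsetting rows keeps every original level in the categorical dtype.
subset = s[s == "aa"]
levels_before = list(subset.cat.categories)

# remove_unused_categories returns a new Series whose dtype only
# lists the levels that actually occur in the data.
trimmed = subset.cat.remove_unused_categories()
levels_after = list(trimmed.cat.categories)

print(levels_before)
print(levels_after)
```

The subset still carries all three levels until they are trimmed, which is why the reset is needed before fitting a model per group.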

Here is a simplified example:

import itertools
import pandas as pd

# generate fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat=2)]
z = pd.DataFrame({'x': keywords})

# convert to category datatype
z.x = z.x.astype('category')

# groupby
z = z.groupby('x')

# loop over the groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    # run fancy model here

On my laptop, this takes about 20 seconds. For this small example, I could convert to str and then back to category for a speed-up, but my real data has at least 300 lines per group.

Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i), which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
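For readers unfamiliar with the two alternatives, the sketch below contrasts them on made-up data. It uses rename_categories as a stand-in for assigning to .cat.categories, on the assumption that both enforce the same "same number of categories" check; the Series and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical subset: one observed level, two declared levels.
s = pd.Series(["aa", "aa"], dtype=pd.CategoricalDtype(["aa", "ab"]))

# set_categories replaces the list of levels outright; any value
# missing from the new list would become NaN.
narrowed = s.cat.set_categories(["aa"])

# rename_categories with a list only relabels existing levels, so the
# new list must have as many entries as the old one.
try:
    s.cat.rename_categories(["aa"])
    rename_error = None
except ValueError as exc:
    rename_error = str(exc)

print(list(narrowed.cat.categories))
print(rename_error)
```

Renaming cannot shrink the level list, so of the two only set_categories is a candidate replacement for remove_unused_categories.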

Your problem is that you are assigning z.get_group(i) to x, so x is a copy of a portion of z. Your code will work fine with this change:

for i in z.groups:
    x = z.get_group(i).copy()  # no longer tied to z
    x.x = x.x.cat.remove_unused_categories()
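As a rough check of the copy-then-trim pattern, the sketch below runs it on a smaller alphabet and records the levels left in each group. It iterates over the groupby object directly instead of calling get_group, which is a substitution made here for brevity rather than part of the answer, and it makes no claim about timing. The names df and levels_per_group are hypothetical.

```python
import itertools
import pandas as pd

# Smaller fake data set: 3 letters give 9 two-letter keywords.
letters = ["a", "b", "c"]
keywords = ["".join(p) for p in itertools.product(letters, repeat=2)]
df = pd.DataFrame({"x": keywords}).astype({"x": "category"})

levels_per_group = {}
for key, group in df.groupby("x", observed=True):
    group = group.copy()  # work on an independent frame, not a slice
    group["x"] = group["x"].cat.remove_unused_categories()
    levels_per_group[key] = list(group["x"].cat.categories)
    # run fancy model here

print(levels_per_group)
```

After trimming, each group should carry only its own level, so a model fitted per group sees no empty categories.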
