Python - A better way to aggregate data and keep table structure and column names with Pandas
Suppose I have a dataset like the following:

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

df
  x1     x2  x3
0  a   True   1
1  a   True   1
2  b   True   1
3  b  False   1

I want to perform a groupby-aggregate operation: group by multiple columns and apply multiple functions to one column. Furthermore, I don't want a multi-indexed, multi-level table as the result. Accomplishing this takes me three lines of code, which seems excessive.
For example:

bg = df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}})
bg.columns = bg.columns.droplevel(0)
bg.reset_index()

Is there a better way? Not to gripe, but I'm coming from an R/data.table background, where this is a nice one-liner:

df[, list(my_sum=sum(x3), my_mean=mean(x3)), by=list(x1, x2)]
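For readers on newer pandas (0.25+), named aggregation gives exactly this kind of one-liner with flat, custom-named columns; note that the nested-dict agg syntax used in this question was deprecated in pandas 0.20 and removed in 1.0. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

# Named aggregation: keyword = (source column, aggregation function).
# as_index=False keeps x1/x2 as ordinary columns, so no reset_index is needed.
out = df.groupby(['x1', 'x2'], as_index=False).agg(my_sum=('x3', 'sum'),
                                                   my_mean=('x3', 'mean'))
print(out)
```

The result has a plain (single-level) column index: x1, x2, my_sum, my_mean.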
You can use @happy01's answer, but instead of as_index=False, add reset_index() at the end:
In [1331]: df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
Out[1331]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1

Benchmarking shows that the reset_index() version works faster:

In [1333]: %timeit df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean})
100 loops, best of 3: 3.18 ms per loop

In [1334]: %timeit df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
100 loops, best of 3: 2.82 ms per loop

You can also get the same result in one line: transpose the DataFrame, call reset_index to drop the x3 column level (level 0), then transpose and reset_index again to achieve the desired output:
In [1374]: df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
Out[1374]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1

But it works slower:

In [1375]: %timeit df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
100 loops, best of 3: 5.13 ms per loop
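The nested-dict agg used in these timings no longer runs on pandas 1.0+, but the transpose trick itself still applies to any result with MultiIndex columns. A sketch of the same flattening on current pandas, using the still-supported list syntax (the flat column names then come from the function names rather than custom labels):

```python
import pandas as pd

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

# List syntax produces MultiIndex columns: ('x3', 'sum') and ('x3', 'mean').
bg = df.groupby(['x1', 'x2']).agg({'x3': ['sum', 'mean']})

# Transpose, drop the 'x3' level from the (now row) index, transpose back,
# then move the group keys out of the index into ordinary columns.
flat = bg.T.reset_index(level=0, drop=True).T.reset_index()
print(flat)
```

One caveat of the round-trip transpose: mixing int and float aggregates can upcast everything to float, which is another reason to prefer a direct column rename or named aggregation.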