Python - A better way to aggregate data and keep table structure and column names with Pandas
Suppose I have a dataset like the following:

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

df
  x1     x2  x3
0  a   True   1
1  a   True   1
2  b   True   1
3  b  False   1

I want to perform a groupby-aggregate operation: group by multiple columns and apply multiple functions to one column. Furthermore, I don't want a multi-indexed, multi-level table as the result. Accomplishing this takes me three lines of code, which seems excessive.
For example:

bg = df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}})
bg.columns = bg.columns.droplevel(0)
bg.reset_index()

Is there a better way? Not to gripe, but I'm coming from an R/data.table background, where this is a nice one-liner:

df[, list(my_sum=sum(x3), my_mean=mean(x3)), by=list(x1, x2)]
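For readers on newer pandas (0.25+), named aggregation gives exactly this kind of one-liner with flat, custom-named columns; note that the nested-dict agg syntax used in this question was deprecated in pandas 0.20 and removed in 1.0. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

# Named aggregation: keyword = (source column, aggregation function).
# as_index=False keeps x1/x2 as ordinary columns, so no reset_index is needed.
out = df.groupby(['x1', 'x2'], as_index=False).agg(my_sum=('x3', 'sum'),
                                                   my_mean=('x3', 'mean'))
print(out)
```

The result has a plain (single-level) column index: x1, x2, my_sum, my_mean.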
You can use @happy01's answer, but instead of as_index=False, add reset_index() at the end:
In [1331]: df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
Out[1331]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1

Benchmarking shows that the reset_index() version works faster:

In [1333]: %timeit df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean})
100 loops, best of 3: 3.18 ms per loop

In [1334]: %timeit df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
100 loops, best of 3: 2.82 ms per loop

You can also get the same result in one line: transpose the DataFrame, call reset_index to drop the x3 column level (level 0), then transpose and reset_index again to achieve the desired output:
In [1374]: df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
Out[1374]:
  x1     x2  my_mean  my_sum
0  a   True        1       2
1  b  False        1       1
2  b   True        1       1

But it works slower:

In [1375]: %timeit df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
100 loops, best of 3: 5.13 ms per loop
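The nested-dict agg used in these timings no longer runs on pandas 1.0+, but the transpose trick itself still applies to any result with MultiIndex columns. A sketch of the same flattening on current pandas, using the still-supported list syntax (the flat column names then come from the function names rather than custom labels):

```python
import pandas as pd

df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'],
                   'x2': [True, True, True, False],
                   'x3': [1, 1, 1, 1]})

# List syntax produces MultiIndex columns: ('x3', 'sum') and ('x3', 'mean').
bg = df.groupby(['x1', 'x2']).agg({'x3': ['sum', 'mean']})

# Transpose, drop the 'x3' level from the (now row) index, transpose back,
# then move the group keys out of the index into ordinary columns.
flat = bg.T.reset_index(level=0, drop=True).T.reset_index()
print(flat)
```

One caveat of the round-trip transpose: mixing int and float aggregates can upcast everything to float, which is another reason to prefer a direct column rename or named aggregation.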