python - How to count non NaN values accross columns in pandas dataframe? -


my data looks this:

            close     b   c   d   e   time     2015-12-03  2051.25 5   4   3   1   1   05:00:00     2015-12-04  2088.25 5   4   3   1   nan 06:00:00 2015-12-07  2081.50 5   4   3   nan nan 07:00:00 2015-12-08  2058.25 5   4   nan nan nan 08:00:00 2015-12-09  2042.25 5   nan nan nan nan 09:00:00 

i need count 'horizontally' values in columns ['a'] ['e'] not nan. outcome this:

df['count'] = ..... df              close     b   c   d   e   time     count 2015-12-03  2051.25 5   4   3   1   1   05:00:00 5   2015-12-04  2088.25 5   4   3   1   nan 06:00:00 4 2015-12-07  2081.50 5   4   3   nan nan 07:00:00 3 2015-12-08  2058.25 5   4   nan nan nan 08:00:00 2 2015-12-09  2042.25 5   nan nan nan nan 09:00:00 1 

thanks

you can subselect df , call count passing axis=1:

in [24]: df['count'] = df[list('abcde')].count(axis=1) df  out[24]:               close    b   c   d   e      time  count 2015-12-03  2051.25  5   4   3   1   1  05:00:00      5 2015-12-04  2088.25  5   4   3   1 nan  06:00:00      4 2015-12-07  2081.50  5   4   3 nan nan  07:00:00      3 2015-12-08  2058.25  5   4 nan nan nan  08:00:00      2 2015-12-09  2042.25  5 nan nan nan nan  09:00:00      1 

timings

in [25]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) %timeit df.drop(['close', 'time'], axis=1).count(axis=1) %timeit df[list('abcde')].count(axis=1)  100 loops, best of 3: 3.28 ms per loop 100 loops, best of 3: 2.76 ms per loop 100 loops, best of 3: 2.98 ms per loop 

apply slowest not surprise, drop version marginally faster semantically prefer passing list of cols of interest , calling count readability

hmm keep getting varying timings now:

in [27]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) %timeit df.drop(['close', 'time'], axis=1).count(axis=1) %timeit df[list('abcde')].count(axis=1) %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)  100 loops, best of 3: 3.33 ms per loop 100 loops, best of 3: 2.7 ms per loop 100 loops, best of 3: 2.7 ms per loop 100 loops, best of 3: 2.57 ms per loop 

more timings

in [160]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) %timeit df.drop(['close', 'time'], axis=1).count(axis=1) %timeit df[list('abcde')].count(axis=1) %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1) %timeit df[list('abcde')].notnull().sum(axis=1)   1000 loops, best of 3: 1.4 ms per loop 1000 loops, best of 3: 1.14 ms per loop 1000 loops, best of 3: 1.11 ms per loop 1000 loops, best of 3: 1.11 ms per loop 1000 loops, best of 3: 1.05 ms per loop 

it seems testing notnull , summing (as notnull produce boolean mask) quicker on dataset

on 50k row df last method quicker:

in [172]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1) %timeit df.drop(['close', 'time'], axis=1).count(axis=1) %timeit df[list('abcde')].count(axis=1) %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1) %timeit df[list('abcde')].notnull().sum(axis=1)   1 loops, best of 3: 5.83 s per loop 100 loops, best of 3: 6.15 ms per loop 100 loops, best of 3: 6.49 ms per loop 100 loops, best of 3: 6.04 ms per loop 

Comments

Popular posts from this blog

how to insert data php javascript mysql with multiple array session 2 -

multithreading - Exception in Application constructor -

windows - CertCreateCertificateContext returns CRYPT_E_ASN1_BADTAG / 8009310b -