python - How to count non-NaN values across columns in a pandas DataFrame?
My data looks like this:

```
              close  a    b    c    d    e      time
2015-12-03  2051.25  5    4    3    1    1  05:00:00
2015-12-04  2088.25  5    4    3    1  nan  06:00:00
2015-12-07  2081.50  5    4    3  nan  nan  07:00:00
2015-12-08  2058.25  5    4  nan  nan  nan  08:00:00
2015-12-09  2042.25  5  nan  nan  nan  nan  09:00:00
```

I need to count the non-NaN values 'horizontally' across columns ['a'] to ['e']. The outcome should look like this:

```
df['count'] = .....

df
              close  a    b    c    d    e      time  count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  nan  06:00:00      4
2015-12-07  2081.50  5    4    3  nan  nan  07:00:00      3
2015-12-08  2058.25  5    4  nan  nan  nan  08:00:00      2
2015-12-09  2042.25  5  nan  nan  nan  nan  09:00:00      1
```

Thanks
You can subselect from df and call count, passing axis=1:
```
In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df

Out[24]:
              close  a    b    c    d    e      time  count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  nan  06:00:00      4
2015-12-07  2081.50  5    4    3  nan  nan  07:00:00      3
2015-12-08  2058.25  5    4  nan  nan  nan  08:00:00      2
2015-12-09  2042.25  5  nan  nan  nan  nan  09:00:00      1
```

Timings:
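A self-contained sketch of the above; the frame is rebuilt from the question's sample data (the DatetimeIndex and string `time` column are assumptions about the original layout):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question's data
df = pd.DataFrame(
    {
        "close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

# count(axis=1) counts non-NaN entries row-wise over the selected columns only,
# so 'close' and 'time' never enter the count
df["count"] = df[list("abcde")].count(axis=1)
print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```

Selecting the columns first is what keeps `close` and `time` (which are never NaN) from inflating the row counts.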
```
In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['close', 'time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)

100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop
```

The apply version being the slowest is no surprise; the drop version is marginally faster here, but semantically I prefer passing a list of the columns of interest and calling count, for readability.
Hmm, I keep getting varying timings now:
```
In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['close', 'time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)

100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop
```

More timings:
```
In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['close', 'time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)

1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop
```

It seems that testing notnull and summing (as notnull produces a boolean mask) is quicker on this dataset.
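A minimal sketch of the boolean-mask approach, using a small frame with hypothetical values: `notnull()` yields a boolean DataFrame, and summing along axis=1 counts the True entries per row, which is exactly what `count(axis=1)` reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaNs scattered through the columns
df = pd.DataFrame({"a": [1, np.nan], "b": [2, 3], "c": [np.nan, np.nan]})

# notnull() gives a boolean mask; True sums as 1, so axis=1 counts per row
mask_counts = df[list("abc")].notnull().sum(axis=1)

# count(axis=1) also skips NaN, so the two agree element for element
assert mask_counts.tolist() == df[list("abc")].count(axis=1).tolist()
print(mask_counts.tolist())  # [2, 1]
```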
On a 50k-row df the last method is quicker:
```
In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['close', 'time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)

1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
```
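The variants being timed can at least be checked for agreement on a larger frame. This is a sketch with hypothetical random data (actual timings will vary by machine and pandas version, so only correctness is verified here; the slow row-wise apply is left out):

```python
import numpy as np
import pandas as pd

# Hypothetical 50k-row frame with NaNs scattered through columns a-e
rng = np.random.default_rng(0)
values = rng.random((50_000, 5))
values[rng.random((50_000, 5)) < 0.3] = np.nan
df = pd.DataFrame(values, columns=list("abcde"))
df["close"] = rng.random(50_000)

# The three vectorised variants agree row for row
by_select = df[list("abcde")].count(axis=1)
by_drop = df.drop(["close"], axis=1).count(axis=1)
by_mask = df[list("abcde")].notnull().sum(axis=1)

assert by_select.equals(by_drop)
assert by_select.equals(by_mask)
```

Whichever variant wins on a given machine, they all produce the same integer Series, so the choice can safely be made on readability grounds.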