performance - Python: faster operation for indexing
I have the following snippet that extracts the indices of all unique values (hashable) in sequence-like data with canonical indices and stores them in a dictionary as lists:
```python
from collections import defaultdict

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
```

This looks to me like quite a common use case, and it happens that 90% of the execution time of my code is spent in these few lines. This part is passed through over 10000 times during execution, and len(data) is around 50000 to 100000 each time it is run. The number of unique elements ranges from roughly 50 to 150.
Is there a faster way, perhaps vectorized/C-extended (e.g. numpy or pandas methods), that achieves the same thing?
Many thanks.
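For concreteness, here is a minimal sketch (with a small hypothetical `data` list, not my real data) of the input and the dict-of-lists output the snippet produces:

```python
from collections import defaultdict

data = ['a', 'b', 'a', 'c', 'b', 'a']  # hypothetical sample input

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)

print(dict(idx_lists))  # {'a': [0, 2, 5], 'b': [1, 4], 'c': [3]}
```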
I found this question pretty interesting, and while I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was faster than them.
```python
import numpy as np
import pandas as pd
from collections import defaultdict

data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)

def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort()  # sorted indices
    # sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # find the locations where the values change, and include the first and last values
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # the unique values
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)
    # return the unique values along with the indices associated with each value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
            for i in range(num_values)}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(input_data):
    return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

def by_pand2(input_data):
    return series.groupby(input_data).indices

def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}

def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
```

The resulting timings (from fastest to slowest):
```
>>> %timeit get_values_and_indicies(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
```

It should be noted that, unlike by_pand2, the result is a dict of lists as given in the example. If you prefer to return a defaultdict, you can change the last line to

```python
return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist())
                          for i in range(num_values)))
```

which increased the overall timing in my tests to 4.4 ms.
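As a quick sanity check (my addition, not part of the original benchmark), one can verify on a small deterministic array that the sort-based function produces the same dict of lists as the plain-Python baseline. Note that I pass `kind="stable"` to argsort here, unlike the plain `argsort()` above, so that indices within each run of equal values keep their original order:

```python
import numpy as np
from collections import defaultdict

def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    # stable sort so equal values keep their original relative order
    sorted_indices = input_data.argsort(kind="stable")
    sorted_data = input_data[sorted_indices]
    run_endpoints = np.concatenate(
        ([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    unique_vals = sorted_data[run_endpoints[:-1]]
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
            for i in range(len(unique_vals))}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

data = np.array([3, 1, 3, 2, 1, 3])
assert get_values_and_indicies(data) == dict(by_dd(data))
print(get_values_and_indicies(data))  # {1: [1, 4], 2: [3], 3: [0, 2, 5]}
```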
Lastly, I should note that these timings are data sensitive. When I used only 10 different values I got:
- get_values_and_indicies: 4.34 ms per loop
- data_to_idxlists_unique: 4.42 ms per loop
- by_pand2: 4.83 ms per loop
- data_to_idxlists: 6.09 ms per loop
- by_pand1: 9.39 ms per loop
- by_dd: 22.4 ms per loop
While if I used 10,000 different values I got:
- get_values_and_indicies: 7.00 ms per loop
- data_to_idxlists_unique: 14.8 ms per loop
- by_dd: 29.8 ms per loop
- by_pand2: 47.7 ms per loop
- by_pand1: 67.3 ms per loop
- data_to_idxlists: 869 ms per loop
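As a small aside on the pandas-based variants (my illustration, not from the original answer): `Series.groupby(...).indices` returns a dict mapping each group value to a numpy array of integer positions, which is why by_pand2's output is not directly a dict of lists:

```python
import pandas as pd

series = pd.Series([7, 3, 7, 7, 3])
# group the series by its own values; .indices maps value -> positional index array
indices = series.groupby(series.values).indices

assert set(indices) == {3, 7}
assert indices[7].tolist() == [0, 2, 3]
assert indices[3].tolist() == [1, 4]
```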