performance - Python: faster operation for indexing


I have the following snippet, which extracts the indices of the unique values (hashable) in sequence-like data — the canonical indices — and stores them in a dictionary as lists:

```python
from collections import defaultdict

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
```
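For reference, on a small toy input (my own example, not from the question) the snippet produces one ascending index list per distinct value:

```python
from collections import defaultdict

data = ["a", "b", "a", "c", "b", "a"]  # toy input with three distinct values

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)

print(dict(idx_lists))  # → {'a': [0, 2, 5], 'b': [1, 4], 'c': [3]}
```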

This looks to me like quite a common use case, and it happens that 90% of the execution time of my code is spent in these few lines. This part is passed through over 10,000 times during execution, and len(data) is around 50,000 to 100,000 each time it is run. The number of unique elements ranges from roughly 50 to 150.

Is there a faster way, perhaps vectorized/C-extended (e.g. numpy or pandas methods), that achieves the same thing?

many many thanks.

I found this question pretty interesting, and while I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was faster than them.

```python
import numpy as np
import pandas as pd
from collections import defaultdict

data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)

def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort()  # sorted indices
    # sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # find the locations where the values change and include the first and last positions
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # the unique values
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)
    # return the unique values along with the indices associated with each value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
            for i in range(num_values)}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(input_data):
    return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

def by_pand2(input_data):
    return series.groupby(input_data).indices

def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {val: np.nonzero(ixs == i) for i, val in enumerate(u)}

def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
```
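As a sanity check (my own addition, not part of the original answer), the argsort/run-endpoint approach can be verified against the plain-loop baseline on random data; a condensed, self-contained sketch with a stable sort so indices within each run stay in ascending order:

```python
import numpy as np
from collections import defaultdict

def get_values_and_indices(input_data):
    # condensed version of the argsort/run-endpoint idea from the answer;
    # kind="stable" keeps indices within each value's run in ascending order
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort(kind="stable")
    sorted_data = input_data[sorted_indices]
    run_endpoints = np.concatenate(
        ([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    unique_vals = sorted_data[run_endpoints[:-1]]
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
            for i in range(len(unique_vals))}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=10_000)
assert dict(by_dd(data)) == get_values_and_indices(data)  # both give value -> ascending index list
```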

The resulting timings (from fastest to slowest):

```
>>> %timeit get_values_and_indicies(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
```

And it should be noted that, unlike by_pand2, this results in a dict of lists as given in the example. If you prefer to return a defaultdict you can change the last line to `return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values)))`, which increased the overall timing in my tests to 4.4 ms.

Lastly, I should note that these timings are data sensitive. When I used only 10 different values I got:

  1. get_values_and_indicies: 4.34 ms per loop
  2. data_to_idxlists_unique: 4.42 ms per loop
  3. by_pand2: 4.83 ms per loop
  4. data_to_idxlists: 6.09 ms per loop
  5. by_pand1: 9.39 ms per loop
  6. by_dd: 22.4 ms per loop

While if I used 10,000 different values I got:

  1. get_values_and_indicies: 7.00 ms per loop
  2. data_to_idxlists_unique: 14.8 ms per loop
  3. by_dd: 29.8 ms per loop
  4. by_pand2: 47.7 ms per loop
  5. by_pand1: 67.3 ms per loop
  6. data_to_idxlists: 869 ms per loop
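The blowup of data_to_idxlists at 10,000 unique values is expected from its structure: `np.nonzero(ixs == i)` rescans the whole inverse array once per unique value, giving O(n·k) work, while the sort-based variants split the result of a single argsort. A minimal single-pass sketch of that idea using `np.split` (my own variation, not one of the timed functions):

```python
import numpy as np

def idxlists_single_pass(input_data):
    # sort once, then split the sorted index array at value boundaries,
    # instead of scanning the full array once per unique value
    input_data = np.asarray(input_data)
    order = np.argsort(input_data, kind="stable")
    uniques, starts = np.unique(input_data[order], return_index=True)
    return dict(zip(uniques, np.split(order, starts[1:])))

data = np.array([3, 1, 3, 2, 1, 3])
result = idxlists_single_pass(data)
print({int(k): v.tolist() for k, v in result.items()})
# → {1: [1, 4], 2: [3], 3: [0, 2, 5]}
```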
