OOP in Python (related to Scrapy)
The question is how to share data between objects in a safe, maintainable manner.
Example: I've built a Scrapy application that spawns numerous spiders. Although each spider is connected to a separate pipeline object, I need to compare and sort data between the different pipelines (e.g. I need the outputs sorted by different item attributes: price, date, etc.), so I need a shared data area. The same applies to the spiders (e.g. I need to count the maximum/total number of requests). My first implementation used class variables for the data shared between spiders/pipelines and instance variables for each object.
    class MyPipeline(object):
        max_price = 0

        def process_item(self, item, spider):
            if item['price'] > MyPipeline.max_price:
                MyPipeline.max_price = item['price']

(The actual structures are more complex.) I figured that having a bunch of statics is not OOP, so my next solution was to have a private data class for each class and use it to store the values:
    class MyPipelineData:
        def __init__(self):
            self.max_price = 0

    class SpidersData:
        def __init__(self, total_requests, pipeline_data):
            self.total_requests = total_requests
            self.pipeline_data = pipeline_data  # the shared data between pipelines

    class MyPipeline(object):
        pipeline_data = None

        def process_item(self, item, spider):
            if _data is None:
                _data = spider.data.pipeline_data  # the shared data between pipelines
            if item['price'] > _data.max_price:
                _data.max_price = item['price']

    class Spider(scrapy.Spider):
        def __init__(self, spider_data):
            self._data = spider_data  # the same SpidersData object is passed to all spiders

Now I have one instance of the data shared between the pipelines (and the same for the spiders). Is this the correct way to do it? Should I apply the same OOP approaches in Python as I would in C++?
From what I understand, the approach you are proposing is to keep in each object a reference to a shared object that captures all of the shared data, and I think that is fine, provided you name it appropriately so the name suggests it is being shared, for readability.
Also, you're hiding the internals of the shared object and encapsulating them inside methods such as process_item(), which I think is important for maintainability (because changes in the internals of the shared object don't have to affect the other objects).
But I'm not sure about the way you are bootstrapping (i.e. initializing) the shared object. You have these two lines:
    if _data is None:
        _data = ...

which is a little surprising. I didn't quite understand what _data is and where it is defined. pipeline_data is assigned None and never assigned anything else, so I'm not sure what you meant there.
If possible, I would prefer to see a function called create_spiders() that creates the shared object and then creates the different spiders one by one, giving each of them a reference to the shared object. That makes the logic clear.
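A minimal sketch of what that could look like, reusing the MyPipelineData, SpidersData and Spider classes from the question. The fact that keyword arguments passed to CrawlerProcess.crawl() are forwarded to the spider's constructor is standard Scrapy behaviour; the rest of the wiring is just an illustration:

    from scrapy.crawler import CrawlerProcess

    def create_spiders(spider_classes):
        # Build the single shared object first.
        shared = SpidersData(total_requests=0, pipeline_data=MyPipelineData())

        # Then create the spiders one by one, giving each a reference to it;
        # keyword arguments to crawl() are passed on to the spider's __init__.
        process = CrawlerProcess()
        for spider_cls in spider_classes:
            process.crawl(spider_cls, spider_data=shared)

        # The caller can then run everything with process.start().
        return process

This way there is exactly one place where the shared object is created, and no object ever has to check whether it has already been initialized.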
However, in the special case where you want the shared object to be a singleton, consider making it a static (module-level) object in a module named appropriately, maybe globals.py. Then inside the spider code you would see things like:
    import globals

    class SpiderData:
        def update(self):
            self.data.price = 200
            globals.spiders_data_collector.process(self.data)

Inside the module globals you initialize the object spiders_data_collector. I think this requires less code, which is also important for maintainability.
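A sketch of what that globals.py module could contain; the SpidersDataCollector class and its process() method are placeholders for whatever aggregation you actually need:

    # globals.py -- module-level singleton holding the shared collector.

    class SpidersDataCollector:
        """Illustrative placeholder: aggregates data reported by all spiders."""
        def __init__(self):
            self.max_price = 0
            self.total_requests = 0

        def process(self, data):
            # Update the shared aggregates from one spider's data.
            self.max_price = max(self.max_price, data.price)
            self.total_requests += 1

    # Created once, when the module is first imported; every later
    # "import globals" sees this same instance.
    spiders_data_collector = SpidersDataCollector()

Because Python caches imported modules, every import globals returns the same module object, so spiders_data_collector effectively behaves as a singleton.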