OOP in Python (related to Scrapy)
The question is how to share data between objects in a safe, maintainable manner.
Example: I've built a Scrapy application that spawns numerous spiders. Although each spider is connected to a separate pipeline object, I need to compare and sort data between the different pipelines (e.g. I need the outputs sorted by different item attributes: price, date, etc.), so I need a shared data area. The same applies to the spiders (e.g. I need to count the maximum/total number of requests). My first implementation used class variables for the data shared between spiders/pipelines and instance variables for each object.
    class MyPipeline(object):
        max_price = 0

        def process_item(self, item, spider):
            if item['price'] > MyPipeline.max_price:
                MyPipeline.max_price = item['price']

(The actual structures are more complex.) I figured that having a bunch of statics is not OOP, so my next solution was to have a private data class for each class and use it to store the values:
    class MyPipelineData:
        def __init__(self):
            self.max_price = 0

    class SpidersData:
        def __init__(self, total_requests, pipeline_data):
            self.total_requests = total_requests
            self.pipeline_data = pipeline_data  # the shared data between pipelines

    class MyPipeline(object):
        pipeline_data = None

        def process_item(self, item, spider):
            if _data is None:
                _data = spider.data.pipeline_data  # the shared data between pipelines
            if item['price'] > _data.max_price:
                _data.max_price = item['price']

    class Spider(scrapy.Spider):
        def __init__(self, spider_data):
            self._data = spider_data  # the same SpidersData object is passed to all spiders

Now I have one instance of the data shared between the pipelines (and the same for the spiders). Is this the correct way to do it? Should I apply the same OOP approaches in Python as I would in C++?
From what I understand, the approach you are proposing is to keep in each object a reference to a shared object that captures all of the shared data, and I think that is fine, provided you name it appropriately so the name suggests it is being shared, for readability.
Also, you're hiding the internals of the shared object and encapsulating them inside methods such as process_item(), which I think is important for maintainability (because changes in the internals of the shared object don't have to affect the other objects).
But I'm not sure about the way you are bootstrapping (i.e. initializing) the shared object. You have these two lines:
    if _data is None:
        _data = ...

which is a little surprising. I didn't quite understand what _data is and where it is defined. pipeline_data is assigned None and never assigned anything else, so I'm not sure what you meant there.
If possible, I would prefer to see a function called create_spiders() that creates the shared object and then creates the different spiders one by one, giving each of them a reference to the shared object. That makes the logic clear.
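A minimal sketch of what that could look like, reusing the MyPipelineData, SpidersData and Spider classes from the question. The fact that keyword arguments passed to CrawlerProcess.crawl() are forwarded to the spider's constructor is standard Scrapy behaviour; the rest of the wiring is just an illustration:

    from scrapy.crawler import CrawlerProcess

    def create_spiders(spider_classes):
        # Build the single shared object first.
        shared = SpidersData(total_requests=0, pipeline_data=MyPipelineData())

        # Then create the spiders one by one, giving each a reference to it;
        # keyword arguments to crawl() are passed on to the spider's __init__.
        process = CrawlerProcess()
        for spider_cls in spider_classes:
            process.crawl(spider_cls, spider_data=shared)

        # The caller can then run everything with process.start().
        return process

This way there is exactly one place where the shared object is created, and no object ever has to check whether it has already been initialized.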
However, in the special case where you want the shared object to be a singleton, consider making it a static (module-level) object in a module named appropriately, maybe globals.py. Then inside the spider code you would see things like:
    import globals

    class SpiderData:
        def update(self):
            self.data.price = 200
            globals.spiders_data_collector.process(self.data)

Inside the module globals you initialize the object spiders_data_collector. I think this requires less code, which is also important for maintainability.
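A sketch of what that globals.py module could contain; the SpidersDataCollector class and its process() method are placeholders for whatever aggregation you actually need:

    # globals.py -- module-level singleton holding the shared collector.

    class SpidersDataCollector:
        """Illustrative placeholder: aggregates data reported by all spiders."""
        def __init__(self):
            self.max_price = 0
            self.total_requests = 0

        def process(self, data):
            # Update the shared aggregates from one spider's data.
            self.max_price = max(self.max_price, data.price)
            self.total_requests += 1

    # Created once, when the module is first imported; every later
    # "import globals" sees this same instance.
    spiders_data_collector = SpidersDataCollector()

Because Python caches imported modules, every import globals returns the same module object, so spiders_data_collector effectively behaves as a singleton.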