Python 3.x - Can't avoid stop words in tokens list
I'm normalizing text from a wiki, and one of the tasks is to delete stop words (items) from the text tokens. I can't do it; more exactly, I can't filter out some of the items.
Code:

    # coding: utf8
    import os
    from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
    import win_unicode_console
    win_unicode_console.enable()

    stop_words_plus = ['il', 'la']
    text_tags = ['doc', 'https', 'br', 'clear', 'all']
    it_sw = corpus.stopwords.words('italian') + text_tags + stop_words_plus
    it_path = os.listdir('c:\\users\\1\\projects\\i')
    lom_path = 'c:\\users\\1\\projects\\l'
    it_corpora = []
    lom_corpora = []

    def normalize(raw_text):
        tokens = word_tokenize(raw_text)
        norm_tokens = []
        for token in tokens:
            if token not in it_sw and token.isalpha() and len(token) > 1:
                token = token.lower()
                norm_tokens.append(token)
        return norm_tokens

    for folder_name in it_path:
        path_to_files = 'c:\\users\\1\\projects\\i\\%s' % (folder_name)
        files_list = os.listdir(path_to_files)
        for file_name in files_list:
            file_path = path_to_files + '\\' + file_name
            text_file = open(file_path, encoding='utf8')
            raw_text = text_file.read()
            norm_tokens = normalize(raw_text)
            it_corpora += norm_tokens

    print(FreqDist(it_corpora).most_common(10))

Output:

    [('anni', 1140), ('il', 657), ('la', 523), ('gli', 287), ('parte', 276), ('stato', 276), ('due', 269), ('citta', 254), ('nel', 248), ('decennio', 242)]

As you can see, I need to avoid the words 'il' and 'la'. I add them to the list (it_sw), and they are there (I've checked). In the function normalize I try to avoid them with `if token not in it_sw`, but it doesn't work, and I have no idea what's wrong.
You convert the token to lower case only after finding that it is not in it_sw. Is it possible that some of your tokens have upper-case characters? A token such as 'Il' is not equal to 'il', so it passes the stop word check and is only lowercased afterwards. In that case you could adjust your loop slightly:

    for token in tokens:
        token = token.lower()
        if token not in it_sw and token.isalpha() and len(token) > 1:
            norm_tokens.append(token)

By the way, I'm not sure if the performance of your code is important, but if it is, you might get better performance by checking for the presence of tokens in a set instead of a list. Just change the definition of it_sw to:

    it_sw = set(corpus.stopwords.words('italian') + text_tags + stop_words_plus)

You could also change it_corpora to a set, but that would require a few more small changes.
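
To see both suggestions together without needing the NLTK data, here is a minimal sketch. It assumes a small hand-written stop word set in place of the real it_sw and a plain whitespace split in place of word_tokenize; the sample sentence is made up for illustration.

```python
# Hypothetical stand-in for the real stop word collection, which the
# question builds from corpus.stopwords.words('italian') plus extra words.
# A set is used so that each membership check is fast.
it_sw = {'il', 'la', 'gli', 'nel'}

def normalize(tokens):
    """Lowercase each token first, then filter it against the stop words."""
    norm_tokens = []
    for token in tokens:
        token = token.lower()  # lowercase BEFORE the stop word check
        if token not in it_sw and token.isalpha() and len(token) > 1:
            norm_tokens.append(token)
    return norm_tokens

# Whitespace split is an assumption made here to keep the sketch
# self-contained; the original code uses word_tokenize.
tokens = "Il decennio e La parte della citta nel 1990".split()
result = normalize(tokens)
print(result)
```

With the check placed after the lowercasing, capitalized forms such as 'Il' and 'La' are filtered out along with their lower-case variants.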