python 3.x - Can't avoid stop words in tokens list

- May 15, 2015

I'm normalizing text from a wiki, and one of the tasks is to delete stop words (items) from the text tokens. I can't do it; more precisely, I can't get rid of some of the items.

code:

# coding: utf8
import os
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
import win_unicode_console

win_unicode_console.enable()

stop_words_plus = ['il', 'la']
text_tags = ['doc', 'https', 'br', 'clear', 'all']
it_sw = corpus.stopwords.words('italian') + text_tags + stop_words_plus
it_path = os.listdir('c:\\users\\1\\projects\\i')
lom_path = 'c:\\users\\1\\projects\\l'
it_corpora = []
lom_corpora = []

def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw and token.isalpha() and len(token) > 1:
            token = token.lower()
            norm_tokens.append(token)
    return norm_tokens

for folder_name in it_path:
    path_to_files = 'c:\\users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path, encoding='utf8')
        raw_text = text_file.read()
        norm_tokens = normalize(raw_text)
        it_corpora += norm_tokens

print(FreqDist(it_corpora).most_common(10))

output:

[('anni', 1140), ('il', 657), ('la', 523), ('gli', 287), ('parte', 276), ('stato', 276), ('due', 269), ('citta', 254), ('nel', 248), ('decennio', 242)]

As you can see, I need to avoid the words 'il' and 'la'. I added them to the list (it_sw), and they are there (I've checked). In the function normalize I try to avoid them with `if token not in it_sw`, but it doesn't work, and I have no idea what's wrong.

You convert the token to lower case only after finding that it is not in it_sw. Is it possible that some of your tokens have upper-case characters? In that case you could adjust your loop slightly:

for token in tokens:
    token = token.lower()
    if token not in it_sw and token.isalpha() and len(token) > 1:
        norm_tokens.append(token)
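To see why the order of the two steps matters, here is a minimal sketch that runs without NLTK. The short stop-word list and the whitespace `split()` are stand-ins (assumptions) for `corpus.stopwords.words('italian')` and `word_tokenize`, and the two function names are made up for the comparison.

```python
# Stand-in stop-word list, all lower case.
it_sw = ['il', 'la', 'gli']

def normalize_check_first(raw_text):
    """Order used in the question: test membership, then lower-case."""
    norm_tokens = []
    for token in raw_text.split():  # split() stands in for word_tokenize
        if token not in it_sw and token.isalpha() and len(token) > 1:
            norm_tokens.append(token.lower())
    return norm_tokens

def normalize_lower_first(raw_text):
    """Adjusted order: lower-case, then test membership."""
    norm_tokens = []
    for token in raw_text.split():
        token = token.lower()
        if token not in it_sw and token.isalpha() and len(token) > 1:
            norm_tokens.append(token)
    return norm_tokens

sample = "Il decennio la parte La citta"
print(normalize_check_first(sample))  # 'Il' and 'La' pass the check and come out as 'il', 'la'
print(normalize_lower_first(sample))  # only the non-stop-words remain
```

With the check first, capitalized tokens such as 'Il' at the start of a sentence are compared against a lower-case list, pass the test, and are only then lower-cased.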

By the way, I'm not sure whether the performance of your code is important, but if it is, you might get better performance by checking for the presence of tokens in a set instead of a list. Just change the definition of it_sw to:

it_sw = set(corpus.stopwords.words('italian') + text_tags + stop_words_plus)
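As a rough illustration of the difference, the sketch below runs the same membership test against a list and against a set built from the same words. The word list is invented for the example, and the timings depend on the machine, so no figures are quoted here.

```python
import timeit

# Invented stand-in for the combined stop-word list.
words = ['w%d' % i for i in range(500)] + ['il', 'la']
sw_list = list(words)
sw_set = set(words)

# Both containers give the same answers; only the lookup cost differs.
print('la' in sw_list, 'la' in sw_set)
print('anni' in sw_list, 'anni' in sw_set)

# A list is scanned item by item; a set uses a hash lookup.
list_time = timeit.timeit(lambda: 'anni' in sw_list, number=10000)
set_time = timeit.timeit(lambda: 'anni' in sw_set, number=10000)
print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

The gap tends to grow with the length of the stop-word list, since a miss in a list has to compare against every item.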

You could also change it_corpora to a set, but that would require a few more small changes.
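A sketch of what those small changes might look like, assuming only the distinct tokens are needed. `collections.Counter` is used in place of `FreqDist` so the example runs without NLTK, and the token batches are invented.

```python
from collections import Counter

# Invented normalized tokens, one list per file.
batches = [['anni', 'parte', 'anni'], ['stato', 'anni']]

# List version, as in the question: duplicates are kept, so counts are meaningful.
it_corpora_list = []
for norm_tokens in batches:
    it_corpora_list += norm_tokens

# Set version: `+=` with a list raises TypeError on a set, so use update().
it_corpora_set = set()
for norm_tokens in batches:
    it_corpora_set.update(norm_tokens)

print(Counter(it_corpora_list).most_common(1))
print(sorted(it_corpora_set))
```

Because a set keeps each token only once, a frequency count such as `most_common` still needs the list version; the set version fits cases where only the vocabulary matters.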
