About preprocessing #4

Open
doulalala opened this issue May 11, 2019 · 2 comments

@doulalala

Hi, when I run token_and_save_to_file.py it fails with TypeError: can't pickle _thread.RLock objects. How can I fix this? The error only goes away if I comment out data = Pool().map(jieba.lcut, data), but then the text never gets tokenized.

@yahuuu

yahuuu commented Jan 6, 2020

I've hit the same problem. @hrwhisper, could you take a look?

@rainmaple

You can change it to a single-threaded version, without the process pool:

if __name__ == '__main__':
    data, target = read_train_data()
    # data = Pool().map(jieba.lcut, data)  # this call raises the pickle error
    # Tokenize sequentially; lcut returns a list, matching the original call.
    data2words = [jieba.lcut(text) for text in data]
    save_tokenlization_result(data2words, target)

    with codecs.open('./data/tags_token_results', 'r', 'utf-8') as f:
        data = [line.strip().split() for line in f.read().split('\n')]
        if not data[-1]:
            data.pop()  # drop the empty entry left by a trailing newline
        t = [Counter(d) for d in data]  # one SMS message per line; the counts are the TF
        v = DictVectorizer()
        t = v.fit_transform(t)  # sparse matrix; each word gets a column index
        TrainData.save(t)
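
If you want to keep the process pool, one possible workaround is to pass Pool().map a plain module-level function instead of the tokenizer's method. This is a sketch based on an assumption: that jieba.lcut is a bound method of an object holding a lock, so pickling it for the worker processes fails. To stay runnable without jieba, the example uses a hypothetical stand-in class, LockedTokenizer; the names LockedTokenizer, tokenize and bound_method_is_picklable are made up for illustration.

```python
import pickle
import threading
from multiprocessing import Pool


class LockedTokenizer:
    """Hypothetical stand-in for a tokenizer object that owns a lock."""

    def __init__(self):
        self.lock = threading.RLock()

    def lcut(self, text):
        # Whitespace split, only to keep the sketch free of third-party imports.
        with self.lock:
            return text.split()


_tokenizer = LockedTokenizer()


def tokenize(text):
    # A module-level function is pickled by name, so the lock held by
    # _tokenizer is never serialized when the task is sent to a worker.
    return _tokenizer.lcut(text)


def bound_method_is_picklable():
    # Pickling a bound method also pickles its instance, including the RLock.
    try:
        pickle.dumps(_tokenizer.lcut)
        return True
    except TypeError:
        return False


if __name__ == '__main__':
    data = ['a b c', 'd e']
    print(bound_method_is_picklable())  # shows whether the bound method can be sent to workers
    with Pool(2) as pool:
        print(pool.map(tokenize, data))
```

With jieba, the body of tokenize would presumably just be return jieba.lcut(text), and the call would become Pool().map(tokenize, data). I have not verified this against the repository's code, so treat it as something to try rather than a confirmed fix.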
