Using jieba Instead of SnowNLP's Built-in Tokenizer

Quan

July 22, 2021


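## Preface

SnowNLP's built-in tokenizer is not great, and its results are worse than jieba's. Most of the replacement methods I found online require editing the SnowNLP source code, which is inconvenient. Below is my way of swapping in jieba without touching the SnowNLP source.

## Main Text

Let's go straight to the code!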

```python
import jieba
from snownlp import SnowNLP, normal, sentiment

# Load the custom jieba dictionary once ('mydict.txt' must exist)
jieba.load_userdict('mydict.txt')
stop_words = ['xxx']  # your own extra stop words


def handle(self, doc):
    # SnowNLP originally calls words = seg.seg(doc) here; use jieba.lcut instead
    words = jieba.lcut(doc)
    # Drop SnowNLP's built-in stop words (stopwords.txt in the normal folder)
    words = normal.filter_stop(words)
    # Drop your own stop words
    words = [w for w in words if w not in stop_words]
    return words


def test():
    # Override handle on the class so that SnowNLP tokenizes with jieba
    sentiment.Sentiment.handle = handle
    string = '今天xxx战队打野真的很菜鸡，迷之走位，几次大招放空,辅助玩的也菜鸡。xxx战队今天失败了!'
    sent = sentiment.Sentiment()
    words_list = sent.handle(string)
    print(words_list)
    print(SnowNLP(string).sentiments)


if __name__ == '__main__':
    test()
```

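Because Python functions are objects too, you can override a method by assigning a new function to it. Note, however, that this is a global change: every other caller gets the modified version as well. If you don't want that, override the method on an instantiated object instead.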
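To make the instance-level option concrete, here is a minimal sketch. It uses a stand-in class, `ToySentiment`, and a replacement function, `whitespace_handle`; both are hypothetical names made up for illustration and are not part of SnowNLP. The point it shows is that a function assigned to an instance has to be bound with `types.MethodType`, otherwise `self` is not passed in.

```python
import types


class ToySentiment:
    """Stand-in for a class like snownlp.sentiment.Sentiment (hypothetical)."""

    def handle(self, doc):
        # Default tokenizer: one token per character
        return list(doc)


def whitespace_handle(self, doc):
    # Replacement tokenizer: split on whitespace
    return doc.split()


patched = ToySentiment()
untouched = ToySentiment()

# Bind the function to this one instance only. A plain
# `patched.handle = whitespace_handle` would not pass `self` automatically.
patched.handle = types.MethodType(whitespace_handle, patched)

print(patched.handle("a b"))    # ['a', 'b']
print(untouched.handle("a b"))  # ['a', ' ', 'b']
```

One caveat, as far as I can tell from the SnowNLP source: `SnowNLP(text).sentiments` goes through a shared `Sentiment` instance inside the `sentiment` module, so patching an instance you created yourself only affects calls made on that instance, such as its own `classify` method.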

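Results before and after the change (output above the `#` separator is from before, output below it is from after):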

```powershell
['菜', '空', '辅助']
['今天xxx战队打野真的很菜鸡', '迷之走位', '几次大招放空,辅助玩的也菜鸡']
0.7125215958596806
#########################################
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\****\AppData\Local\Temp\jieba.cache
Loading model cost 0.751 seconds.
Prefix dict has been built successfully.
['战队', '打野', '真的', '很', '菜鸡', '迷', '走位', '几次', '大招', '放空', '辅助', '玩', '菜鸡', '战队', '失败']
0.5399721041336288
[('战队', 5.00575331636), ('菜鸡', 4.78190700116), ('走位', 2.39095350058)]
```

You can also improve the tokenization results further by changing the body of the `handle` function.
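One possible way to keep such tweaks manageable is to build `handle` from a small factory. The sketch below is only an illustration: `make_handle`, `extra_stop_words` and `min_length` are names I made up, and the demo uses `str.split` as the tokenizer so that it runs without jieba or SnowNLP installed. It also leaves out the `normal.filter_stop` step from the code above.

```python
def make_handle(tokenize, extra_stop_words=(), min_length=1):
    """Build a handle(self, doc) function around any tokenizer callable."""
    stop = set(extra_stop_words)

    def handle(self, doc):
        words = tokenize(doc)
        # Strip surrounding whitespace from every token
        words = [w.strip() for w in words]
        # Drop empty tokens, tokens shorter than min_length, and stop words
        return [w for w in words if len(w) >= min_length and w not in stop]

    return handle


# Demo with a whitespace tokenizer (an assumption made to keep this runnable)
demo_handle = make_handle(str.split, extra_stop_words={"xxx"}, min_length=2)
print(demo_handle(None, "xxx team lost a game today"))
# ['team', 'lost', 'game', 'today']
```

With jieba, the same idea would be to pass `jieba.lcut` as `tokenize` and assign the returned function to `sentiment.Sentiment.handle`, exactly as in the code at the top of this post.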

Last modified: July 22nd, 2021 at 05:09 pm
