jieba: Chinese Word Segmentation and Word Clouds


Use jieba and wordcloud to handle Chinese word segmentation and word cloud generation, respectively.
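For readers new to jieba, here is a minimal segmentation sketch (not part of the original post; the sample sentence is the one used in jieba's README):

import jieba

# Precise mode: split the sentence into a list of words
print(' '.join(jieba.lcut("我来到北京清华大学")))   # e.g. 我 来到 北京 清华大学

# Full mode: list every word jieba can find in the sentence
print(' '.join(jieba.lcut("我来到北京清华大学", cut_all=True)))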

Below is a word cloud built by following the documentation on GitHub.

from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import os
import jieba.analyse

from wordcloud import WordCloud, STOPWORDS

# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

# Read the whole text.
text = open(path.join(d, 'sky.txt'), encoding='utf-8').read()

# Filter by part of speech: 'nr' keeps person names only
allow_pos = ('nr',)

# Extract the top keywords with jieba's TF-IDF keyword extraction
content2 = jieba.analyse.extract_tags(text, topK=100, withWeight=False, allowPOS=allow_pos)

# Join the keywords into a single space-separated string for WordCloud
result = ' '.join(content2)

# read the mask image
# taken from
# https://raw.githubusercontent.com/amueller/word_cloud/master/examples/alice_mask.png
alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))

# Stopwords used for filtering below; the built-in STOPWORDS set only covers English words
# (you can supply your own stopword file, or rely on jieba's POS filtering instead)
stopwords = set(STOPWORDS)
stopwords.add("said")

# A Chinese-capable font must be supplied, otherwise Chinese characters will not render
wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords, contour_width=3, contour_color='steelblue',
               font_path=r'C:\Windows\Fonts\simhei.ttf')

# generate word cloud
wc.generate(result)

# store to file
wc.to_file(path.join(d, "alice.png"))

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
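As the stopwords comment above notes, the built-in list only covers English; a custom Chinese stopword file can also be merged in. A minimal sketch, assuming a hypothetical one-word-per-line UTF-8 file named stopwords_zh.txt in the same directory:

# Hypothetical stopword file: one Chinese stopword per line, UTF-8 encoded
with open(path.join(d, 'stopwords_zh.txt'), encoding='utf-8') as f:
    stopwords |= {line.strip() for line in f if line.strip()}
# Pass the enlarged set to WordCloud(stopwords=stopwords) as in the example above.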

Result image:

逆天邪神.png
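A variant worth noting (a sketch, not part of the original example): extract_tags can also return the TF-IDF weight of each keyword, and WordCloud can consume those weights directly through generate_from_frequencies instead of re-counting words in a joined string. The output filename alice_weighted.png is made up for illustration:

# Keep the TF-IDF weight of each keyword instead of discarding it
tags = jieba.analyse.extract_tags(text, topK=100, withWeight=True, allowPOS=allow_pos)
freqs = dict(tags)  # {word: weight}

wc2 = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
                font_path=r'C:\Windows\Fonts\simhei.ttf')
wc2.generate_from_frequencies(freqs)
wc2.to_file(path.join(d, "alice_weighted.png"))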

Online word cloud generators

Recommended sites

-------------end-------------