Visualize as Word Cloud for Chinese keywords

Sep 15, 2020

—

Data visualization is amazing. For example, by looking at the following word cloud figure full of Chinese keywords, you can easily learn what happened in last week’s trend news. Let’s generate a word cloud like this.

Don’t understand the language is not a big deal. If your written language is based on the Latin alphabet(or other languages have space between words), skip the section tokenization.

Why I want to have a Word Cloud

Recently, I set up a web-based RSS client for retrieving and organizing everyday news. I used TinyTinyRSS, or TT-Rss, a popular RSS client that is friendly to docker. Thanks to developer HenryQW, a well-written Nginx-based docker configuration is already available in the docker hub. With more feeds being added, I found some feeds do not need to be checked daily. Thus I was thinking of creating a script to automatically list all keywords that appear in a last period and generate a heat map kind of figure of it.

Before you go further, I’ll tell you all my settings to give readers a general overview.

My first step is to read all text-based information from TTRSS’s PostgreSQL database. With information, I used a Chinese-NLP library, Jieba, to extract all keywords with the frequency of their occurrence. By using WordCloud, a Python library, a word cloud figure is generated and presented. More details will be discussed in later sections.

Keyword readiness: Get RSS feeds’ text

My first thought is generating a keyword heat map for economic news of the last week. Since this blog post is more skewed to Chinese tokenization and draws the word cloud figure, I will leave my code here just in case. The SQL connector I used is psycopg2, an easy-to-use PostgreSQL library.

def __init__(self):
    self.dbe = psycopg2.connect(
        host=DB_HOST, port=DB_PORT, database=DB_NAME, user=DB_USER, password=DB_PASS)

def get_1w_of_feed_byid(self, id=1) -> list:
    cur = self.dbe.cursor()
    cur.execute('SELECT content FROM public.ttrss_entries \
        where date_updated > now() - interval \'1 week\' AND id in ( \
        select int_id from DB_TABLE_NAME \
        where feed_id=' + str(id) + ' \
        ) \
        ORDER BY id ASC '
        )
    rows = cur.fetchall()
    return rows

Most arguments are intuitive and easy to understand. The only exception is the argument of the function get_1w_of_feed_byid. This id is the feed index of my subscriptions.

Tokenize with frequency

Two popular tokenization libraries were used, and I chose Jieba after a few comparisons. Before cutting the sentence, we first need to remove all punctuation marks.

def remove_biaodian(text: str) -> str:
    punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐､﹒
                ﹔﹕﹖﹗﹚﹜﹞！），．：；？｜｝︴︶︸︺︼︾﹀﹂﹄﹏､～￠
                々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖（［｛￡￥〝︵︷︹︻
                ︽︿﹁﹃﹙﹛﹝（｛“‘-—_…''')
    ret = ""
    for x in text:
        if x in punct:
            ret += ''
        else:
            ret += x
    return ret

After we have an all-character string, we can call Jieba. By using the function jieba.posseg.cut with or without a paddle, we can have a word list and their “part of speech”. As you can see in the following code, I also did two more works.

First, in the if statement, I only kept all nouns with some categories. Category abbreviations such as “nr” and “ns” represent different “parts of speech”, I attached the categories I used in the following table. For more details, you can find this link.

The second work is only keeping words with lengths longer than 2 characters. In Chinese, there’s no space between words such as in Latin writing systems. Since then, some single-character words such as conjunction words are easy to be misrecognized as specialty nouns. This misrecognition will cause more single-character being regarded as specialty nouns. I am not able to improve the NLP method, so I used an easy way to fix this by removing any words less than 2 characters.

import jieba.posseg as pseg

def get_noun_jieba(self, content: str) -> list:
    content = remove_biaodian(content)
    words = pseg.cut(content)   # Invoking jieba.posseg.cut function 

    ret = []
    for word, flag in words:
        # print(word, flag)
        if flag in ['nr', 'ns', 'nt', 'nw', 'nz', 'PER', 'ORG', 'x']:   # LOC
            ret.append(word)
    return [remove_biaodian(i) for i in ret if i.strip() != "" and len(remove_biaodian(i.strip())) >= 2]

Word category names and abbreviations

Abbreviation	Category name/ Part of speech
nr	People name noun
ns	Location name noun
nt	Organization name noun
nw	Arts work noun
nz	Other noun
PER	People name noun
ORG	Location name noun
x	Non-morpheme word

With all words extracted, we can easily calculate their frequencies. After this, we can using the following line of code to print a sorted result to verify correctness.

noun = seg.get_noun_jieba(test_content)
# ... Calculate frequency of above word list ...
print(sorted(a_dict.items(), key=lambda x: x[1]))

Draw word cloud

With a keyword and frequency dictionary(data structure), we can just call built-in functions from Wordcloud library to generate the figure.

First, we need to initialize an instance of wordcloud class. As you can see in my code, I set it with 6 parameters. Width and Height of the canvas, the maximum amount of words used to generate the figure, the font of words, the background color, and the margin between any two words.

After having the instance, we call the function generate_from_frequencies and pass the keyword dictionary to it. The return value of this function is a bitmap image, which we can use Matplotlib to plot it to your screen.

I tested my plot on ubuntu-subsystem on Windows 10, unfortunately, Matplotlib under subsystem depends on the x11 window manager, and it’s not default available on Windows. We need to install an x11 manager to support this. Xming is the one I used.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

font_path = "./font/haipai.ttf"
output_path = "./font/out.png"

def show_figure_with_frequency(keywords: dict):
    wc = WordCloud(width=828, height=1792, max_words=200, font_path=font_path,
                   background_color="white", margin=1).generate_from_frequencies(keywords)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()

If everything work fine, a word cloud figure will show up in a new window. My version looks like this.

This generated word cloud figure reflects the most popular economy news keyword in the week started 06-28-2020.

The two largest words in the figure are “新冠” and “新冠病毒”, both meaning “Covid-19” (This figure was in the week of the second covid spur in Beijing, China). The size of the image fits my phone screen and I can use an app to automatically sync it to my phone’s wallpaper. However, in this image, too many location nouns are presented. This will be something I can make progress on in the future.

Visualization

Visualize as Word Cloud for Chinese keywords

Why I want to have a Word Cloud

Keyword readiness: Get RSS feeds’ text

Tokenize with frequency

Draw word cloud

Comments

Leave a Reply Cancel reply