Welcome to our bilingual social media corpus collected from Twitter and Weibo in March 2020. English tweets and Chinese Weibo posts related to COVID-19 are the focus of this corpus.
Please follow these instructions in order to use our web interface.
Downloads
The following files need to be downloaded from the repository and saved to the same directory:
keyword_search_backend.py
: the backend python code.keyword_search_frontend.css
: the style sheets for displaying the html in a certain way.keyword_search_frontend.html
: the html code that displays is used by the browser to render the html tags.keyword_search_frontend.js
: the java script code to join the html to the backend Python code.SourceHanSansSC-Regular.otf
: dependent file for generating word clouds.
It is important to save them all in the same ../user_dir
directory!
Dependent packages
Some dependent packages should already be installed on your computer (e.g., nltk, pandas, numpy, random, io, and http). Please ensure the following packages are also installed:
- jieba:
pip install jieba
- hanzidentifier:
pip install hanzidentifier
- wordcloud:
pip install wordcloud
- matplotlib:
pip install matplotlib
- PIL:
pip install Pillow
- requests:
pip install requests
Navigate to the ../user_dir
directory in terminal or command-line interface and run the code python keyword_search_backend.py
. Wait until you see "Ready for query" as shown; this means the web interface is ready to load in the browser!
Now, open a browser in your system (preferably Chrome) and paste localhost:9999
in the address bar. You will be able to see the webpage like below:
Troubleshooting
If you encounter a problem with the port, try changing the port number 9999
in line 308 of keyword_search_backend.py
to any unused port in your local system.
Search by keyword
- Enter the keyword in English or Chinese in the
text box
and press theSubmit
button to see a list of returned posts and a word cloud that matches the query.
Please note:
- Keywords must be one word with no spaces. For example, a search for "virus\ " will not return any posts. However, if the input is "virus", a total of 3,052 posts will be found and a random sample of 10 from the search results will be displayed. One Chinese word can be multiple characters. The same no-space rule applies; a search like "学校" (meaning school) will return 59 posts but "学\ 校" will not.
- Do not press the
enter
key because the query will not send. ClickSubmit
only. - There may be some lag after submitting the keyword due to the word cloud generation, please be patient.
- To see different social media posts if more than 10 were found, click the
Submit
button again.
Suggested searches:
- Try "asldkjb". You will be notified this keyword is not in our corpus.
- Try "test", "testing" and "TeStEd". The returned results are the same because raw data was tokenized, lemmatized and lower-cased.
- Try "what're", "wat" and "what". Different messages will be displayed according to the length of the returned results.
- Try an emoji... however, nothing will be returned because, sadly, the word cloud package does not handle emojis.
- Try "李" (a surname) and "李文亮" (the full name of the late Wuhan doctor). Jieba is built for text segmentation and not named entity recognition. Even though "李文亮" is quite common in our corpus, technically it is not in our dictionary. A customized vocabulary dictionary needs to be created and then merged with the existing Jieba dictionary in order to capture names like "李文亮".
- Try "学校" (simplified Chinese for school) and "學校" (traditional Chinese). And then try "学" (simplified Chinese for to learn) and "學" (traditional Chinese). The Chinese data was not normalized for form so results varied. The vocabulary is built on Jieba's text segmentation of the raw post.
Search by annoation
- First, select the annotation category to query in the
drop-down
menu. Then, select the language and clickSubmit
to see social media posts that matches the query. If more than 10 posts were found, click theSubmit
button again to see a different batch of posts.
Please note:
- Do not press the
enter
key because the query will not send. ClickSubmit
only.