A tool to upskill with friends
- Check out the web app Text Ascent
- View the Presentation of Text Ascent, PDF
I've often found myself reading an article, say on data science, and wondering where I could read something simpler on the topic. I learned I wasn't the only one when a friend posted a similar question on LinkedIn. She asked how to find articles within a specific range between the simplest and the most complex. We don't have an easy system for that type of search besides manually reading for a good fit.
Building on my interests in web search, I created Text Ascent, a web app that uses unsupervised ML to help users discover content based on text complexity. I hope Text Ascent can be one tool for finding content at every stage of our learning journeys. A central goal for Text Ascent is to make the niche topics that people share an interest in more accessible to each other.
Photo Credit: Ted Bryan Yu, Unsplash
I used Wikipedia-API, a Python wrapper for Wikipedia's API, to gather article titles on topics ranging from art to science. Then I ran a data gathering function (scrape_to_mongodb.py) that took those titles and scraped 11k+ articles for summaries, full text, and URLs into a MongoDB database. I excluded articles whose full text was fewer than 300 words because Wikipedia has entries like 'music file' that did not serve my model's purpose.
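The length filter described above can be sketched as follows. This is a minimal illustration, not the contents of scrape_to_mongodb.py: the function names and the record fields (`title`, `summary`, `text`, `url`) are assumptions, and the scraping and database steps are left out.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words in a text."""
    return len(text.split())


def build_records(articles, min_words=300):
    """Keep only articles whose full text has at least `min_words` words.

    `articles` is assumed to be an iterable of dicts with the hypothetical
    keys 'title', 'summary', 'text', and 'url'.
    """
    records = []
    for article in articles:
        if word_count(article["text"]) < min_words:
            continue  # skip stub-like entries such as 'music file'
        records.append(
            {
                "title": article["title"],
                "summary": article["summary"],
                "text": article["text"],
                "url": article["url"],
            }
        )
    return records
```

The resulting list of dicts is in a shape that could then be written to a MongoDB collection in one batch.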
Data Collection Notebook & Data Exploration Notebook
The content returned from the Wikipedia-API wrapper did not require further cleaning. I did need to make sure that, when the content was displayed on the web app, the HTML was read as JSON so that carriage returns were not displayed to the user. I graded the full text of each document using the textstat package's Flesch-Kincaid Grade.
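For readers unfamiliar with the Flesch-Kincaid Grade, here is a rough from-scratch sketch of the formula. It is not textstat's implementation: the syllable counter is a crude vowel-group heuristic assumed for illustration, so scores will differ somewhat from the ones the package reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid Grade Level of a text.

    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # nothing to grade
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
```

Longer sentences and longer words both push the grade up, which is why short, plain sentences can even score below zero.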
These files are saved in an AWS S3 bucket to make the web app accessible.
The current model uses cosine distance between the top 20 most important features of the corpus vectors and of the user input vector to return library content similar to the user input. The model features were created with a TF-IDF vectorizer. The TF-IDF vectorizer splits the corpus documents into words, removes stop words, and computes a term frequency for each word in each document, adjusted for how frequently the word appears across the corpus. In other words, uncommon words are given more weight than commonly used words.
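To make the idea concrete, here is a small from-scratch sketch of TF-IDF weighting with cosine similarity. It is a simplified illustration, not the deployed model: the tiny stop word list, the smoothed idf formula, and the choice to keep the top `top_k` terms per document are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, far shorter than a real one).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "on"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(corpus):
    """Learn a smoothed inverse document frequency for every corpus term."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(text, idf, top_k=20):
    """Weight term counts by idf and keep only the top_k heaviest terms."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, corpus, idf, top_k=20):
    """Return (similarity, document index) pairs, most similar first."""
    q = transform(query, idf, top_k)
    scored = [
        (cosine_similarity(q, transform(doc, idf, top_k)), i)
        for i, doc in enumerate(corpus)
    ]
    return sorted(scored, reverse=True)
```

Cosine distance is one minus this similarity, so ranking by highest similarity is the same as ranking by smallest distance.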
- Get a list of documents of interest and format it into a dataframe like `clean_df`. Get text difficulty scores using TextStat. My example on AWS S3: clean_df
- Fit your vectorizer to your corpus, which is the text series in your df (this learns the vocabulary and idf from the training set). My example on AWS S3: vectorizer
- Use the vectorizer's transform function (transforms documents to a document-term matrix) to create your corpus vectors. My example on AWS S3: corpus vectors
- Clone this repository.
- In the `traverse_flask` directory, create an empty subdirectory named `data`.
- Implement the Flask app by running flask in `traverse_flask` in the terminal with `export FLASK_APP=app` followed by `flask run`. This Flask `app.py` takes in functions from `functions.py`. Adjust the functions to change the data pipeline on the backend. Adjust the Brython in `static/templates/index.html` to change the way data is reflected to the user.
This product is successful if users are able to discover content related to what they were already reading that is of a different reading difficulty. User satisfaction, repeat usage, web app traffic, and sharing of the app are the metrics I am using to evaluate Text Ascent's success. I evaluated 4 models before going with the model deployed on the web app:
- Model 1: Used TextStat, Gensim, and Spacy.
- Model 2: Used Latent Dirichlet Allocation (LDA) topic modeling with 10 topics, then sorted user content into a topic.
- Model 3: Used TextStat and TF-IDF Vectorizer with 2000 dimensions.
- Model 4: Used TextStat and TF-IDF Vectorizer with top 20 features.
Each iteration aimed to make the returned content more similar to the user input content.
Future Modeling: I would also like to compare a pre-trained neural network to my current TF-IDF vectorization to see if the quality of returned content improves. Improvement would be measured through user feedback in a simple manual grading system to be added to the web app.
Text Ascent has been deployed as a Flask web app at traverse.sherzyang.com on an EC2 instance. The app uses Brython to connect the Python functions to the HTML. Below are two images from the web app. Given any user input text, the model outputs related articles from the library, with each title linking to the full-length article. Users can scroll or traverse from simpler content to more complex content, and the table updates accordingly.
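The traversal from simpler to more complex content could look something like the sketch below. This is a hypothetical illustration rather than the code in functions.py: the `traverse` function, the `window` parameter, and the `grade` field on each result are assumptions.

```python
def traverse(results, target_grade, window=2.0):
    """Return results within `window` grade levels of `target_grade`, simplest first.

    `results` is assumed to be a list of dicts, each with a numeric 'grade'
    (for example a Flesch-Kincaid Grade) alongside fields such as 'title'.
    """
    in_range = [r for r in results if abs(r["grade"] - target_grade) <= window]
    return sorted(in_range, key=lambda r: r["grade"])
```

Moving `target_grade` up or down would then shift the table toward harder or easier articles on the same topic.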
As part of my interests in search and our new world of one-shot answers--thank you Alexa, Siri and Google Home--I plan on deploying Text Ascent as an Amazon Alexa skill. The skill will allow a user to "scroll" or "traverse" along a gradient of simpler to more complex summaries on a topic just like telling Alexa to play a song louder or softer. I believe creating options in content will expand us beyond the world of one-shot answers in a positive way.
Additionally, I am eager to grow the corpus to include books from Project Gutenberg and beyond. If you have some content you'd like to see added to the current library of Wikipedia articles, please send me a message on LinkedIn. I've seen several web extensions that grade a book's reading difficulty on Amazon or Goodreads (Read Up is a great one). Those products inspire me to develop a corpus-free search functionality for Text Ascent in the future. I envision Text Ascent becoming much more useful when it can return content retrieved through the Google or Bing web search APIs.