base_nlp
--------

This package is meant to hold some basic NLP tools and, over time, compete with scikit-learn and nltk in speed (obviously not there yet).

Tutorial
--------

There are currently two modules: bag_of_words and cluster. Bag_of_words is undergoing experimentation with its tokenization for speed improvements, and cluster currently only contains DBSCAN.

[bag_of_words]

The purpose of the bag_of_words module (henceforth called BOW) is to do exactly what it sounds like: support bag of words representations of text documents. This representation is helpful for document similarity and clustering analysis. To read more see: http://bit.ly/1qlKOtp

The BOW module is class focused, so there is a definite workflow that can be followed to keep track of your information. The main class is called 'Bow' and this tutorial will focus on using it. (Conceptual sketches of what steps 2-4 do under the hood appear at the end of this README.)

1. The first step is to import the module and create a Bow object. You have several options as to how you want to use this object: with a SQL query, a directory or a pandas dataframe. We will use a directory in this example.

       $ ipython
       In [1]: from base_nlp import bag_of_words as nlp
       In [2]: bow = nlp.Bow(method='dir', topdir='/work/username/file/')

   Here we passed a method parameter telling the object we want to import from text files, plus the location of those files. This creates an object we can work with, called 'bow'.

2. Next we 'vectorize' our text. This turns the text into 'tokens' (individual words) and then into a bag of words representation, using vectors.

       In [3]: bow.vectorizer(stem=True)

   Here we use the 'stem' option, which cuts off the suffixes of words for better matching (e.g. 'doorknob' and 'doorknobs' become the same word). This is useful for accuracy and speed, but can hurt readability of the text. Also notice that the vectorizer did not return the matrix or the vocab we would expect. That is because they are stored inside the object; to access them call:

       In [4]: bow.word_matrix
       Out[4]: <1000x1000 sparse matrix ...>
       In [5]: bow.vocab
       Out[5]: {word1 : 1}...

3. Sometimes the vocabulary is too large, so we can use rule based measures to remove some values.

       In [6]: bow.trim_vocab(max_word_pct=.3, max_vocab_size=100)

   This removes all words that appear in more than 30% of all documents and restricts the vocabulary to at most 100 words.

4. The final step is to transform the documents into a tf-idf representation. This weights the terms according to their relative importance; for more info see: http://bit.ly/1sIh6Re

       In [7]: bow.tfidf()

   And we are done! By calling bow.word_matrix you can get access to your newly transformed data and use it for clustering or any number of things!

5. There is further functionality in this module, such as gensim compatibility and csv writers; these functions are documented in the code.

Short Term
----------

1. Investigate stemmer options, particularly Porter and Lovins

Long Term
---------

1. N-gram support
2. Include tools that can take advantage of the bag of words approach, like topic modelling and latent semantic analysis
3. Include native support for many different distance measures (or just use scipy?)
4. Cython?
5. More advanced tokenization methods (Penn Treebank, Stanford)
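Conceptual Sketches
-------------------

The snippets below are not part of base_nlp; they are minimal, self-contained Python sketches of what steps 2-4 of the tutorial do conceptually. Every name in them (tokenize, word_matrix, and so on) is illustrative rather than the package's actual internals.

First, the bag of words representation from step 2: split each document into tokens, assign each distinct token a column index (the vocab), and count occurrences per document.

    import re

    docs = ["the cat sat", "the dog sat on the cat"]

    def tokenize(text):
        # crude word tokenizer; base_nlp's tokenizer is more involved
        return re.findall(r"[a-z']+", text.lower())

    # vocab maps each word to a column index
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))

    # word_matrix[i][j] = count of vocab word j in document i
    word_matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            word_matrix[i][vocab[tok]] += 1

    print(vocab)        # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'on': 4}
    print(word_matrix)  # [[1, 1, 1, 0, 0], [2, 1, 1, 1, 1]]

A real implementation stores word_matrix as a sparse matrix (as the Out[4] above shows), since most documents contain only a small fraction of the vocabulary.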
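Next, the rule based trimming from step 3: drop every word whose document frequency exceeds max_word_pct, then cap the vocabulary at max_vocab_size. How base_nlp breaks ties when capping is not documented here; this sketch keeps the highest-count words, which is one plausible choice.

    def trim_vocab(docs_tokens, max_word_pct=0.3, max_vocab_size=100):
        n_docs = len(docs_tokens)
        doc_freq, total = {}, {}
        for toks in docs_tokens:
            for tok in set(toks):
                doc_freq[tok] = doc_freq.get(tok, 0) + 1
            for tok in toks:
                total[tok] = total.get(tok, 0) + 1
        # rule 1: drop words appearing in more than max_word_pct of documents
        kept = [w for w in total if doc_freq[w] / n_docs <= max_word_pct]
        # rule 2: cap the vocabulary size (here: keep the highest total counts)
        kept.sort(key=lambda w: total[w], reverse=True)
        return {w: i for i, w in enumerate(kept[:max_vocab_size])}

    docs_tokens = [["cat", "sat"], ["dog", "sat"], ["cat", "dog", "sat"]]
    # 'sat' appears in 100% of documents, so it is dropped
    print(trim_vocab(docs_tokens, max_word_pct=0.7))  # {'cat': 0, 'dog': 1}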
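Finally, the tf-idf weighting from step 4, using the common tf * log(N / df) form; base_nlp's exact variant (smoothing, normalization) may differ.

    import math

    def tfidf(word_matrix):
        n_docs = len(word_matrix)
        n_words = len(word_matrix[0])
        # df[j] = number of documents containing word j
        df = [sum(1 for row in word_matrix if row[j] > 0)
              for j in range(n_words)]
        return [[tf * math.log(n_docs / df[j]) if tf else 0.0
                 for j, tf in enumerate(row)]
                for row in word_matrix]

    counts = [[1, 1, 1, 0, 0], [2, 1, 1, 1, 1]]  # matrix from the first sketch
    for row in tfidf(counts):
        print([round(v, 3) for v in row])
    # [0.0, 0.0, 0.0, 0.0, 0.0]
    # [0.0, 0.0, 0.0, 0.693, 0.693]

Note that words appearing in every document get a weight of zero, which is the same intuition behind the max_word_pct rule in trim_vocab.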