TwiLoc investigates the feasibility of geographically locating Twitter users based solely on tweet content. It uses deep learning techniques to learn dialect differences across geographic regions and predicts a user's location from their tweets alone, without any other external information. The approach is intended to augment existing systems that locate users.
Requires Python 3.x.
Here is the list of libraries required for this project:
GloVe is used for obtaining vector representations for words.
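As a rough illustration of how GloVe vectors can be consumed, the sketch below parses the plain-text GloVe format (one word per line followed by its space-separated float components) and averages the vectors of a tweet's tokens. The function names `load_glove` and `average_vector` are hypothetical, and averaging is only one simple way to turn word vectors into a tweet-level feature; it is not necessarily how TwiLoc feeds embeddings to its models.

```python
import io


def load_glove(lines):
    """Parse GloVe text-format lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Tiny in-memory stand-in for a real GloVe file (2-dimensional vectors).
sample = io.StringIO("hello 1.0 2.0\nworld 3.0 4.0\n")
embeddings = load_glove(sample)
tweet_vector = average_vector(["hello", "world", "unseen"], embeddings, dim=2)
```

With a downloaded GloVe file, the same loader can be given an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8")`, where `path` points to whichever GloVe release you choose.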
GeoText - Geo-tagged Microblog Corpus is the primary dataset for TwiLoc. All the results and hyperparameter tunings are based on this dataset.
Accuracy could be improved further by training on larger datasets such as UTGeo2011.
Reverse geocoding can be done using services provided by MapQuest.
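A minimal sketch of what a reverse geocoding lookup might look like is shown here. It assumes MapQuest's reverse geocoding endpoint with `key` and `location` query parameters, and a JSON response with `results[0].locations[0]` carrying `adminArea5`, `adminArea3`, and `adminArea1` fields; check these against the current MapQuest documentation before relying on them. The helper names are hypothetical, and the code only builds the request URL and reads an already-parsed response, so it makes no network call.

```python
from urllib.parse import urlencode

# Assumed endpoint for MapQuest reverse geocoding; verify against the official docs.
MAPQUEST_REVERSE_URL = "https://www.mapquestapi.com/geocoding/v1/reverse"


def build_reverse_geocode_url(lat, lng, api_key):
    """Build the request URL for a latitude/longitude pair."""
    query = urlencode({"key": api_key, "location": f"{lat},{lng}"})
    return f"{MAPQUEST_REVERSE_URL}?{query}"


def extract_place(response):
    """Read city, state, and country from a parsed JSON response (assumed field names)."""
    results = response.get("results") or []
    if not results:
        return None
    locations = results[0].get("locations") or []
    if not locations:
        return None
    first = locations[0]
    return {
        "city": first.get("adminArea5"),
        "state": first.get("adminArea3"),
        "country": first.get("adminArea1"),
    }


# Hand-written response used only to exercise the parser; not real API output.
example_response = {
    "results": [
        {"locations": [{"adminArea5": "Pittsburgh", "adminArea3": "PA", "adminArea1": "US"}]}
    ]
}
url = build_reverse_geocode_url(40.44, -79.99, "YOUR_API_KEY")
place = extract_place(example_response)
```

Keeping URL construction and response parsing separate from the HTTP call makes both pieces easy to test without an API key.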
Model | Accuracy (%) |
---|---|
CNN | 57.43 |
GRU | 56.35 |
LSTM | 55.54 |
MLP | 50.59 |
Note: See the report for more detailed information about the experimental results.
- Eisenstein J., O'Connor B., Smith N A., Xing E P. 2010. A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Liu J., Inkpen D. 2015. Estimating User Location in Social Media with Stacked Denoising Autoencoders. Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
- Yin W., Kann K., Yu M., Schütze H. 2017. Comparative Study of CNN and RNN for Natural Language Processing. arXiv:1702.01923.