Homework 1
Jinho D. Choi edited this page Feb 2, 2015
Hashtags found in social networks (e.g., Twitter, Instagram) do not allow white spaces, so all words are glued together, which makes them hard to interpret. Your task is to write a program that segments hashtags into word tokens separated by white spaces. For instance, your program takes an input string (e.g., "#therealdeal") and generates a list of strings representing the most likely word sequence (e.g., ["the", "real", "deal"]).
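To make "most likely word sequence" concrete, here is a minimal sketch that scores candidate segmentations with a unigram model and add-one (Laplace) smoothing. The class name ToyUnigramScorer and the word counts are made up for illustration; the sketch does not use the course's LaplaceSmoothing or ILanguageModel classes, whose signatures are not shown on this page.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy scorer: unigram counts with add-one (Laplace) smoothing. */
class ToyUnigramScorer {
    private final Map<String, Integer> counts = new HashMap<>();
    private long total = 0;

    /** Registers an (invented) corpus count for a word. */
    void add(String word, int count) {
        counts.merge(word, count, Integer::sum);
        total += count;
    }

    /** log P(word) with add-one smoothing; one extra vocabulary slot is reserved for unseen words. */
    double logProb(String word) {
        int c = counts.getOrDefault(word, 0);
        int vocabulary = counts.size() + 1;
        return Math.log((c + 1.0) / (total + vocabulary));
    }

    /** Sum of word log-probabilities; a higher score means a more likely sequence. */
    double score(List<String> words) {
        double sum = 0.0;
        for (String w : words) sum += logProb(w);
        return sum;
    }

    public static void main(String[] args) {
        ToyUnigramScorer lm = new ToyUnigramScorer();
        lm.add("the", 500);
        lm.add("real", 40);
        lm.add("deal", 30);
        lm.add("there", 60);
        // Two candidate segmentations of "therealdeal"; "al" is unseen.
        System.out.println(lm.score(Arrays.asList("the", "real", "deal")));
        System.out.println(lm.score(Arrays.asList("there", "al", "deal")));
    }
}
```

With these invented counts, the first candidate scores higher because the unseen token "al" receives only the small smoothed probability. A real solution would estimate counts from the Quiz 1 data and would likely use higher-order n-grams.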
- Create a class LastnameSegment (e.g., ChoiSegment) under segment.
- Extend AbstractSegment.
- Use either Backoff or Interpolation for the language model. If you create your own language model different from the ones above, make sure it implements ILanguageModel.
- Estimate the n-gram likelihoods using the data in Quiz 1. Use either LaplaceSmoothing or DiscountSmoothing. If you create your own smoothing, make sure it implements ISmoothing.
- Make sure you handle unseen words (e.g., proper names, numbers).
- Feel free to use the code in Segment.
- Test your program using the following tags.
#2thingsthatdontmix
#10turnons
#90s
#100thingsaboutme
#alliwant
#annoyingbios
#aprilfoolsjokes
#areallygoodejob
#arelationshipisoverwhen
#artistseveryonelikesbutidont
- Tune your program so that it produces correct answers where possible. I'll test your program with many more tags after submission.
- Briefly explain what improvements you made to your program compared to the one we discussed in class.
- Write a report that explains how your program handles (or fails to handle) the above examples.
- Submit your report and all necessary Java files.
- Improve your program so that it handles long hashtags (e.g., the last example above) without taking too much time. Explain your approach in your report.
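
One possible way to keep long hashtags tractable, sketched here under stated assumptions, is dynamic programming over split positions: for each prefix, keep only the best-scoring segmentation instead of enumerating every split. This is not necessarily the approach discussed in class. The class ToyDpSegmenter, its toy log-probabilities, the unseen-word penalty, and the maximum word length are all hypothetical, and the sketch uses a unigram score rather than the course's AbstractSegment or ILanguageModel types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: best unigram segmentation via dynamic programming. */
class ToyDpSegmenter {
    private static final int MAX_WORD_LENGTH = 20;   // assumption: no word is longer than this
    private final Map<String, Double> logProbs;

    ToyDpSegmenter(Map<String, Double> logProbs) {
        this.logProbs = logProbs;
    }

    /** Known words use the table; unseen chunks get an invented length-based penalty. */
    private double wordLogProb(String word) {
        Double lp = logProbs.get(word);
        return lp != null ? lp : -10.0 - 2.0 * word.length();
    }

    List<String> segment(String hashtag) {
        String s = hashtag.startsWith("#") ? hashtag.substring(1) : hashtag;
        int n = s.length();
        double[] best = new double[n + 1];   // best[i]: best score for the prefix s[0, i)
        int[] back = new int[n + 1];         // back[i]: start index of the last word in that prefix
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - MAX_WORD_LENGTH); start < end; start++) {
                double candidate = best[start] + wordLogProb(s.substring(start, end));
                if (candidate > best[end]) {
                    best[end] = candidate;
                    back[end] = start;
                }
            }
        }
        // Follow the back-pointers from the end of the string to recover the words.
        LinkedList<String> words = new LinkedList<>();
        for (int end = n; end > 0; end = back[end]) {
            words.addFirst(s.substring(back[end], end));
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> lp = new HashMap<>();
        lp.put("the", -2.0);
        lp.put("real", -5.0);
        lp.put("deal", -5.0);
        lp.put("there", -4.0);
        System.out.println(new ToyDpSegmenter(lp).segment("#therealdeal"));
    }
}
```

The table is filled with at most MAX_WORD_LENGTH candidate words per position, so the number of scored substrings grows linearly with the hashtag length rather than exponentially. Extending the sketch to bigrams or to the Backoff and Interpolation models would require carrying the previous word in the DP state.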
©2015 Emory University