Homework 1

Word Segmentation

Hashtags on social networks (e.g., Twitter, Instagram) cannot contain white space, so all of their words are glued together, which makes them hard to interpret. Your task is to write a program that segments a hashtag into word tokens. For instance, your program takes an input string (e.g., "#therealdeal") and generates a list of strings representing the most likely word sequence (e.g., ["the", "real", "deal"]).
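To get a feel for the search space, the sketch below lists every possible way to split a hashtag into tokens. It is only an illustration: the class SegmentationEnumerator and its methods are hypothetical names and are not part of the course code. A string of n characters has 2^(n-1) segmentations, so this brute-force listing is practical only for short tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the course code): lists every split of a hashtag. */
final class SegmentationEnumerator {
    /** Returns all segmentations of the tag, ignoring a leading '#'. */
    static List<List<String>> enumerate(String hashtag) {
        String s = hashtag.startsWith("#") ? hashtag.substring(1) : hashtag;
        List<List<String>> out = new ArrayList<>();
        if (!s.isEmpty()) split(s, 0, new ArrayList<>(), out);
        return out;
    }

    /** Extends the current prefix with every token that starts at index begin. */
    private static void split(String s, int begin, List<String> prefix, List<List<String>> out) {
        if (begin == s.length()) {
            out.add(new ArrayList<>(prefix)); // copy, because prefix is reused
            return;
        }
        for (int end = begin + 1; end <= s.length(); end++) {
            prefix.add(s.substring(begin, end));
            split(s, end, prefix, out);
            prefix.remove(prefix.size() - 1); // backtrack
        }
    }
}
```

For example, SegmentationEnumerator.enumerate("#therealdeal") returns 1,024 candidate sequences, one of which is ["the", "real", "deal"]. The language model's job is to pick the most likely candidate.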

Task 1

Create a class LastnameSegment (e.g., ChoiSegment) under segment.
Extend AbstractSegment.
Use either Backoff or Interpolation for the language model. If you create your own language model different from the ones above, make sure it implements ILanguageModel.
Estimate the n-gram likelihoods using the data in Quiz 1. Use either LaplaceSmoothing or DiscountSmoothing. If you create your own smoothing, make sure it implements ISmoothing.
Make sure you handle unseen words (e.g., proper names, numbers).
Feel free to use the code in Segment.
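
As a rough illustration of the ideas in this task, the sketch below combines add-alpha (Laplace) smoothing with linear interpolation of bigram and unigram estimates. It is a standalone toy: the class ToyInterpolatedModel, its methods, and the parameters alpha and lambda are assumptions made for this example, and the class does not implement the course's ILanguageModel or ISmoothing interfaces. Unseen words are handled by reserving one extra vocabulary slot, so they receive a small nonzero probability.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model; it does not implement the course's ILanguageModel. */
final class ToyInterpolatedModel {
    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();
    private final Map<String, Integer> contexts = new HashMap<>(); // counts of a word as bigram history
    private final double alpha;  // add-alpha (Laplace) constant
    private final double lambda; // weight of the bigram estimate, between 0 and 1
    private int total = 0;       // number of training tokens

    ToyInterpolatedModel(List<String> tokens, double alpha, double lambda) {
        this.alpha = alpha;
        this.lambda = lambda;
        String prev = null;
        for (String word : tokens) {
            unigrams.merge(word, 1, Integer::sum);
            if (prev != null) {
                bigrams.merge(prev + " " + word, 1, Integer::sum);
                contexts.merge(prev, 1, Integer::sum);
            }
            prev = word;
            total++;
        }
    }

    /** Vocabulary size plus one slot reserved for all unseen words. */
    private int vocabulary() {
        return unigrams.size() + 1;
    }

    /** P(word) with add-alpha smoothing; unseen words get a small nonzero mass. */
    double unigram(String word) {
        return (unigrams.getOrDefault(word, 0) + alpha) / (total + alpha * vocabulary());
    }

    /** P(word | prev) with add-alpha smoothing. */
    double bigram(String prev, String word) {
        int joint = bigrams.getOrDefault(prev + " " + word, 0);
        int history = contexts.getOrDefault(prev, 0);
        return (joint + alpha) / (history + alpha * vocabulary());
    }

    /** Linear interpolation of the bigram and unigram estimates. */
    double interpolated(String prev, String word) {
        return lambda * bigram(prev, word) + (1 - lambda) * unigram(word);
    }
}
```

With the training tokens the, real, deal, the, real, thing and alpha = 1, the unigram estimate for "the" is (2 + 1) / (6 + 5) = 3/11, and an unseen word still gets 1/11. A backoff model would instead fall back to the unigram estimate only when the bigram was never observed.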

Task 2

Test your program using the following tags.

#2thingsthatdontmix
#10turnons
#90s
#100thingsaboutme
#alliwant
#annoyingbios
#aprilfoolsjokes
#areallygoodejob
#arelationshipisoverwhen
#artistseveryonelikesbutidont

Tune your program so that it produces correct answers where possible. I'll test your program with many more tags after submission.

Task 3

Briefly explain what improvements you made to your program compared to the one we discussed in class.
Write a report that explains how your program handles (or doesn't handle) the above examples.
Submit your report and all necessary Java files.

Extra Credit

Improve your program so that it handles long hashtags (e.g., the last example above) without taking too much time. Explain your approach in your report.
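
One possible approach, sketched here under several assumptions, is dynamic programming: compute the best segmentation of every prefix once and reuse it, rather than scoring every possible split, whose number grows exponentially with the tag length. The class DynamicSegmenter, the word-length cap, the unigram log-probability map, and the penalty for unseen tokens are all hypothetical choices made for this example; they are not taken from the course code, and a bigram model would need the previous word added to the state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the course's AbstractSegment): unigram segmentation by dynamic programming. */
final class DynamicSegmenter {
    private static final int MAX_WORD_LENGTH = 20; // assumed cap on token length

    /** Log-score of a token: its log-probability if known, otherwise an assumed penalty. */
    static double score(String token, Map<String, Double> logProb) {
        Double known = logProb.get(token);
        // Assumed penalty: a fixed cost per unseen token plus a cost per character,
        // so one long unseen token beats the same characters split into pieces.
        return known != null ? known : -10.0 - 10.0 * token.length();
    }

    /** Returns the highest-scoring segmentation, ignoring a leading '#'. */
    static List<String> segment(String hashtag, Map<String, Double> logProb) {
        String s = hashtag.startsWith("#") ? hashtag.substring(1) : hashtag;
        int n = s.length();
        double[] best = new double[n + 1]; // best[i]: best score for the prefix s[0, i)
        int[] back = new int[n + 1];       // back[i]: start index of the last token in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int begin = Math.max(0, end - MAX_WORD_LENGTH); begin < end; begin++) {
                double candidate = best[begin] + score(s.substring(begin, end), logProb);
                if (candidate > best[end]) {
                    best[end] = candidate;
                    back[end] = begin;
                }
            }
        }
        // Follow the back pointers from the end of the string to recover the tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(s.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }
}
```

This sketch scores at most n × MAX_WORD_LENGTH candidate tokens for a tag of n characters. Given log-probabilities for "the", "real", and "deal", DynamicSegmenter.segment("#therealdeal", logProb) picks those three tokens over any split containing unseen tokens, because each unseen token costs at least 20 under the assumed penalty.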

Artificial Intelligence


Instructor

Jinho D. Choi

Emory University
