Jinho D. Choi edited this page Feb 2, 2015 · 1 revision

Word Segmentation

Hashtags on social networks (e.g., Twitter, Instagram) cannot contain white space, so all of their words are glued together, which makes them hard to interpret. Your task is to write a program that segments hashtags into word tokens. For instance, your program takes an input string (e.g., "#therealdeal") and generates a list of strings representing the most likely word sequence (e.g., ["the", "real", "deal"]).
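To make the input and output concrete, here is a minimal sketch of one possible approach, not necessarily the one discussed in class. It tries every split point recursively and keeps the split whose words have the highest total log-probability. The class name NaiveSegmenter, the word counts in COUNTS, and the value of TOTAL are all made up for illustration; a real program would load its counts from a corpus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy hashtag segmenter: tries every split and keeps the highest-scoring one. */
class NaiveSegmenter {
    // Hypothetical unigram counts (assumption); replace with counts from a real corpus.
    static final Map<String, Integer> COUNTS = Map.of(
            "the", 500, "there", 40, "real", 30, "deal", 20, "a", 300, "al", 1);
    static final double TOTAL = 1000.0; // hypothetical corpus size

    /** A word sequence together with its summed log-probability. */
    static final class Candidate {
        final List<String> words;
        final double score;

        Candidate(List<String> words, double score) {
            this.words = words;
            this.score = score;
        }
    }

    /** Returns the best segmentation, or the whole tag as one token if none exists. */
    static List<String> segment(String hashtag) {
        String s = hashtag.startsWith("#") ? hashtag.substring(1) : hashtag;
        s = s.toLowerCase();
        Candidate best = bestSplit(s);
        return best == null ? List.of(s) : best.words;
    }

    /** Exhaustive search: a known first word followed by the best split of the rest. */
    private static Candidate bestSplit(String s) {
        if (s.isEmpty()) return new Candidate(new ArrayList<>(), 0.0);
        Candidate best = null;
        for (int i = 1; i <= s.length(); i++) {
            String head = s.substring(0, i);
            Integer count = COUNTS.get(head);
            if (count == null) continue;          // unknown words are not allowed here
            Candidate rest = bestSplit(s.substring(i));
            if (rest == null) continue;           // the remainder cannot be segmented
            double score = Math.log(count / TOTAL) + rest.score;
            if (best == null || score > best.score) {
                List<String> words = new ArrayList<>();
                words.add(head);
                words.addAll(rest.words);
                best = new Candidate(words, score);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(segment("#therealdeal")); // prints [the, real, deal]
    }
}
```

With these toy counts, "there" + "al" + "deal" is also a valid split, but it scores lower than "the" + "real" + "deal" because "al" is rare. Note that this search re-segments the same suffixes repeatedly, so its running time grows quickly with the length of the tag.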

Task 1

Task 2

  • Test your program using the following tags.
#2thingsthatdontmix
#10turnons
#90s
#100thingsaboutme
#alliwant
#annoyingbios
#aprilfoolsjokes
#areallygoodejob
#arelationshipisoverwhen
#artistseveryonelikesbutidont
  • Tune your program so that it produces correct answers where possible. I'll test your program with many more tags after submission.
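Several of the tags above start with digits, which a word dictionary is unlikely to cover. One possible preprocessing step, sketched below, strips the leading "#" and separates runs of digits from runs of other characters before any word segmentation happens. The class name TagPreprocessor and the method splitDigitRuns are hypothetical names chosen for this sketch, and whether digits should be split off at all is a design decision you need to make yourself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical preprocessing helper: separates digit runs from the rest of a tag. */
class TagPreprocessor {
    /** Splits a tag into maximal runs of digits and non-digits, dropping a leading '#'. */
    static List<String> splitDigitRuns(String hashtag) {
        String s = hashtag.startsWith("#") ? hashtag.substring(1) : hashtag;
        List<String> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            // Close the current chunk at the end of the string or where digit-ness changes.
            boolean boundary = i == s.length()
                    || Character.isDigit(s.charAt(i)) != Character.isDigit(s.charAt(i - 1));
            if (boundary) {
                chunks.add(s.substring(start, i));
                start = i;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitDigitRuns("#10turnons")); // prints [10, turnons]
        System.out.println(splitDigitRuns("#90s"));       // prints [90, s]
    }
}
```

The second call shows a limitation: "#90s" becomes ["90", "s"], although "90s" is arguably a single token. Cases like this are worth discussing in your report.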

Task 3

  • Briefly explain what improvements you made to your program compared to the one we discussed in class.
  • Write a report that explains how your program handles (or doesn't handle) the above examples.
  • Submit your report and all necessary Java files.

Extra Credit

  • Improve your program so that it handles long hashtags (e.g., the last example above) without taking too much time. Explain your approach in your report.
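One common way to avoid the cost of enumerating every possible split is dynamic programming over string positions: compute the best score for each prefix once and reuse it. The sketch below is only one possible approach under stated assumptions. The class name DpSegmenter, the counts in COUNTS, TOTAL, and the MAX_WORD_LENGTH cap are all hypothetical values picked for illustration.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Toy segmenter that scores each prefix once instead of enumerating every split. */
class DpSegmenter {
    // Hypothetical unigram counts (assumption); replace with counts from a real corpus.
    static final Map<String, Integer> COUNTS = Map.of(
            "artists", 5, "everyone", 20, "likes", 15, "but", 80, "i", 200,
            "dont", 30, "art", 10, "like", 25, "every", 12, "one", 60);
    static final double TOTAL = 1000.0;      // hypothetical corpus size
    static final int MAX_WORD_LENGTH = 8;    // longest word in the toy dictionary

    static List<String> segment(String hashtag) {
        String s = (hashtag.startsWith("#") ? hashtag.substring(1) : hashtag).toLowerCase();
        int n = s.length();
        double[] best = new double[n + 1];   // best[i]: top score for the first i characters
        int[] back = new int[n + 1];         // back[i]: start index of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - MAX_WORD_LENGTH); start < end; start++) {
                if (best[start] == Double.NEGATIVE_INFINITY) continue; // prefix not segmentable
                Integer count = COUNTS.get(s.substring(start, end));
                if (count == null) continue;                           // not a known word
                double score = best[start] + Math.log(count / TOTAL);
                if (score > best[end]) {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) return List.of(s);    // no segmentation found
        LinkedList<String> words = new LinkedList<>();
        for (int end = n; end > 0; end = back[end]) {
            words.addFirst(s.substring(back[end], end));               // follow back-pointers
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [artists, everyone, likes, but, i, dont]
        System.out.println(segment("#artistseveryonelikesbutidont"));
    }
}
```

Each end position looks back at most MAX_WORD_LENGTH characters, so the number of dictionary lookups grows linearly with the length of the tag rather than exponentially.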

Artificial Intelligence

Instructor


Emory University
