Workshop Content

All slides - http://bengfort.com/presentations/discourses-in-language-processing/

GitHub - https://github.com/bbengfort

Introduction

A 10,000-foot view of NLP and NLTK

  • http://bengfort.com/presentations/discourses-in-language-processing/skyview/
  • Google has been successful because it has a huge training set: people clicking on the right 'answer'
  • what is required?
    1. Domain knowledge
    2. A corpus in the domain
  • the NLP pipeline: http://bengfort.com/presentations/discourses-in-language-processing/img/skyview/pipeline.png
    • today, we are ignoring the first two columns
    • morphology: the study of the forms of things, words in particular:
      • orthographic rules: puppy -> puppies
      • morphological rules: goose -> geese, fish -> fish
      • parsing tasks: stemming (or lemmatization) and tokenization
        • tokens = symbols of language
        • words = tokens with meaning
        • stem = what you would look up in the dictionary (strictly speaking, that is the lemma)
    • syntax = the study of the rules for formation of sentences
    • semantics = the study of meaning
  • Leveraging NLTK (https://github.com/nltk/nltk)
    • "NLP is perfect for MapReduce" (Hadoop)
    • major packages:
      • Utilities:
        • probability (Frequency and Conditional Distributions)
        • text, data, grammar, and corpus
        • tree (An impressive tree data structure and subclasses)
        • draw (Visualizations in Tkinter)
      • Language Processing
        • tokenize, stem (Morphological Processing, Segmentation)
        • collocations, models (NGram Analysis)
        • tag, chunk (Tagging and Named Entity Recognition)
        • parse (Syntactic Parsing)
        • sem (Semantic Analyses)
        • more: classification, clustering
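
To make the package list above a little more concrete, here is a minimal sketch that touches tokenize, stem, probability, n-grams, and tree on a single sentence. The sentence and the bracketed parse are made up for illustration (the parse is written by hand, not produced by a parser), and `wordpunct_tokenize` and `PorterStemmer` are chosen only because they run without downloading any NLTK data. Note that a Porter stem is a truncated form and is not always a dictionary word.

```python
from nltk import FreqDist, Tree
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.util import bigrams

# Illustrative sentence (an assumption, not from the workshop slides).
text = "The puppies chased the geese."

# tokenize: regex-based splitter that needs no downloaded models.
tokens = [t.lower() for t in wordpunct_tokenize(text)]

# stem: rule-based suffix stripping; the results may not be real words.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# probability: a frequency distribution over the tokens.
freq = FreqDist(tokens)

# n-gram analysis: adjacent token pairs.
pairs = list(bigrams(tokens))

# tree: build a syntax tree from a hand-written bracketed string.
tree = Tree.fromstring("(S (NP the puppies) (VP chased (NP the geese)))")

print(tokens)
print(stems)
print(freq.most_common(1))
print(pairs[0])
print(tree.label(), tree.leaves())
```

Other tokenizers, the WordNet lemmatizer, and the taggers generally require a one-time `nltk.download(...)` of the relevant data before they can be used.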

Organizing text - The management of large bodies of natural language - corpora

My own thoughts

  • things to look into