TidierText.jl

What is TidierText.jl?

TidierText.jl is a 100% Julia implementation of the R tidytext package. The purpose of the package is to make it easy to analyze text data using DataFrames.

An extensive guide to tidy text analysis by Julia Silge and David Robinson is available here: https://www.tidytextmining.com/.

Installation

For the development version:

using Pkg
Pkg.add(url="https://github.com/TidierOrg/TidierText.jl")

What functions does TidierText.jl support?

@bind_tf_idf()
@unnest_tokens()
@unnest_regex()
@unnest_characters()
@unnest_ngrams()
get_stopwords()
tidy()
nma_words

How does the package work?

Let's load the package and read in the UCLA Fall 2018 course dataset.

using TidierData
using TidierText

using CSV

courses = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ucla_f18.csv"), DataFrame)

What are the course names?

@chain courses begin
  @select(id = rownames, course)
  @slice(1:10)
end

10×2 DataFrame
 Row │ id     course                            
     │ Int64  String                            
─────┼──────────────────────────────────────────
   1 │     1  Leadership Laboratory
   2 │     2  Heritage and Values
   3 │     3  Team and Leadership Fundamentals
   4 │     4  Air Force Leadership Studies
   5 │     5  National Security Affairs/Prepar…
   6 │     6  Introduction to Black Studies
   7 │     7  African American Musical Heritage
   8 │     8  UCLA Centennial Initiative: Arth…
   9 │     9  UCLA Centennial Initiative: Soci…
  10 │    10  Student Research Program

Let's tokenize the course names and convert them to lowercase.

tokens = @chain courses begin
  @select(id = rownames, course)
  @slice(1:10)
  @unnest_tokens(word, course, to_lower = true)
end;

@chain tokens @slice(1:10)

10×2 DataFrame
 Row │ id     word         
     │ Int64  SubStrin…    
─────┼─────────────────────
   1 │     1  leadership
   2 │     1  laboratory
   3 │     2  heritage
   4 │     2  and
   5 │     2  values
   6 │     3  team
   7 │     3  and
   8 │     3  leadership
   9 │     3  fundamentals
  10 │     4  air

Let's add the term frequency, inverse document frequency, and the tf-idf.

@chain tokens begin
  @count(id, word)
  @bind_tf_idf(word, id, n)
  @slice(1:10)
end

10×6 DataFrame
 Row │ id     word          n      tf        idf       tf_idf   
     │ Int64  SubStrin…     Int64  Float64   Float64   Float64  
─────┼──────────────────────────────────────────────────────────
   1 │     1  leadership        1  0.5       1.20397   0.601986
   2 │     1  laboratory        1  0.5       2.30259   1.15129
   3 │     2  heritage          1  0.333333  1.60944   0.536479
   4 │     2  and               1  0.333333  0.916291  0.30543
   5 │     2  values            1  0.333333  2.30259   0.767528
   6 │     3  team              1  0.25      2.30259   0.575646
   7 │     3  and               1  0.25      0.916291  0.229073
   8 │     3  leadership        1  0.25      1.20397   0.300993
   9 │     3  fundamentals      1  0.25      2.30259   0.575646
  10 │     4  air               1  0.25      2.30259   0.575646
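
As a rough check on the table above, the values are consistent with the usual tidytext-style definitions, assuming a natural logarithm and treating each of the ten sliced course names as one document. The symbols below (n, N, t, d) are notation introduced here for illustration and are not taken from the package source.

```latex
% Assumed notation: n_{t,d} = count of term t in document d,
% N = number of documents (here N = 10).
\begin{align*}
\mathrm{tf}(t,d)  &= \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \\
\mathrm{idf}(t)   &= \ln\!\left(\frac{N}{\lvert\{d : n_{t,d} > 0\}\rvert}\right) \\
\mathrm{tf\_idf}(t,d) &= \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\end{align*}

% Worked example, row 1 ("leadership" in document 1):
% document 1 has 2 tokens, and "leadership" occurs in 3 of the 10 documents.
\begin{align*}
\mathrm{tf} &= \tfrac{1}{2} = 0.5 \\
\mathrm{idf} &= \ln\!\left(\tfrac{10}{3}\right) \approx 1.20397 \\
\mathrm{tf\_idf} &= 0.5 \times 1.20397 \approx 0.601986
\end{align*}
```

The same reasoning explains why "laboratory", which occurs in only one course name, gets the largest idf in the table: ln(10/1) ≈ 2.30259.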
