
Possible cleaning function for reimport #86

Open

trinker opened this issue Sep 24, 2018 · 5 comments

Comments


trinker commented Sep 24, 2018

This would belong in textclean, but note how abbreviated forms like fan vs. fanatic score differently:

> sentiment(c("He's a nice guy", "can be a jerk. I'm not a fan."))
   element_id sentence_id word_count sentiment
1:          1           1          4      0.25
2:          2           1          4     -0.25
3:          2           2          4      0.00
> sentiment(c("He's a nice guy", "can be a jerk. I'm not a fanatic."))
   element_id sentence_id word_count sentiment
1:          1           1          4      0.25
2:          2           1          4     -0.25
3:          2           2          4      0.25

The abbreviated forms could be replaced:

WIP

## Define the pronoun patterns first; non-capturing groups keep the
## backreference numbering stable
pronouns <- c("s?he(?: i|')s", "(?:you|they|we)(?: a|')re", "I(?: a|')m")
pro_replacements <- paste0('(?:', paste(pronouns, collapse = '|'), ')')

fix_fan <- function(x, ...){
    ## \1 = pronoun phrase plus intervening text, \2 = 'fan', \3 = optional 's'
    gsub(
        paste0('(', pro_replacements, '.*?\\b)(fan)(s?)\\b'),
        '\\1\\2atic\\3',
        x, perl = TRUE, ignore.case = TRUE
    )
}

fix_fan("He's the biggest fan I know.")
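Assuming the corrected pattern above, this should give:

## [1] "He's the biggest fanatic I know."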

trinker commented Sep 24, 2018

This would live in textclean but be re-exported by sentimentr.
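For reference, roxygen2's documented re-export idiom would make that a one-liner on the sentimentr side (a sketch, assuming fix_fan ends up exported from textclean):

#' @importFrom textclean fix_fan
#' @export
textclean::fix_fan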


trinker commented Sep 24, 2018

inputs <- c(
    "He's the bigest fan I know.",
    "I am a huge fan of his.",
    "I know she has lots of fans in his club",
    "I was cold and turned on the fan",
    "An air conditioner is better than 2 fans at cooling.",
    "I'm a really gigantic and humble fan of the book."
)

fix_fan <- function(x, pronoun.distance = 20, ...){

    ## Replace 'fan'/'fans' only when a personal pronoun phrase occurs
    ## within `pronoun.distance` characters before it
    gsub(
        paste0("((?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m).{1,", pronoun.distance, "})\\b(fan)(s?)\\b"),
        '\\1\\2atic\\3',
        x,
        perl = TRUE,
        ignore.case = TRUE
    )

}


fix_fan2 <- function(x, pronoun.distance = 20, ...){

    ## Same rule via stringi's ICU engine (note the $1 backreference syntax)
    stringi::stri_replace_all_regex(
        x,
        paste0("((?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m).{1,", pronoun.distance, "})\\b(fan)(s?)\\b"),
        '$1$2atic$3',
        opts_regex = stringi::stri_opts_regex(case_insensitive = TRUE)
    )

}

fix_fan(inputs)
fix_fan(inputs, 30)
fix_fan2(inputs)
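
For reference, here is what I expect these to return (tracing the regex by hand; with the default distance of 20 the last sentence is left alone because the pronoun sits more than 20 characters from fan, while fix_fan(inputs, 30) also converts it):

## [1] "He's the biggest fanatic I know."
## [2] "I am a huge fanatic of his."
## [3] "I know she has lots of fanatics in his club"
## [4] "I was cold and turned on the fan"
## [5] "An air conditioner is better than 2 fans at cooling."
## [6] "I'm a really gigantic and humble fan of the book."

The mechanical fan in sentences 4 and 5 is correctly left alone, which is the point of the pronoun-distance rule.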


trinker commented Sep 24, 2018

Other examples include:

tibble::tribble(
  ~short,   ~long,
  "fan",    "fanatic",
  "emo",    "emotionally disturbed"
)
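
The same pronoun-distance rule could be driven by such a table (a rough sketch; fix_shortenings is a hypothetical name and the plural handling from fix_fan is omitted for brevity):

fix_shortenings <- function(x, pronoun.distance = 20, ...){

    shortenings <- tibble::tribble(
      ~short,   ~long,
      "fan",    "fanatic",
      "emo",    "emotionally disturbed"
    )

    pro <- "(?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m)"

    ## Apply each short -> long mapping, but only within
    ## `pronoun.distance` characters of a personal pronoun phrase
    for (i in seq_len(nrow(shortenings))) {
        x <- gsub(
            paste0("(", pro, ".{1,", pronoun.distance, "})\\b",
                   shortenings$short[i], "\\b"),
            paste0("\\1", shortenings$long[i]),
            x, perl = TRUE, ignore.case = TRUE
        )
    }

    x
}

fix_shortenings("I'm not a fan and he's pretty emo.")
## Expected: "I'm not a fanatic and he's pretty emotionally disturbed."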


trinker commented Sep 24, 2018

Note these are called shortenings:

https://en.oxforddictionaries.com/spelling/shortenings

and more formally: https://en.wikipedia.org/wiki/Clipping_(morphology)


trinker commented Sep 24, 2018

## Check whether these clipped forms already carry polarity scores
y <- c('tazer', 'emo', 'typo', 'quake', 'scram')
lexicon::hash_sentiment_jockers_rinker[y]

Consider adding these terms to the polarity table directly.
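
One way that could look, given that the polarity keys are data.tables keyed on x with scores in y (a sketch; the scores below are placeholders, not vetted values, and sentimentr::update_key() may be the cleaner route):

library(data.table)

## Placeholder scores for any terms the lookup above shows as missing
additions <- data.table(x = c('tazer', 'scram'), y = c(-0.5, -0.5))

new_key <- rbindlist(list(lexicon::hash_sentiment_jockers_rinker, additions))
setkey(new_key, x)

sentimentr::sentiment("He told me to scram.", polarity_dt = new_key)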
