Reduce tweet fields returned by default #558

hadley · 2021-04-05T13:25:36Z

Currently, search_tweets() and friends return a data frame with 73 columns:

 [1] "status_id"               "created_at"              "user_id"                
 [4] "screen_name"             "text"                    "source"                 
 [7] "display_text_width"      "reply_to_status_id"      "reply_to_user_id"       
[10] "reply_to_screen_name"    "is_quote"                "is_retweet"             
[13] "favorite_count"          "retweet_count"           "quote_count"            
[16] "reply_count"             "hashtags"                "symbols"                
[19] "urls_url"                "urls_t.co"               "urls_expanded_url"      
[22] "media_url"               "media_t.co"              "media_expanded_url"     
[25] "media_type"              "ext_media_url"           "ext_media_t.co"         
[28] "ext_media_expanded_url"  "ext_media_type"          "ext_alt_text"           
[31] "mentions_user_id"        "mentions_screen_name"    "lang"                   
[34] "quoted_status_id"        "quoted_text"             "quoted_created_at"      
[37] "quoted_source"           "quoted_favorite_count"   "quoted_retweet_count"   
[40] "quoted_user_id"          "quoted_screen_name"      "quoted_name"            
[43] "quoted_followers_count"  "quoted_friends_count"    "quoted_statuses_count"  
[46] "quoted_location"         "quoted_description"      "quoted_verified"        
[49] "retweet_status_id"       "retweet_text"            "retweet_created_at"     
[52] "retweet_source"          "retweet_favorite_count"  "retweet_retweet_count"  
[55] "retweet_user_id"         "retweet_screen_name"     "retweet_name"           
[58] "retweet_followers_count" "retweet_friends_count"   "retweet_statuses_count" 
[61] "retweet_location"        "retweet_description"     "retweet_verified"       
[64] "place_url"               "place_name"              "place_full_name"        
[67] "place_type"              "country"                 "country_code"           
[70] "geo_coords"              "coords_coords"           "bbox_coords"            
[73] "status_url"

I'd suggest that we return fewer, more complex columns by default, and instead provide some helpers to access their contents when needed. For example, we could keep media_*, quoted_* and retweet_* in media, quoted and retweet columns and provide helpers to expand them out when needed.
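
A minimal sketch of that idea, using base R only. The toy data, the `quoted` list-column and the `expand_quoted()` helper are hypothetical names chosen for illustration; they are not rtweet functions and do not reflect how the package actually implements this.

```r
# Hypothetical toy data: one row per tweet. The quoted-tweet fields live in a
# single list-column instead of many separate quoted_* columns.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("first tweet", "second tweet"),
  stringsAsFactors = FALSE
)
tweets$quoted <- list(
  list(status_id = "10", text = "a quoted tweet", screen_name = "someone"),
  NULL # the second tweet does not quote anything
)

# Hypothetical helper: expand the nested `quoted` column into flat
# quoted_* columns only when the user asks for it.
expand_quoted <- function(x) {
  fields <- c("status_id", "text", "screen_name")
  for (f in fields) {
    x[[paste0("quoted_", f)]] <- vapply(
      x$quoted,
      function(q) if (is.null(q)) NA_character_ else q[[f]],
      character(1)
    )
  }
  x$quoted <- NULL # drop the nested column once it has been expanded
  x
}

flat <- expand_quoted(tweets)
names(flat)
```

The default result stays narrow (three columns here), and the wide layout is produced only on request.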


hadley · 2021-04-05T13:43:40Z

IOW, I'm suggesting that the data frame unpacking that currently occurs in tweets_to_tbl_() should be postponed until the user requests it.

simonheb · 2021-10-04T16:35:34Z

Is it intentional/temporary that since #572 get_timeline() and lists_statuses() do not return things like the screen_name of the sender anymore?

llrs · 2021-10-04T18:11:47Z

It is an error I didn't detect when I merged the PR. Thanks @simonheb for asking!! The data is incorrectly processed and later lost (but it should be available via user(search_tweets("bla"))).

llrs · 2021-10-04T21:54:12Z

Sorry @simonheb, I looked into the issue further and it turned out I used incorrect code: user is internal to rtweet and shouldn't be used externally.

The correct function to retrieve the screen name is users_data(), which returns all the information provided by the API about the user. This is what you should use, and it is also documented in the search_tweets example.

simonheb · 2021-10-07T09:14:39Z

Ok thanks.

But this is just a quick fix, no? In the long run lists_statuses, etc. should also return user data, no?

llrs · 2021-10-07T12:49:55Z

They already return this data, but it is stored in an attribute. It is not a quick fix: I did not have to change anything between the comments for this to work. I agree with Hadley that having 70 columns was not practical.
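
To illustrate what "on an attribute" means, here is a base R sketch with made-up data. The attribute name `users` and the accessor `get_users()` are assumptions for illustration only; in rtweet itself the supported accessor is users_data(), and its internals may differ from this.

```r
# Hypothetical tweet data with only tweet-level columns.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("hello", "world"),
  stringsAsFactors = FALSE
)

# User information travels with the object as an attribute instead of as
# extra columns (the attribute name "users" is an assumption).
attr(tweets, "users") <- data.frame(
  user_id = c("100", "200"),
  screen_name = c("alice", "bob"),
  stringsAsFactors = FALSE
)

# Hypothetical accessor, standing in for what users_data() offers in rtweet.
get_users <- function(x) attr(x, "users", exact = TRUE)

# The tweet data frame itself stays narrow.
names(tweets)

# The screen name can be bound back on when it is needed.
with_screen_name <- cbind(tweets, get_users(tweets)["screen_name"])
names(with_screen_name)
```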

I don't plan to change the columns or how the information is returned anytime soon.
Perhaps it needs its own class and/or printing method. I might add some functions to access some of the nested lists within the object, but I am not going back to one big 73-column data.frame. I understand that this is one more breaking change, but I think that in general it makes working with the output easier.

mkearney · 2021-10-20T17:48:10Z

These changes look great! It's a good idea [and more sustainable] to more closely mirror actual API data structures. I'm excited to see [and hopefully contribute again to] future changes in the pkg as well! Thank you @llrs and @hadley for all the hard but excellent work!

llrs · 2021-10-21T17:52:24Z

Hi @mkearney, many thanks for your encouraging words. Sorry for the surprise when you installed the development version of the package and it broke your scripts. That was not my intention when I offered to help maintain the package.

I am aware that changing the column names will break scripts and other packages; that's one of the reasons why it will take some more time before I think about sending the package to CRAN. There may be one more breaking change before sending it to CRAN, as we were considering renaming the functions. Additionally, there are still some bugs we have introduced that I would like to fix, and I want to make it easier to transition from version 0.7.0 to this one. One way might be to add some helpers to extract columns and rename them. All feedback is welcome, especially if we made something harder or you have other comments to improve the package.

While we've tried to mirror the actual API data structure more closely, one of the main reasons for the package's success is the flattening of the data it does. There is still some work to do on that front, as some of the columns now have a nested structure (and save_as_csv and write_as_csv have not been adapted yet), but flattening is important for analysis and we'll keep it.

mkearney · 2021-10-22T15:23:36Z

You don't need to apologize, @llrs! You've been doing all of the [great] work here, so the last thing you should worry about is my old scripts working. It seems you've been very thoughtful about everything, so in the meantime, I'll try to give this (how to ease the transition for users) some thought and see if I can't help make this happen!

llrs · 2021-11-17T20:41:04Z

@mkearney I've sent you an email at the address listed in the description; I'm not sure whether you receive them...
In case you no longer do, I wanted you to know that some users recently asked for improvements to rtweet and suggested holding a day-long rtweet hackathon to squash bugs and build in support for the new Twitter API v2.

I suggested holding the hackathon on the 27th of November or the 4th of December. See this thread on the package-maintenance channel of the rOpenSci Slack. What do you think? Would you like to join the conversation there?

hadley mentioned this issue Apr 5, 2021

Truncate additional variables after fixed number of lines r-lib/pillar#263

Closed

This was referenced Apr 22, 2021

Simplify response parsing #572

Merged

lookup_users error with output for one specific twitter handle #574

Closed

get_timeline doesn't gather whole texts #575

Closed

llrs closed this as completed Aug 30, 2022
