This repository has been archived by the owner on Nov 10, 2024. It is now read-only.

Reduce tweet fields returned by default #558

Closed
hadley opened this issue Apr 5, 2021 · 10 comments

Comments

@hadley
Collaborator

hadley commented Apr 5, 2021

Currently, search_tweets() and friends return a data frame with 73 columns:

 [1] "status_id"               "created_at"              "user_id"                
 [4] "screen_name"             "text"                    "source"                 
 [7] "display_text_width"      "reply_to_status_id"      "reply_to_user_id"       
[10] "reply_to_screen_name"    "is_quote"                "is_retweet"             
[13] "favorite_count"          "retweet_count"           "quote_count"            
[16] "reply_count"             "hashtags"                "symbols"                
[19] "urls_url"                "urls_t.co"               "urls_expanded_url"      
[22] "media_url"               "media_t.co"              "media_expanded_url"     
[25] "media_type"              "ext_media_url"           "ext_media_t.co"         
[28] "ext_media_expanded_url"  "ext_media_type"          "ext_alt_text"           
[31] "mentions_user_id"        "mentions_screen_name"    "lang"                   
[34] "quoted_status_id"        "quoted_text"             "quoted_created_at"      
[37] "quoted_source"           "quoted_favorite_count"   "quoted_retweet_count"   
[40] "quoted_user_id"          "quoted_screen_name"      "quoted_name"            
[43] "quoted_followers_count"  "quoted_friends_count"    "quoted_statuses_count"  
[46] "quoted_location"         "quoted_description"      "quoted_verified"        
[49] "retweet_status_id"       "retweet_text"            "retweet_created_at"     
[52] "retweet_source"          "retweet_favorite_count"  "retweet_retweet_count"  
[55] "retweet_user_id"         "retweet_screen_name"     "retweet_name"           
[58] "retweet_followers_count" "retweet_friends_count"   "retweet_statuses_count" 
[61] "retweet_location"        "retweet_description"     "retweet_verified"       
[64] "place_url"               "place_name"              "place_full_name"        
[67] "place_type"              "country"                 "country_code"           
[70] "geo_coords"              "coords_coords"           "bbox_coords"            
[73] "status_url"  

I'd suggest that we return fewer, more complex columns by default and instead provide helpers to access their contents when needed. For example, we could keep media_*, quoted_* and retweet_* in media, quoted and retweet columns and provide helpers to expand them out when needed.
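To make the proposal concrete, here is a minimal base-R sketch of the idea, not the package's implementation. The toy data, the `quoted` list-column layout, and the `expand_column()` helper are all hypothetical names chosen for illustration.

```r
# Toy data: one row per tweet, with quoted-tweet details kept in a single
# list-column instead of many quoted_* columns. All names are illustrative.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("first tweet", "second tweet"),
  stringsAsFactors = FALSE
)
tweets$quoted <- list(
  list(status_id = "10", text = "quoted tweet", screen_name = "someone"),
  NULL  # the second tweet quotes nothing
)

# Hypothetical helper: expand a list-column into prefixed flat columns,
# filling NA where an element has no value for a field.
expand_column <- function(df, column) {
  fields <- unique(unlist(lapply(df[[column]], names)))
  for (field in fields) {
    df[[paste0(column, "_", field)]] <- vapply(
      df[[column]],
      function(x) {
        if (is.null(x[[field]])) NA_character_ else as.character(x[[field]])
      },
      character(1)
    )
  }
  df[[column]] <- NULL  # drop the nested column once it is unpacked
  df
}

wide <- expand_column(tweets, "quoted")
names(wide)
```

With this shape the default result stays narrow, and a user who needs the flat `quoted_*` columns asks for them explicitly.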

@hadley
Collaborator Author

hadley commented Apr 5, 2021

IOW, I'm suggesting that the data frame unpacking that currently occurs in tweets_to_tbl_() should be postponed until the user requests it.

@simonheb
Contributor

simonheb commented Oct 4, 2021

Is it intentional/temporary that since #572 get_timeline() and lists_statuses() do not return things like the screen_name of the sender anymore?

@llrs
Collaborator

llrs commented Oct 4, 2021

It is an error I didn't detect when I merged the PR. Thanks @simonheb for asking! That information is incorrectly processed and later lost (but it should be in user(search_tweets("bla"))).

@llrs
Collaborator

llrs commented Oct 4, 2021

Sorry @simonheb, I looked further into the issue and it turned out I used incorrect code: user is internal to rtweet and shouldn't be used externally.

The correct function to retrieve the screen name is users_data(), which returns all the information the API provides about the user. This is what you should use, and it is also documented in the search_tweets example.

@simonheb
Contributor

simonheb commented Oct 7, 2021

Ok thanks.

But this is just a quick fix, no? In the long run lists_statuses, etc. should also return user data, no?

@llrs
Collaborator

llrs commented Oct 7, 2021

They already return this data, but it is stored in an attribute. It is not a quick fix: I did not have to change anything between my comments for this to work. I agree with Hadley that having 70 columns was not practical.
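As a rough illustration of data travelling in an attribute, here is a base-R sketch on toy data. The attribute name `"users"`, the `get_users()` accessor, and the assumption that row i of the users table describes the author of tweet i are illustrative; in rtweet itself the supported accessor is users_data().

```r
# Toy stand-in for a tweets data frame; real rtweet output has more columns.
tweets <- data.frame(
  id_str = c("1", "2"),
  text = c("hello", "world"),
  stringsAsFactors = FALSE
)

# Assumption: user information rides along as an attribute of the result
# (the attribute name "users" is used here for illustration).
attr(tweets, "users") <- data.frame(
  id_str = c("100", "200"),
  screen_name = c("alice", "bob"),
  stringsAsFactors = FALSE
)

# Hypothetical accessor playing the role that users_data() plays in rtweet.
get_users <- function(x) attr(x, "users", exact = TRUE)

# Assuming row i of the users table describes the author of tweet i,
# a flat view can be rebuilt by position when it is needed.
flat <- tweets
flat$screen_name <- get_users(tweets)$screen_name
flat$screen_name
```

The visible data frame stays small, while the user details remain available to anyone who asks for them.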

I don't plan to change the columns or how the information is returned anytime soon.
Perhaps it needs its own class and/or printing method. I might add some functions to access some of the nested lists within the object, but I am not going back to one big 73-column data.frame. I understand that this is one more breaking change, but I think that in general it makes working with the output easier.

@mkearney
Collaborator

These changes look great! It's a good idea [and more sustainable] to more closely mirror actual API data structures. I'm excited to see [and hopefully contribute again to] future changes in the pkg as well! Thank you @llrs and @hadley for all the hard but excellent work!

@llrs
Collaborator

llrs commented Oct 21, 2021

Hi @mkearney, many thanks for your encouraging words. Sorry for the surprise when you installed the development version of the package and it broke your scripts. That was not my intention when I offered to help maintain the package.

I am aware that changing the column names will break scripts and other packages; that's one of the reasons why it will take some more time before I think about sending the package to CRAN. There may be one more breaking change before then, as we were considering renaming the functions. Additionally, there are still some bugs we have introduced that I would like to fix, and I want to make it easier to transition from version 0.7.0 to this one. One way might be to add some helpers to extract columns and rename them. All feedback is welcome, especially if we made something harder or you have other suggestions to improve the package.
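A transition helper of the kind mentioned here could look roughly like the following base-R sketch. The `as_legacy_columns()` function and the name pairs in `old_names` are hypothetical placeholders, not the package's real mapping.

```r
# Hypothetical mapping from new column names to the names an older script
# expects. The pairs below are placeholders, not the package's real mapping.
old_names <- c(id_str = "status_id", full_text = "text")

# Hypothetical helper: keep only the mapped columns and rename them.
as_legacy_columns <- function(df, mapping) {
  present <- intersect(names(mapping), names(df))
  out <- df[, present, drop = FALSE]
  names(out) <- unname(mapping[present])
  out
}

# Toy data in the "new" layout.
new_style <- data.frame(
  id_str = c("1", "2"),
  full_text = c("hello", "world"),
  lang = c("en", "en"),
  stringsAsFactors = FALSE
)

legacy <- as_legacy_columns(new_style, old_names)
names(legacy)
```

Keeping the mapping in a plain named vector means it can be extended one column at a time as users report what their old scripts rely on.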

While we've tried to mirror the actual API data structure more closely, one of the main reasons for the package's success is the flattening of the data it does. There is still some work to do on that front, as some of the columns now have a nested structure (and save_as_csv and write_as_csv have not been adapted yet), but flattening is important for analysis and we'll keep it.

@mkearney
Collaborator

You don't need to apologize, @llrs! You've been doing all of the [great] work here, so the last thing you should worry about is my old scripts working. It seems you've been very thoughtful about everything, so in the meantime, I'll try to give this (how to ease the transition for users) some thought and see if I can't help make this happen!

@llrs
Collaborator

llrs commented Nov 17, 2021

@mkearney I've sent you an email to the address listed in the description, but I'm not sure if you still receive them...
In case you no longer do, I wanted you to know that some users recently asked for improvements to rtweet and suggested holding a day-long rtweet hackathon to squash bugs and build in support for the new Twitter API v2.

I suggested holding the hackathon on the 27th of November or the 4th of December. See this thread on the package-maintenance Slack channel of rOpenSci. What do you think? Would you like to join the conversation there?

@llrs llrs closed this as completed Aug 30, 2022