Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion for discussion about conversion from result to data.frame / tibble #247

Open
rkrug opened this issue May 8, 2024 · 7 comments

Comments

@rkrug
Copy link

rkrug commented May 8, 2024

I apologise in advance for the length of this post, but I think it deserves (and needs) the space.

Thanks a lot for a great package - I really love it and it works (most off the time - I will come back to this), is reliable, no crashes, perfect.

Thanks a lot for this and for updating it regularly.

But after the last changes on OpenAlex side, the reports of problems I the conversion from the returned values from OpenAlex into a data.frame become more and more, and for me things broke down due to broken backward compatibility.

In the same way tat OpenAlex is growing more and more in data and usage, the usage of openalexR is increasing as well, and initially not envisaged use cases are cioming u.. Therefore I think that the conversion of the return values to a data.frame should be fixed in a

  1. stable to changes in OpenAlex (robust)
  2. backward compatible (if I want to, get the same structure back even when OpenAlex changes its data structures)
  3. flexible (e.g. can specify fields to return, which helps in 2.) and
  4. easy to maintain way.

I think that unfortunately, this is not the case at the moment (which is perfectly understandable given the scope of the package).

My suggestion would be to start a discussion on how a rewrite of the conversion functions could be done which fulfills a list of requirements.

I would start with the following requirements which I see as important:

  1. the functions should be stable according to changes in OpenAlex and its data structure. This means for me to return a consistent data structure, if fields are removed from OpenAlex NA values should be returned and a warning issued.
  2. all structures should be returned. E.g. in the case of authors, the structure underlying authors should be returned, even if OpenAlex does not contain the value for that entity (or should this be oimplemented in the OpenAlex API?)
  3. the conversion should be flexible, i.e. one should be able to specify fields one wants to keep. In the same sense, if one selects the fields in oa_request() (or oa_fetch_()), and necessary fields are not included in the results to fill the requested data frame, the resulting data.frame should contain NA in these fields, and issue a warning.
  4. It should be reasonably fast and memory efficient (probably caching to disc? - which is something which could also be discussed for oa_request()`)
  5. it should be easy to maintain (modular?) and
  6. all future changes should be backwards compatible when the fields to return are specified. In other words: Without fields to return specified, the ones from OpenAlex are returned, while when these are specified, the exact structure is returned (I think there have to be exceptions, while writing this...)

A conversion fulfilling these requirements, would be extremely valuable, and make the support burden much lighter (I have the impression, that most of the Issues raised del with the conversion).

I would be happy to help with this (at least certain aspects), but I would like to have some consensus in which direction to move, and an agreed framework in which the conversion function(s) would fit.

With your permission, I would be happy to post this in the rOpenSci Slack channel to get a broader input.

Thanks a lot for your patience of reading through this,

Rainer (a very happy user of openalexR)

@trangdata
Copy link
Collaborator

trangdata commented May 10, 2024

Thank you Rainer for putting this together and phrasing everything in such a kind way. As always, I appreciate your insights as a power user of openalexR!

My first thoughts:

  • We need to find a balance between progress and backward compatibility. At this stage, I don't think we should prioritize backward compatibility. OpenAlex is still in its early days, and a bunch of things are still getting deprecated. I think it is essential that openalexR stays consistent with OpenAlex's schema.
  • oa_fetch(output = "list") should work independent of oa2df() — the user can always opt for this option if they encounter error in our oa2df() function. We would love to have your contribution to a vignette on this.
  • caching is a good practice but
    • beyond the scope of openalexR,
    • especially given that most query results change every day
    • caching should be done with caution and thus should be the user's responsibility to do so.
  • It's good to keep in mind that oa2df() was first designed as a simplifying function. That was why we made decisions on what columns are most useful for a typical user to keep in the final dataframe. If a user really want to keep everything that is returned, I would advise them to use output = "list" in their oa_fetch call. Nonetheless, I agree we can make it more robust and perhaps more flexible upon the user's need (e.g. via "select").
  • The nested structure of the returned entity can make it difficult to keep nested fields as-is. We may still need to make decisions on what is good to keep.

That said, I agree a major update to v2.0.0 is warranted for openalexR to be more robust to future changes in the API.

@rkrug
Copy link
Author

rkrug commented May 10, 2024

Dear Trang

always happy to help - and I really mean what I said.

This sounds like a useful discussion!

I re-ordered your thoughsts a bit, as I think they can be grouped and probably one can find a way forward more easily.

OK - the first complex concerns the aim of os2df() as well as using output = list an approach.

  • We need to find a balance between progress and backward compatibility. At this stage, I don't think we should prioritize backward compatibility. OpenAlex is still in its early days, and a bunch of things are still getting deprecated. I think it is essential that openalexR stays consistent with OpenAlex's schema.
  • It's good to keep in mind that oa2df() was first designed as a simplifying function. That was why we made decisions on what columns are most useful for a typical user to keep in the final dataframe. If a user really want to keep everything that is returned, I would advise them to use output = "list" in their oa_fetch call. Nonetheless, I agree we can make it more robust and perhaps more flexible upon the user's need (e.g. via "select").
  • oa_fetch(output = "list") should work independent of oa2df() — the user can always opt for this option if they encounter error in our oa2df() function. We would love to have your contribution to a vignette on this.

I agree that the beauty of openalex is its simplicity, and there is no way that it should loose that. But it should be open to also accommodate for more advanced use cases, as e.g. compiling the same report over a long period of time (e.g. years) on a monthly basis, or dealing with a huge number of works to download (I will not address this, but this is another point which should be discussed imho).

I am starting to look at the backward compatibility <-> simplicity for the user. I could envisage a function, which

  1. is when using the default arguments arguments doing exactly this - selected fields are returned, no backward compatibility is guaranteed, changes in OA are reflected (as you say - OpenAlex is still changing a lot!)
  2. providing additional arguments, is backwards compatible concerning the structure, and missing fields are filled in with NA. This could be done by offering one default structures, which could be used as a starting point to devise own structures.

But the more I think about it, the more complex it gets. Maybe it would be enough, to expose robust conversion functions and an example how a complete conversion function can be assembled (this could be the works2df()) which the user can use to assemble to their own conversion function which results in the needed structure and is backward compatible?

What would be in both cases essential, is that the structures stay the same - so e.g. of not authors are available, a data.frame with all fields set to NA is returned, and not NA (which is converted to a logical(NA) based on R logic...).

Have you considered creating graphs (I love plantuml and have written a small package to create this in R (https://github.com/rkrug/plantuml/)), which illustrate which fields fro OpenAlex are converted to the oa2df() function? This could make much clearer what is converted, which fields renamed, what is not exported, etc. This could also include the conversion function names used for e.g. the author field in works.

  • caching is a good practice but
    • beyond the scope of openalexR,
    • especially given that most query results change every day
    • caching should be done with caution and thus should be the user's responsibility to do so.

Sorry for not being clear here - with caching I mean for example, when a large number of works need to be downloaded, the individual pages are saved to disk. The conversion would then read one after the other, and save them again on disk. In a last step, these would be concatenated.

I completely agree that the caching of the API responses is completely out of the scope of openalexR.

  • The nested structure of the returned entity can make it difficult to keep nested fields as-is. We may still need to make decisions on what is good to keep.

Sorry - I do not understand your point here. Nested structures can obviously easily be created and used in R, and also saved as rds files or, in my opinion the perfect format parquet files as these complex / nested data structures can be saved directly and accessed on a database like approach (through duckdb). Decisions need to be taken, but if certain nested structures should be included, could be specified by using arguments for the function?

That said, I agree a major update to v2.0.0 is warranted for openalexR to be more robust to future changes in the API.

I think that if this can be achieved, and some lower level conversion functions can be exported so that one can build easily own conversion functions, a lot would be achieved (and there are a few other functions which should be exported imho).

Good points, and openalexR2 should include a few more changes (e.g. httr2).

@rkrug
Copy link
Author

rkrug commented May 10, 2024

Oh - and I am using output = "list", but also I am not doing the conversion myself. But I think that would be a great vignette. But at the moment, the conversion is too complex to be put into a vignette (and which is why I am still using works2df(). I am also using a modified oa_fetch() function which saves each page (coro threw an error after the last update and debugging coro sounds like a mission which I wanted to avoid under time pressure). Please see https://github.com/IPBES-Data/IPBES.R/blob/5a012eec2a60cbb31d12f060c3d3f08216264046/R/corpus_pages_to_arrow.r#L96C1-L238 to what I am doing to make the structures consistent and backward compatible (https://github.com/IPBES-Data/IPBES.R/blob/5a012eec2a60cbb31d12f060c3d3f08216264046/R/corpus_pages_to_arrow.r#L186-L188) and https://github.com/rkrug/openalexR/blob/7a891dfb4397b72f47686bb375f40915a139657d/R/oa_fetch.R#L331 for my modified oa_request().

I would be happy to contribute vignettes.

@massimoaria
Copy link
Collaborator

Thank you Rainer for putting this together and phrasing everything in such a kind way. As always, I appreciate your insights as a power user of openalexR!

My first thoughts:

  • We need to find a balance between progress and backward compatibility. At this stage, I don't think we should prioritize backward compatibility. OpenAlex is still in its early days, and a bunch of things are still getting deprecated. I think it is essential that openalexR stays consistent with OpenAlex's schema.

  • oa_fetch(output = "list") should work independent of oa2df() — the user can always opt for this option if they encounter error in our oa2df() function. We would love to have your contribution to a vignette on this.

  • caching is a good practice but

    • beyond the scope of openalexR,
    • especially given that most query results change every day
    • caching should be done with caution and thus should be the user's responsibility to do so.
  • It's good to keep in mind that oa2df() was first designed as a simplifying function. That was why we made decisions on what columns are most useful for a typical user to keep in the final dataframe. If a user really want to keep everything that is returned, I would advise them to use output = "list" in their oa_fetch call. Nonetheless, I agree we can make it more robust and perhaps more flexible upon the user's need (e.g. via "select").

  • The nested structure of the returned entity can make it difficult to keep nested fields as-is. We may still need to make decisions on what is good to keep.

That said, I agree a major update to v2.0.0 is warranted for openalexR to be more robust to future changes in the API.

  • Some column names for works like "so" are artifacts from scientometrics/bibliometrics. I do think changing these names back to OpenAlex's field names (or some consistent variants) makes sense. CC @massimoaria

My idea is to declare the function oa2bibliometrix() deprecated.
This is because we are adding support for OpenAlex directly in the bibliometrix library. Both for files downloaded via the webpage and for lists exported with openalexR.
Users wishing to analyse data with bibliometrix will be able to take advantage of this support directly through the bibliometrix library.
The removal of oa2bibliometrix() will allow us full freedom in adopting column names that we consider more consistent with the OA database.

@rkrug
Copy link
Author

rkrug commented May 13, 2024

One point which came to mind: It might be useful to define the different outputs (list, data.frame, for authors, works, etc.) as S3 classes, which would make writing build-on functions to the package easier. S3 would be the easiest.

@trangdata
Copy link
Collaborator

My idea is to declare the function oa2bibliometrix() deprecated.

Sounds great I'm happy with starting a deprecation plan for oa2bibliometrix.

One point which came to mind: It might be useful to define the different outputs (list, data.frame, for authors, works, etc.) as S3 classes, which would make writing build-on functions to the package easier. S3 would be the easiest.

I think we're overcomplicating the package a bit here. openalexR is simply an API client that helps build queries and returns a list or dataframe/tibble as a result. Therefore, I don't see build-on functions as in-scope for the package.

@rkrug
Copy link
Author

rkrug commented May 13, 2024

One point which came to mind: It might be useful to define the different outputs (list, data.frame, for authors, works, etc.) as S3 classes, which would make writing build-on functions to the package easier. S3 would be the easiest.

I think we're overcomplicating the package a bit here. openalexR is simply an API client that helps build queries and returns a list or dataframe/tibble as a result. Therefore, I don't see build-on functions as in-scope for the package.

I do not see any complications on the openalexR by simply adding e.g. class(result) <- append("oa_works_df, class(result) before result is returned from works2df(). Only advantages for internal checking as well of ease for users to build on openalexR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants