
Question: Bulk upsert - adding new fields rather than replacing entire document #169

Closed
iainmwallace opened this issue Mar 7, 2017 · 16 comments · Fixed by #210

@iainmwallace

Hi,

Wonderful package! I'm new to Elasticsearch, but was wondering whether it is possible to do a bulk upsert?
I want to add extra fields to documents that are already present.

For example, if I store the following:

x <- tibble(id = letters[1:3], my_letter = LETTERS[1:3])
f <- docs_bulk_prep(x, "test", path = tempfile(fileext = ".json"), doc_ids = x$id)
docs_bulk(f)

I get this as a document:

{"_index":"test","_type":"test","_id":"a","_version":1,"found":true,"_source":{"id":"a","my_letter":"A"}}

I want to append a new field, "my_number". I naively repeated the process with a different column in the data frame:

x <- tibble(id = letters[1:3], my_number = 1:3)

but the new document replaced the existing one:

{"_index":"test","_type":"test","_id":"a","_version":2,"found":true,"_source":{"id":"a","my_number":1}}

Is there an efficient way I could approach this?

thanks

Iain

sckott commented Mar 8, 2017

thanks for your question @iainmwallace and glad you like the pkg

so there is a way to do updates using the bulk API https://www.elastic.co/guide/en/elasticsearch/reference/5.2/docs-bulk.html#bulk-update

there are 3 possible inputs to docs_bulk

  • data.frame
  • list
  • file path

for the 3rd option, no problem: the user just creates the file manually and specifies what operation to do with each row (the caveat is that files created with docs_bulk_prep currently support only the index operation). So you can do updates right now if you create your files manually; see the sketch below.
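
For instance, a minimal bulk update file in the bulk API's newline-delimited format could look like this, reusing the index/type/id from the example above; doc_as_upsert tells Elasticsearch to create the document if it doesn't exist, and to merge the given fields into it if it does:

{"update": {"_index": "test", "_type": "test", "_id": "a"}}
{"doc": {"my_number": 1}, "doc_as_upsert": true}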

for data.frame or list input, docs_bulk currently supports only the index operation. We could potentially add a way to either:

  1. pass in a vector of length NROW(data.frame) or length(list) giving the operation (index/create/update/delete) for each row or list chunk, or
  2. allow a designated field in the data.frame or list that holds the operation (index/create/update/delete) for each row.

But as you can see on the docs page, an operation isn't always just a single string; some operations take additional parameters, which makes this more complicated.
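
Purely hypothetical sketches of those two interfaces, just to make the options concrete (neither the action nor the action_field argument exists in docs_bulk; both are invented here for illustration):

# x as in the example at the top of this issue
x <- data.frame(id = letters[1:3], my_number = 1:3)

# option 1 (hypothetical): one operation per row, passed as a vector
docs_bulk(x, "test", doc_ids = x$id, action = c("index", "update", "update"))

# option 2 (hypothetical): operation stored in a designated column
x$es_action <- c("index", "update", "update")
docs_bulk(x, "test", doc_ids = x$id, action_field = "es_action")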

anyway, I can try to add support for data.frames, but I can't promise that it will be incorporated if it doesn't fit

@iainmwallace
Copy link
Author

Thanks! I will try writing my own JSON files and using docs_bulk to upload them.

Adding support to add new fields from a data.frame if they don't exist, and update them if they do, would enable what I imagine is a pretty common workflow:

  1. Store info from database query A about object type 1 in Elasticsearch.
  2. Store other info from database query B about object type 1 in Elasticsearch.
  3. Users then query the Elasticsearch instance.

sckott commented Mar 8, 2017

Adding support to add new fields from a data.frame if they don't exist, and update them if they do

will look into it

@sckott sckott modified the milestone: v0.8 Apr 19, 2017
sckott commented Apr 29, 2017

haven't forgotten about this, still getting around to it

iainmwallace commented May 5, 2017

Great!

In case it is useful to others, this is how I created custom JSON files for upserting into Elasticsearch to implement this workflow.

library(dplyr)
library(jsonlite)
library(tidyr)
library(purrr)
library(pbapply)

map_header <- function(x) {
  # action line: update the document with this id in my_index/my_id
  header <- list(update = list(
    "_index" = "my_index",
    "_type" = "my_id",
    "_id" = x
  ))
  jsonlite::toJSON(header, auto_unbox = TRUE)
}

map_body <- function(x) {
  # source line: nest the row's fields under a "my_dataset" property;
  # doc_as_upsert creates the document if it doesn't exist yet
  my_doc <- list(my_dataset = x)
  jsonlite::toJSON(list(doc = my_doc, doc_as_upsert = TRUE), auto_unbox = TRUE)
}

create_json_body <- function(my_id_subset, my_dataset, tmp_file = "tmp_elastic_files_") {
  # write one bulk file covering the rows of my_dataset whose
  # id_column value is in my_id_subset
  my_small_dataset <- my_dataset %>% filter(id_column %in% my_id_subset)
  my_tmp_file <- tempfile(pattern = tmp_file, fileext = ".json")

  tmp_table <- my_small_dataset %>%
    nest(-id_column) %>%
    mutate(body = map(data, map_body)) %>%
    mutate(header = map(id_column, map_header)) %>%
    mutate(combined = paste0(header, "\n", body))

  write(tmp_table$combined, file = my_tmp_file)
  print(my_tmp_file)
}

Example:

my_dataset <- tibble(id_column = letters[1:26], value1 = runif(26), value2 = runif(26))

my_ids <- unique(my_dataset$id_column)
# chunks of 10 ids; change based on how many documents you want per json file
x <- split(my_ids, ceiling(seq_along(my_ids) / 10))
pblapply(x, create_json_body, my_dataset)

files <- list.files(tempdir(), pattern = "tmp_elastic_files_", full.names = TRUE)
for (i in seq_along(files)) {
  cat(i, "\n")
  invisible(docs_bulk(files[i]))
}
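
After loading, you can spot-check that fields were merged rather than replaced. docs_get() fetches a single document; the index and type here match the values hard-coded in map_header above:

docs_get(index = "my_index", type = "my_id", id = "a")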

sckott commented May 5, 2017

thanks for that

@sckott sckott added the bulk label Aug 30, 2017
@sckott sckott modified the milestones: v0.8, v0.9 Sep 12, 2017
sckott added a commit that referenced this issue Sep 12, 2017
sckott commented Sep 12, 2017

@iainmwallace putting this off to the next milestone, but there's some work on a different branch. Install it with devtools::install_github("ropensci/elastic@bulk-update") and let me know what you think.

sckott commented Jan 21, 2018

any thoughts, @iainmwallace?

iainmwallace commented Jan 21, 2018 via email

sckott commented Jan 21, 2018

@iainmwallace sorry about that, just updated the branch https://github.com/ropensci/elastic/tree/bulk-update. I had forgotten to update NAMESPACE and make the man file for it.

try again after reinstalling from that branch

iainmwallace commented Jan 22, 2018 via email

iainmwallace commented Jan 22, 2018 via email

sckott commented Jan 23, 2018

Glad you sorted out the problem, and that the function works.

Thanks for the suggestions:

  • fix the warning, see above
  • more examples for index_create, e.g., to pass the read_only parameter

sckott commented Jan 30, 2018

@iainmwallace I think it's done now and merged into master, so you can devtools::install_github("ropensci/elastic") to get the latest.

sckott commented Jan 30, 2018

any feedback is good. It only supports data.frames for now. I added another example, so there's one for adding new rows and one for adding new columns; both are really the same operation.
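
For reference, a minimal sketch of the upsert workflow from the top of this thread, assuming the merged helper is docs_bulk_update() and that it mirrors docs_bulk()'s data.frame interface (check the package docs for the actual signature):

library(elastic)
connect()

# index the first set of fields
x <- data.frame(id = letters[1:3], my_letter = LETTERS[1:3], stringsAsFactors = FALSE)
docs_bulk(x, index = "test", type = "test", doc_ids = x$id)

# upsert a second set of fields onto the same documents;
# my_letter should be preserved and my_number added
y <- data.frame(id = letters[1:3], my_number = 1:3, stringsAsFactors = FALSE)
docs_bulk_update(y, index = "test", type = "test", doc_ids = y$id)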

@sckott sckott removed this from the v0.9 milestone Jun 20, 2018
@sckott sckott added this to the v0.8.4 milestone Jun 20, 2018