Question: Bulk upsert - adding new fields rather than replacing entire document #169
Comments
thanks for your question @iainmwallace, and glad you like the pkg. There is a way to do updates using the bulk API (https://www.elastic.co/guide/en/elasticsearch/reference/5.2/docs-bulk.html#bulk-update). There are three possible inputs to docs_bulk(): a list, a data.frame, or a file.
For the third option (a file), no problem: the user just has to create the file manually and say what operation they want to do with each row. The caveat is that for data.frame or list input we don't currently support updates in any way. I can try to add support for data.frames, but I can't promise it will be incorporated if it doesn't fit.
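For example, a bulk update file pairs an action line with a partial-document line for each row; roughly like this (the index, type, id, and field names here are just placeholders):
{"update": {"_index": "test", "_type": "test", "_id": "a"}}
{"doc": {"my_number": 1}, "doc_as_upsert": true}
|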
Thanks! I will try to write my own JSON files and use the docs_bulk option to upload them. Adding support to add new fields from a data.frame if they don't exist, and update them if they do, would enable what I imagine is a pretty common workflow: store info from database query A about object type 1 in Elasticsearch. |
will look into it |
haven't forgotten about this, still getting around to it |
Great! In case it is useful to others, this is how I created custom JSON files for upserting into Elasticsearch to do this workflow.

library(dplyr)
library(jsonlite)
library(tidyr)
library(purrr)
library(pbapply)
map_header <- function(x) {
  header <- list(update = list(
    "_index" = "my_index",
    "_type" = "my_id",
    "_id" = x
  ))
  header <- jsonlite::toJSON(header, auto_unbox = TRUE)
  return(header)
}
map_body <- function(x) {
  # create property "my_dataset"
  my_doc <- list(
    my_dataset = x
  )
  json_doc <- jsonlite::toJSON(list(doc = my_doc, doc_as_upsert = TRUE), auto_unbox = TRUE)
  return(json_doc)
}
create_json_body <- function(my_id_subset, my_dataset, tmp_file = "tmp_elastic_files_") {
  # Create a document for each row in the dataset, limited to specific rows
  # my_dataset = dataset to load into Elasticsearch
  # my_id_subset = list of ids
  my_small_dataset <- my_dataset %>% filter(id_column %in% my_id_subset)
  my_tmp_file <- tempfile(pattern = tmp_file, fileext = ".json")
  tmp_table <- my_small_dataset %>%
    nest(-id_column) %>%
    mutate(body = map(data, map_body)) %>%
    mutate(header = map(id_column, map_header)) %>%
    mutate(combined = paste0(header, "\n", body))
  write(tmp_table$combined, file = my_tmp_file)
  print(my_tmp_file)
}

Example

my_dataset <- data_frame(id_column = letters[1:26], value1 = runif(26), value2 = runif(26))
my_ids <- unique(my_dataset$id_column)
x <- split(my_ids, ceiling(seq_along(my_ids) / 10)) # change based on how many documents per json file
pblapply(x, create_json_body, my_dataset)
files <- list.files(tempdir(), pattern = "tmp_elastic_files_")
for (i in seq_along(files)) {
  cat(i, "\n")
  invisible(
    docs_bulk(
      paste0(tempdir(), "/", files[i])
    )
  )
} |
thanks for that |
@iainmwallace putting this off to the next milestone, but there is some work on a different branch. Install like
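so (a sketch, assuming the bulk-update branch mentioned further down; the exact ref may differ):
devtools::install_github("ropensci/elastic", ref = "bulk-update")
library(elastic)
|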
any thoughts @iainmwallace ? |
Hi Scott,
Just tried, but had an issue finding the docs_bulk_update function discussed in this commit: 12dcb92
I have a brief write-up on what I did here: http://www.iainmwallace.com/2018/01/21/elasticsearch-and-r/
Is there something else I should be doing?
Cheers,
Iain
|
@iainmwallace sorry about that, just updated the branch https://github.com/ropensci/elastic/tree/bulk-update - I had forgotten to update NAMESPACE and make the man file for it. Try again after reinstalling from that branch. |
Thanks - I can now see the function.
When I try to run the following code, I am not able to update the data due to the index being read-only. Is there a setting somewhere that I need to change when creating the index?
library(elastic)
connect(es_port = 9200)
df <- data.frame(name = letters[1:3], size = 1:3, id = 100:102)
index_create('test')
docs_bulk(df, 'test', 'foobar', es_ids = FALSE)
df2 <- data.frame(size = c(45, 56), id = 100:101)
docs_bulk_update(df2, index = 'foobar', type = 'foobar')
Updating results in an error:
[[1]]$items[[2]]$update$error
[[1]]$items[[2]]$update$error$type
[1] "cluster_block_exception"
[[1]]$items[[2]]$update$error$reason
[1] "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
|
Thanks - the issue was that my disk was nearly full, causing Elasticsearch to force all indices to be read-only (a flood stage watermark; more details available here: https://www.elastic.co/guide/en/elasticsearch/reference/6.x/disk-allocator.html).
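In case it helps others, once disk space is freed, clearing that block looks roughly like this (a sketch hitting the REST API directly with httr; host and port are whatever your cluster uses):
library(httr)
# remove the read-only block that the flood-stage watermark put on all indices
PUT("http://localhost:9200/_all/_settings",
    body = '{"index.blocks.read_only_allow_delete": null}',
    content_type_json())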
The update looks great! Works as expected. The only small suggestion is that the warning when the id column is missing incorrectly states that '_id' must be present, when it should be just 'id':
Error in docs_bulk_update.data.frame(df3, index = "test2", type = "foobar") :
  data.frame must have a column "_id" or pass doc_ids
An equally small suggestion: it would be useful to have examples of how to pass additional parameters through the functions. For example, I wasn't able to figure out how to pass the read_only parameter through the index_create function.
hope that helps :)
…On Mon, Jan 22, 2018 at 2:23 PM, Scott Chamberlain wrote:
possibly this https://stackoverflow.com/questions/34911181/how-to-undo-setting-elasticsearch-index-to-readonly/34911897#34911897 same here https://discuss.elastic.co/t/forbidden-12-index-read-only-allow-delete-api/110282/4
|
Glad you sorted out the problem, and that the function works. Thanks for the suggestions.
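On the index_create() question: additional settings can be passed via its body argument; a rough sketch (the read_only block is just an example setting):
index_create("mynewindex", body = '{
  "settings": {
    "index.blocks.read_only": true
  }
}')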
|
@iainmwallace I think it's done now, merged into master, so you can reinstall and give it a try. |
Any feedback is good. It only supports data.frames for now. Added another example, so there's one for adding new rows and one for adding new columns; both are the same operation really.
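For instance, adding a brand-new column to already-indexed documents is roughly this (a sketch reusing the test/foobar index from the code above; the data.frame needs an id column or doc_ids):
df3 <- data.frame(id = 100:102, new_col = c("x", "y", "z"))
docs_bulk_update(df3, index = "test", type = "foobar")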
Hi,
Wonderful package! I'm new to Elasticsearch, but was wondering if it is possible to do a bulk upsert?
I want to add extra fields to documents that are already present.
For example, if I store the following
x<-tibble(id=letters[1:3],my_letter=LETTERS[1:3])
docs_bulk_prep(x,"test",path = tempfile(fileext = ".json"),doc_ids=x$id)
docs_bulk()
I get this as a document
{"_index":"test","_type":"test","_id":"a","_version":1,"found":true,"_source":{"id":"a","my_letter":"A"}}
I want to append a new field "my_number". I naively repeated the process, but with a different column in the data frame:
x<-tibble(id=letters[1:3],my_number=1:3)
but my new document replaced the existing one:
{"_index":"test","_type":"test","_id":"a","_version":2,"found":true,"_source":{"id":"a","my_number":1}}
Is there an efficient way I could approach this?
thanks
Iain