
404 but website exists when running in parallel #15

Open · MichaelChirico opened this issue Jul 4, 2021 · 5 comments · Labels: bug (an unexpected problem or unintended behavior)
@MichaelChirico

As seen at r-lib/lintr#828:

https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp triggers a 404, but I have no issue navigating there in either Firefox or Chrome (also tested in Chrome Incognito). I'm not sure what to make of it.

@MichaelChirico (Author) commented Jul 4, 2021

It seems to be an issue with running in parallel: urlchecker::url_check(parallel = FALSE) passes.

@MichaelChirico changed the title from "404 but website exists" to "404 but website exists when running in parallel" on Jul 4, 2021
@jimhester (Member) commented Jul 6, 2021

I get a 404 from command-line curl as well; it seems that website doesn't properly support HEAD requests.

curl -I 'https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp'
HTTP/2 404

When parallel = FALSE, the code uses R's built-in curlGetHeaders() function:

tryCatch(curlGetHeaders(u), error = identity)

But actually I get a 404 from that one as well.

curlGetHeaders("https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp")
 [1] "HTTP/2 404 \r\n" 

So I am not sure why this only shows up in one of the two cases; possibly it is a bug in the way the output is handled in the tools package?
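
For reference, the HEAD-vs-GET difference can be reproduced directly with the curl package (which the parallel code path uses via its pool argument). A minimal sketch, assuming the server still behaves as described above:

library(curl)
u <- "https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp"

# Headers-only request (no body), like curl -I / curlGetHeaders():
curl_fetch_memory(u, handle = new_handle(nobody = TRUE))$status_code
#> [1] 404

# Plain GET request:
curl_fetch_memory(u)$status_code
#> [1] 200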

@MichaelChirico (Author) commented Jul 6, 2021

I am also getting a 404 from curl -I and curlGetHeaders() 🤔 but still no error from url_check(parallel = FALSE):

# Trace url_check() so check_url_db() runs on just the visualstudio.com
# row of the URL db, then dump the raw result.
trace(urlchecker::url_check, at = 3L, quote({
    res <- tools$check_url_db(db[grepl("visualstudio", db$URL), ],
        parallel = parallel, pool = pool, verbose = progress)
    dput(res)
    cat("Done\n")
}))

Then:

urlchecker::url_check(parallel = FALSE, progress = FALSE)
# structure(list(URL = character(0), From = list(), Status = character(0), 
#     Message = character(0), New = character(0), CRAN = character(0), 
#     Spaces = character(0), R = character(0)), row.names = integer(0), class = c("check_url_db", 
# "data.frame"))
# Done

vs. with parallel = TRUE:

urlchecker::url_check(parallel = TRUE, progress = FALSE)
# structure(list(URL = "https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp", 
#     From = list(`https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp` = "README.md"), 
#     Status = "404", Message = "Not Found", New = "", CRAN = "", 
#     Spaces = "", R = ""), row.names = c(NA, -1L), class = c("check_url_db", 
# "data.frame"))
# Done

Same result when tracing to use tools:::check_url_db() instead:

trace(urlchecker::url_check, at = 3L, quote({
    res <- tools:::check_url_db(db[grepl("visualstudio", db$URL), ])
    dput(res)
    cat("Done\n")
}))
urlchecker::url_check(progress = FALSE)
# structure(list(URL = character(0), From = list(), Status = character(0), 
#     Message = character(0), New = character(0), CRAN = character(0), 
#     Spaces = character(0), R = character(0)), row.names = integer(0), class = c("check_url_db", 
# "data.frame"))
# Done
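
(One practical note: the trace persists for the session, so remove it after experimenting.)

untrace(urlchecker::url_check)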

@MichaelChirico (Author)

OK, I see now... tools:::check_url_db() runs .check_http_A(), which does get a 404 from curlGetHeaders().

But then it follows up by running .curl_GET_status():

https://github.com/wch/r-source/blob/a4efc0c972d4aede0258348fd7ed6b0d7b27dd32/src/library/tools/R/urltools.R#L505

That call goes on to succeed. Why it succeeds is beyond me (setting cookies, maybe?):

https://github.com/wch/r-source/blob/a4efc0c972d4aede0258348fd7ed6b0d7b27dd32/src/library/tools/R/urltools.R#L786-L815
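
So the serial logic is roughly this (a simplified sketch, not the actual tools code; check_one_url() is a made-up name, and curl::curl_fetch_memory() stands in for whatever .curl_GET_status() does):

check_one_url <- function(u) {
    # First a headers-only check, as .check_http_A() does.
    h <- tryCatch(curlGetHeaders(u), error = identity)
    status <- if (inherits(h, "error")) NA_integer_ else attr(h, "status")
    if (!identical(status, 200L)) {
        # Anything but 200 triggers a follow-up GET, and that result wins.
        status <- tryCatch(curl::curl_fetch_memory(u)$status_code,
                           error = function(e) status)
    }
    status
}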

@gaborcsardi (Member)

What happens is that the base R function retries with a GET request for every URL that did not return 200, and if the GET returns 200, that result is used. We should probably do the same in urlchecker.
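
For the parallel path, that could look something like the rough sketch below, using the curl multi interface (these names are illustrative, not urlchecker's actual internals): issue a headers-only request per URL, and on a non-200 response re-queue the same URL as a plain GET before recording a failure.

library(curl)

check_urls <- function(urls) {
    pool <- new_pool()
    status <- setNames(rep(NA_integer_, length(urls)), urls)
    retry_get <- function(u) {
        # Fall back to a full GET; its status wins.
        curl_fetch_multi(u, pool = pool,
            done = function(res) status[[u]] <<- res$status_code,
            fail = function(msg) status[[u]] <<- NA_integer_)
    }
    for (u in urls) local({
        u <- u  # freeze the loop variable for the callbacks below
        curl_fetch_multi(u, handle = new_handle(nobody = TRUE), pool = pool,
            done = function(res) {
                if (res$status_code == 200L) status[[u]] <<- 200L
                else retry_get(u)  # e.g. servers that 404 on HEAD
            },
            fail = function(msg) retry_get(u))
    })
    multi_run(pool = pool)
    status
}

check_urls("https://marketplace.visualstudio.com/items?itemName=REditorSupport.r-lsp")
#> expected to report 200 rather than 404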

@gaborcsardi added the bug (an unexpected problem or unintended behavior) label Mar 14, 2022