Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large values in data are not passed properly to Stan #1023

Closed
martinmodrak opened this issue Jun 17, 2021 · 17 comments · Fixed by #1034
Closed

Large values in data are not passed properly to Stan #1023

martinmodrak opened this issue Jun 17, 2021 · 17 comments · Fixed by #1034
Labels

Comments

@martinmodrak
Copy link

Describe the bug
When putting very large numbers (roughly around 1e9) into data, wrong walues may get passed to stan.

To Reproduce

m_data <- cmdstan_model(write_stan_file("
data {
  real x;
}

model {
  print(x);
}
                                   "))

m_data$sample(chains = 1, iter_warmup = 1, iter_sampling = 0, fixed_param = TRUE,
         data = list(x = 3e8))
# Prints 3e8

m_data$sample(chains = 1, iter_warmup = 1, iter_sampling = 0, fixed_param = TRUE,
         data = list(x = 3e9))
# Prints  -1.29497e+09 

Expected behavior

Print 3e9 in the second case

Operating system
Windows 10

CmdStanR version number
0.4.0.9000 (just installed from GitHub)

@jgabry
Copy link
Member

jgabry commented Jun 17, 2021

Wow, that's super weird! But this is actually a Stan (or CmdStan) issue and not an issue with CmdStanR passing the wrong value. I verified that the JSON file passed to CmdStan is correct and I also ran CmdStan itself (not via CmdStanR) and the same problem occurs. I'll transfer the issue to the CmdStan repo.

@jgabry jgabry transferred this issue from stan-dev/cmdstanr Jun 17, 2021
@jgabry jgabry added the bug label Jun 17, 2021
@jgabry
Copy link
Member

jgabry commented Jun 17, 2021

I just transferred this issue from CmdStanR to CmdStan because I verified that the same problem happens when using CmdStan directly without CmdStanR. But I'm not sure if this is a CmdStan bug or if it should be moved to stan-dev/stan.

@mitzimorris
Copy link
Member

perhaps it's an I/O error. is the input data for CmdStan in rdump or JSON format?
JSON doesn't distinguish between int and real, even so, 10^9 is should be w/in signed int range; OTOH, Windows might be using very short ints.

@jgabry
Copy link
Member

jgabry commented Jun 17, 2021

perhaps it's an I/O error. is the input data for CmdStan in rdump or JSON format?

The example provided by Martin uses CmdStanR so it was using JSON I/O (and I verified the JSON file was written correctly and also that the same problem occurs without CmdStanR just using CmdStan).

OTOH, Windows might be using very short ints.

This happens on my Mac and Martin's Windows computer so it doesn't seem to be related to the OS.

@mitzimorris
Copy link
Member

mitzimorris commented Jun 18, 2021

OK then, will investigate.

@martinmodrak
Copy link
Author

Thanks for looking into this so quickly and for the additional verification!

@rok-cesnovar
Copy link
Member

The problem is that 3e9 is written in JSON or R dump as 3000000000 and that is parsed as an integer. If I place

x <- 3000000000

in a .data.R I get an error that its beyond int range.
In JSON however this:

{
  "x" : 3000000000
}

gets parsed as an integer and then seems to get converted to a double in some bizarre way.

This however works fine

{
  "x" : 3e9
}

@jgabry
Copy link
Member

jgabry commented Jun 18, 2021

gets parsed as an integer and then seems to get converted to a double in some bizarre way.

Any idea why it's a problem with 3e9 but doesn't seem to be a problem with 3e8?

@rok-cesnovar
Copy link
Member

I am guessing it has to do with the fact that 3e8 can be represented as an integer, but 3e9 can not be.

@alyst
Copy link
Contributor

alyst commented Aug 10, 2021

Are the values as read by the JSON parser already incorrect?

@alyst
Copy link
Contributor

alyst commented Aug 22, 2021

It looks like RapidJSON supports JSON schema.
Maybe it's possible to generate SchemaDocument DOM for the Stan model data block and provide the Schema to the JSON parser?

@rok-cesnovar
Copy link
Member

Are the values as read by the JSON parser already incorrect?

Yeah, the logic is supposed to be that the parser first attempts parsing a number without a decimal point as an integer and once it recognizes the number is too big it parses its as a double. That seems to not be working as expected. I havent had the time to look into that. Will do ASAP.

Maybe it's possible to generate SchemaDocument DOM for the Stan model data block and provide the Schema to the JSON parser?

That is doable as we can get info on types on data variables from stanc3. I would avoid this for now but we can start to think about it if there is no solution for the above.

@alyst
Copy link
Contributor

alyst commented Aug 23, 2021

I've looked into the values that generate problems in my case.
For example

{"qData": [2245935000, 2418300000] }

But it looks like at least rapidjson master parses them well (as Uint on my machine). But then in cmdstan I get qData[sym1__] is -2.04903e+09 for the first value. And qData[sym1__] is -1.87667e+09 for the second one.

@mitzimorris
Copy link
Member

the logic is supposed to be that the parser first attempts parsing a number without a decimal point as an integer

adding a check that the sign is as expected would catch this problem, no?

@alyst
Copy link
Contributor

alyst commented Aug 23, 2021

For me it looks like the cause of the error is at json_data_handler.hpp#L228 when unsigned is converted to signed. For the numbers from the example above this leads to the flip of the sign.

@rok-cesnovar
Copy link
Member

Integers in Stan are always signed. I guess we should not let rapidjson parse them to uint as uint is useless for the Stan use case.

@alyst
Copy link
Contributor

alyst commented Aug 23, 2021

Integers in Stan are always signed. I guess we should not let rapidjson parse them to uint as uint is useless for the Stan use case.

IIUC the rapidjson code, it's not possible to change which handler would be called.
Unsigned number would be always processed by Uint[64] handlers. But the logic of the cmdstan handler could be changed.
Probably -- as it's already done for the reals -- the array has to be switched to values_r if the overflow would occur when converting unsigned to int.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants