
Running model without data returns error #544

Closed
yebai opened this issue Sep 15, 2018 · 31 comments
@yebai
Member

yebai commented Sep 15, 2018

using Turing

@model gdemo(x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x[1] ~ Normal(m, sqrt(s))
  x[2] ~ Normal(m, sqrt(s))
  return s, m
end

g = gdemo([1.5, 2]) # This works
g()           # returns: (0.8660317849113424, 0.820409281237518)

g = gdemo() # Returns an error.
g()

Here is the error message:

julia> g = gdemo()
┌ Warning: Data `x` not provided, treating as parameter instead.
└ @ Main ~/.julia/dev/Turing/src/core/compiler.jl:326
[ Info:  Assume - `s` is a parameter
[ Info:  Assume - `m` is a parameter
[ Info:  Assume - `x` is a parameter
gdemo_model (generic function with 4 methods)

julia> g()
ERROR: UndefVarError: x not defined
Stacktrace:
 [1] macro expansion at /Users/hg344/.julia/dev/Turing/src/core/compiler.jl:78 [inlined]
 [2] gdemo_model(::Turing.VarReplay.VarInfo, ::Turing.SampleFromPrior) at /Users/hg344/.julia/packages/MacroTools/4AjBS/src/utils.jl:297 (repeats 2 times)
 [3] top-level scope at none:0


@trappmartin
Member

I’ll have a look. Does it work if x is a scalar?

@yebai
Member Author

yebai commented Sep 15, 2018

I’ll have a look. Does it work if x is a scalar?

Probably not, since we are indexing x (i.e. x[1] ~ Normal).

@trappmartin
Member

This problem is related to the fact that we do not know the dimensions and type of x if we draw x from the prior.

Running the code as:

julia> using Turing, Distributions
julia> @model gdemo(x, y) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x ~ Normal(m, sqrt(s))
  y ~ Normal(m, sqrt(s))
  return x, y
end
julia> g = gdemo()
gdemo_model (generic function with 4 methods)
julia> g()
(0.685690547873451, -1.1972706455914328)

works fine with the current version of Turing.
@cpfiffer Could you please add the scalar version into the documentation?

However, ideally the user should be able to do something along the lines of

julia> @model gdemo1(N, x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x ~ [Normal(m, sqrt(s)) for n in 1:N]
  return x
end
julia> g = gdemo1(2)
julia> g() # currently doesn't work as we need x in assume, see below

or alternatively something like:

julia> @model gdemo2(x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  N = length(x)
  x[1] ~ Normal(m, sqrt(s))
  x[2] ~ Normal(m, sqrt(s))
  return x
end
julia> g = gdemo2(Array{Float64}(undef, 2))
julia> g() # currently doesn't work as we don't check if x is defined or not

in which the user has to ensure that x can actually be indexed at indices 1 and 2.

@yebai I think gdemo1 would be the cleaner approach but is more involved, as assume for a vector of distributions has the variable x as an argument. The variable is used to check the size and indexing scheme, which is decided based on the variable type and the type of distribution.

if isa(dist, UnivariateDistribution) || isa(dist, MatrixDistribution)
    for i = 1:n
        push!(vi, vns[i], rs[i], dist, 0)
    end
    @assert size(var) == size(rs) "Turing.assume: variable and random number dimension unmatched"
    var = rs
elseif isa(dist, MultivariateDistribution)
    for i = 1:n
        push!(vi, vns[i], rs[:,i], dist, 0)
    end
    if isa(var, Vector)
        @assert length(var) == size(rs)[2] "Turing.assume: variable and random number dimension unmatched"
        for i = 1:n
            var[i] = rs[:,i]
        end
    elseif isa(var, Matrix)
        @assert size(var) == size(rs) "Turing.assume: variable and random number dimension unmatched"
        var = rs
    else
        @error("Turing.assume: unsupported variable container"); error()
    end
end

This is related to the broadcasting issue described in #476 and will only be possible once #476 can be resolved.

The second solution should be possible but is a slightly hackish solution to the actual problem.

trappmartin removed the bug label Sep 17, 2018
@willtebbutt
Member

willtebbutt commented Sep 17, 2018

Potentially controversial opinion: I think that @yebai's example should break. As @trappmartin says, nothing about x is known when the function is invoked, other than that we're assigning to it. More fundamentally, x is an input to that model, and I really don't think that we should be able to run a model that requires arguments without providing them (it's just inconsistent with what one generally expects from a function-like thing); consequently, I also think that @trappmartin's scalar example should break. I do, however, think that we should be able to provide observational data as arguments when appropriate. So here's a proposal that would work nicely, should be consistent with function-like things, and wouldn't really diverge too much from our current approach:

  1. Make compositional modelling work, so that I can say that some variables in one model, bar, are distributed according to some other Turing model, foo, that I've already defined.
  2. If a model requires arguments, then they must be provided (as with any other function).
  3. To get code re-use, a good design pattern is to split the modelling process into two parts: specify a prior model as a Turing model that doesn't take arguments that are used as observational data (foo in our example above), and specify a separate observation model (bar in our example). Then if I want to sample from the prior I just run foo, but if I want to condition on data I use bar. If you try to run bar without providing data, it produces the error that @yebai found, no matter what is going on internally.

This would resolve the inconsistent and complicated behaviour in the current design, whereby sometimes you can run a model that has arguments without providing any arguments, but not always, and whether you can or not depends on exactly what types you have used in your model: if a model has arguments, you must provide them (there's no reason not to allow default values in the same way that regular functions do, though). Moreover, we would still be able to write a stdlib of "prior" models that make no or few assumptions about what RVs might be observed, and compose these together to make larger models. We would also be able to provide "observation" models that are a really thin wrapper around prior models and that cover standard scenarios, e.g. for mixture models one could specify a prior model that returns all of the random variables, and a separate observation model that requires that observations are provided but not mixture components.

@trappmartin
Member

I think I generally agree. This would also simplify the current compiler code and would potentially be much cleaner.

I’m just not quite sure what would be a good syntax.

julia> @model gmodel() = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x ~ Normal(m, sqrt(s))
  return x
end
julia> @model gdemo(x) = begin
  N = length(x)
  x ~ [gmodel() for n in 1:N]
end

@willtebbutt
Member

willtebbutt commented Sep 17, 2018

There are two separate things going on here though: the way we handle sampling from the prior vs providing data, and how to specify vectors of random variables. A scalar version of your example could be

@model prior() = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    x ~ Normal(m, sqrt(s))
    return x
end
@model observation(x) = begin
    x ~ prior()
end

which I think is completely fine. As regards the specification of vectors of RVs, I think that there are a few ways to go about this. I suspect that we should try to avoid upgrading our @~ macro (or whatever succeeds it) to handle stuff other than x ~ distribution-like statements. It seems to me that this will probably lead to an unbounded number of weird edge cases to support. The options as I see them are

  1. stuff like
@model observation(x) = begin
    for n in eachindex(x)
        x[n] ~ prior()
    end
end

is totally fine to my mind, if a little verbose, and should already work, if I'm not mistaken.
2. As discussed in #476, it would be really nice if broadcasting worked properly and models behaved like distributions, so that we could write stuff like

x .~ [prior() for n in eachindex(x)]

The goal here would be to completely avoid having to customise any broadcasting behaviour, so that this would just naturally work. This is a whole separate kettle of fish though.
3. In certain situations, such as the one described here, it might be really helpful to introduce a product distribution-like construct which represents a collection of independent RVs (see JuliaStats/Distributions.jl#722 for my thoughts on this). This isn't a general solution though. It would let you write things like

@model observation(x) = begin
    x ~ Product([prior() for _ in eachindex(x)])
end

(the precise name of the distribution is reasonably open to discussion).

So my thoughts on what we should actually do: refactor the compiler a bit, and worry about vectors of RVs separately.
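
To make the product-construct idea in point 3 concrete, here is a minimal standalone sketch in Julia. The name IIDProduct and its design are assumptions for illustration only; this is not the construct proposed in JuliaStats/Distributions.jl#722.

using Distributions

# A collection of independent component distributions treated as one object.
struct IIDProduct{D}
    components::Vector{D}
end

# Draw each component independently.
Base.rand(p::IIDProduct) = [rand(d) for d in p.components]

# The joint log density is the sum of the component log densities.
Distributions.logpdf(p::IIDProduct, x::AbstractVector) =
    sum(logpdf(d, xi) for (d, xi) in zip(p.components, x))

p = IIDProduct([Normal(0, 1) for _ in 1:3])
x = rand(p)       # three independent draws
logpdf(p, x)      # joint log density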

@cpfiffer
Member

I've put the current sampling methodology in the documentation for now, but I'll update if/when it changes.

@yebai
Member Author

yebai commented Sep 17, 2018

It makes a lot of sense to further push the compiler design towards

  • compositional modelling and
  • improved consistency with standard Julia syntax

For this particular case, I think we can leverage the missing value type. For example,

using Turing

@model gdemo(x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x[1] ~ Normal(m, sqrt(s))
  x[2] ~ Normal(m, sqrt(s))
  return s, m
end

g = gdemo([1.5, 2]) # standard version

g = gdemo([missing, missing]) # treat x as parameter and draw from prior

This would allow us to both

  • simplify the compiler by removing function aliases (e.g. no more default value for x)
  • still allow a user to reuse the same model definition (e.g. for drawing from the prior)

I think the burden of ensuring that the dimensionality of arguments matches their actual use should be left to the user. For example, if the user writes

using Turing

@model gdemo(x) = begin
  s ~ InverseGamma(2,3)
  m ~ Normal(0,sqrt(s))
  x ~ Normal(m, sqrt(s)) # no broadcasting syntax
  return s, m
end

g = gdemo(1.0)
g = gdemo(missing) 

then the user shouldn't be doing g = gdemo([1.0, 2.0]) or g = gdemo([missing, missing]). That said, I do agree we can have a nice broadcasting syntax for defining IID variables.
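
A rough sketch of the check the compiler could perform under this proposal; is_parameter is a hypothetical helper used for illustration, not part of Turing.

# Treat an input as a parameter when it is `missing` or an array of `missing`s.
is_parameter(x) = x === missing || (x isa AbstractArray && all(ismissing, x))

is_parameter(missing)              # true  -> draw x from the prior
is_parameter([missing, missing])   # true  -> draw x from the prior
is_parameter([1.5, 2])             # false -> condition on the data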

@willtebbutt
Member

I do like your missing value proposal @yebai -- I think the ideal would be to support both proposals as they're not in conflict.

@trappmartin
Member

I agree that the broadcasting issue and the issue regarding draws from a prior are two independent issues.

At the moment the solution of @yebai is the most tangible from my point of view. For the compositional model proposal, we would need to construct something that behaves like a distribution. I see the appeal of that and I think it's the way to go in the long run. Maybe we should create an additional issue to discuss a possible redesign of the compiler.

@willtebbutt
Member

Yeah, we've definitely discussed this in other issues (I'm not sure which ones). I do have a habit of bringing the "models should just behave like distributions" thing up on a weekly basis, so apologies for that everyone.

yebai pushed a commit that referenced this issue Sep 18, 2018
yebai pushed a commit that referenced this issue Sep 18, 2018
@mohamed82008
Member

After discussions with @yebai, this issue will be resolved in #613 by allowing the user to specify a default value for the inputs when defining the model, to be used when they are treated as parameters. For example, the following will be equivalent:

@model gdemo(x = Vector{Real}(undef, 2)) = begin
    s ~ InverseGamma(2,3)
    m ~ Normal(0,sqrt(s))
    x[1] ~ Normal(m, sqrt(s))
    x[2] ~ Normal(m, sqrt(s))
    return s, m
end
g = gdemo()
g()

and

@model gdemo() = begin
    x = Vector{Real}(undef, 2)
    s ~ InverseGamma(2,3)
    m ~ Normal(0,sqrt(s))
    x[1] ~ Normal(m, sqrt(s))
    x[2] ~ Normal(m, sqrt(s))
    return s, m
end
g = gdemo()
g()

In other words, when a data variable is not provided as input, it is treated as a parameter. If a default value is assigned, it is used; otherwise it is assigned nothing, which is the current behavior. nothing works for scalars but not for vectors. This is analogous to the default value syntax in Julia. Let me know if you have any objections.

@mohamed82008
Member

mohamed82008 commented Dec 20, 2018

So it seems that a few people were unhappy about the above solution for the following reason, if I understood correctly: if you want to set a default "data" value for x above while keeping it as a dvar, as opposed to just treating it as a pvar, that is not possible with the above syntax. So how can we allow the following:

  1. Treat missing vector data variables as parameters.
  2. Allow for default data value.

One proposal that was discussed today was to use Vector{Missing} to signal to the compiler that we want to treat x as a parameter, while letting us know the length of x so we can define our own Vector{Real}(undef, n) inside the model body. So any value that is not Vector{Missing} will keep x as data. The Vector{Missing} can be set in the model definition as the default value of x or it can be passed as input when constructing the model using the outer function.
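
A minimal sketch of how that convention could be applied; init_parameter is a hypothetical helper for illustration, not Turing API. Note that the eltype can only be guessed as Real, which is exactly the type-stability concern raised below.

# A Vector{Missing} input signals "treat x as a parameter of this length".
init_parameter(x::Vector{Missing}) = Vector{Real}(undef, length(x))
# Any other vector is kept as observed data.
init_parameter(x::AbstractVector) = x

init_parameter([missing, missing])   # -> 2-element Vector{Real}, x is a parameter
init_parameter([1.5, 2.0])           # -> the data itself, x is observed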

At first glance, this seems like a reasonable solution, but it will most likely come back to bite us later when we want to be able to infer the type of x to use a type-stable VarInfo. Just knowing the length is not enough to know the type of vi.vals for this variable; our best guess is Real. I thought of a few alternatives, but each seems to have at least one problem. Perhaps one of the best alternatives from a compiler POV may be:

@model gdemo(x::Vector{Float64} = [missing, missing]) = begin
    ....
end

So using the above syntax, the user can specify whether x is made of Float64s or Ints, allowing for VarInfo specialization when treating it as a parameter. And at the same time, the default value can be assigned to be [missing, missing], in which case x will be treated as a parameter; or the default value can be any other meaningful value, in which case x will be treated as data. To make the above more Julian, we should probably make the user write ::Union{Vector{Missing}, Vector{Float64}}. What I don't like about this solution is that it is pretty lengthy, given that we only want two pieces of information: 1) the length of x, and 2) the eltype of x if treated as a parameter. Alternatively, we can use the following syntax:

@model gdemo(x = Param{Vector{Float64}, 2}()) = begin
    ....
end

Param{Vector{Float64}, 2}() can be made shorthand for Vector{Float64}(undef, 2). So using the above, we know the eltype and length of x, and it is clear from a user POV that x will, by default, be a parameter of this type and length. If we go with this, we won't need the Vector{Missing} anymore; nothing will do, as we would already know the length and eltype from the model definition.
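
A hypothetical sketch of what such a Param placeholder could look like; the name, the type parameters, and the materialize helper are all assumptions for illustration.

# Placeholder carrying the container type and length as type parameters.
struct Param{T, N} end

# Turn the placeholder into an uninitialised container of the declared type and length,
# e.g. Param{Vector{Float64}, 2}() -> Vector{Float64}(undef, 2).
materialize(::Param{Vector{T}, N}) where {T, N} = Vector{T}(undef, N)

materialize(Param{Vector{Float64}, 2}())   # 2-element Vector{Float64}; eltype and length known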

@mohamed82008
Member

If this change turns out to be controversial, I will remove it from the compiler PR and we can discuss it later.

@cpfiffer
Member

I'm partial to the x=Param{Vector{Float64}, 2}() solution, mostly due to the lengthiness of x::Vector{Float64}=[missing, missing].

Maybe this has already been discussed, but is there a reason we can't just do something basic like this:

@model gdemo(x = Param(300, Float64)) = begin
    ....
end

Just a minor thought. I am still a fan of the x=Param{Vector{Float64}, 2}() syntax, though.

@trappmartin
Member

trappmartin commented Dec 20, 2018

I'm generally in favour of using missing because it automatically allows us to handle missing values, which I believe is important in many domains. And it's semantically clear what happens.

I understand the appeal for something like x = Param{Vector{Float64}, 2}(). But I would consider this to be an additional syntax that we might support. The syntax using missing values seems more relevant to me.

@cpfiffer
Member

So, if we were to use missing syntax, could I pass in a vector like [2,missing,5] and have the middle value in that vector be an assumption? Or would that not work?

Perhaps that's a poor example. I don't know why someone would have that kind of data.

@trappmartin
Member

I think that’s where we should be heading. Reasons for missing data can be diverse; in the simplest case, think about sensor data where a sensor doesn’t provide a measurement at time t. Using probabilistic modelling in such cases is attractive as it allows us to handle missing data.

@mohamed82008
Member

@cpfiffer Param(300, Float64) is also possible and we default to Vector.

I'm generally in favour of using missing

@trappmartin if we want to allow this fine granularity of missing data, then we will have to check whether each variable is observed or assumed at run-time. I am ok with this, if we want to go that way. So basically, the partially observed data variable can be of type Vector{Union{T, Missing}}, and we can assume any missing value. Union{T, Missing} is efficient in Julia, so that may not hurt performance much. But note that this doesn't solve the main problem addressed above, which is what happens when the whole vector is missing. We still need to know the eltype and length of the vector.

@mohamed82008
Member

mohamed82008 commented Dec 21, 2018

so that may not hurt performance much

Actually, it should not hurt performance at all for fully-parameter and fully-data variables, because if we use an if statement like this:

if x[i] isa Missing
    ...
else
    ...
end

and x is of type Vector{Float64}, then the Julia compiler should be able to compile this if statement away. So this may be a pure gain if done right.
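
A small self-contained illustration of that point; handle is a stand-in function, not Turing code, and the arithmetic merely stands in for the assume/observe branches.

# For Vector{Float64} the `x[i] isa Missing` branch is statically false and is removed;
# for Vector{Union{Float64, Missing}} it is checked per element.
function handle(x::AbstractVector)
    total = 0.0
    for i in eachindex(x)
        if x[i] isa Missing
            total += 1.0      # stand-in for "assume" (draw from the prior)
        else
            total += x[i]     # stand-in for "observe" (use the datum)
        end
    end
    return total
end

handle([1.0, 2.0])        # branch compiled away
handle([1.0, missing])    # branch taken per element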

@trappmartin
Member

As I said, I’m fine with the syntax you propose, and I’m also fine with the clumsier syntax using missing values. And inserting those if statements would be a great feature in my opinion.

But let’s keep @yebai and @willtebbutt in the loop to find a sensible solution everyone agrees on.

@willtebbutt
Member

willtebbutt commented Dec 21, 2018

I'm actually fairly opposed to Turing automatically doing something with arrays comprising a mix of Missing and observed data. Whatever implementation we eventually use will necessarily add complexity to the compiler, and it's not clear to me whether or not this kind of feature is actually that useful. So I guess that I have a few thoughts:

  • Have we received any indication that this kind of feature is in high demand?
  • If so, is there a solution that we can suggest that doesn't involve us baking it into the language? i.e. some design pattern we can suggest when writing programmes to make it easy for users to work around missing data?

@trappmartin
Copy link
Member

I’m not aware of a feature request on missing values.

Regarding solutions, in a Bayesian framework the obvious solution is to treat missing values as unknown parameters. Alternative solutions are, for example, to exclude observations with missing values.

I personally think it would be good to handle missing values as other PPLs do, especially as the overhead in the compiler would be minimal. But I won’t try to push it if there are reasons not to support missing values.

@willtebbutt
Member

I personally think it would be good to handle missing values as other PPLs do.

@trappmartin Could you point me towards an example of this please?

in a Bayesian framework the obvious solution is to treat missing values as unknown parameters

I don't disagree. My point is that there is more than one way to realise this in practice. For example, if as a user I have a vector of data and I know that a subset of the data is missing, then it's usually quite straightforward to chop it up into two vectors by hand, one containing the missing stuff and the other the observed stuff. You can then treat these as two separate vectors in the model, and continue as usual. I would argue that this is a straightforward solution to the problem which doesn't require any work on our side. It also means that there's one less language feature for the user to concern themselves with, and it won't usually add a huge amount of work on their part.
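
A sketch of that by-hand splitting pattern, using the @model syntax from earlier in this thread; split_demo is a hypothetical model name, and it assumes (as in the examples above) that indexing into an undef Vector{Real} inside the model works.

using Turing

raw = [1.5, missing, 2.0, missing]                 # data with some missing entries
x_obs = Float64[v for v in raw if !ismissing(v)]   # observed part
n_mis = count(ismissing, raw)                      # number of missing entries

@model split_demo(x_obs, n_mis) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    x_mis = Vector{Real}(undef, n_mis)     # missing entries become parameters
    for i in eachindex(x_obs)
        x_obs[i] ~ Normal(m, sqrt(s))      # observed entries condition the model
    end
    for i in eachindex(x_mis)
        x_mis[i] ~ Normal(m, sqrt(s))      # missing entries are drawn from the prior
    end
    return x_mis
end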

There are also some non-trivial technical challenges. For example, if

x ~ MvNormal(m, C)

and we want to provide x as data with some Missing elements, then it's clear how to proceed conceptually, but I don't really know what we would need to do from an implementation perspective. It somehow feels different to the

for n in ...
    x[n] ~ Foo
end

situation, where my impression is that we have a better idea of how to implement stuff.

I'm open to being convinced about this, but the crux of my argument against doing it is basically that

  • It's not a crucial feature (straightforward workarounds exist),
  • It's not trivial to implement / maintain, and
  • we have other priorities

@yebai where do you stand on this?

@trappmartin
Member

Hm, I’m not sure I understand your approach. 😅

Most cases of missing values I have encountered were those in which a sensor would sometimes not provide a measurement. Meaning that you would usually observe measurements from multiple sensors at a time, and that some sensor sometimes lacks information for whatever reason, e.g. out-of-range values. How would I proceed with your approach?

I’m open to everything btw.

Regarding other PPLs, Stan and pymc can handle missing values to some extent.

@trappmartin
Member

https://docs.pymc.io/notebooks/getting_started.html

See: Case study 2: Coal mining disasters

@yebai
Member Author

yebai commented Dec 21, 2018

Hmm, I raised this issue of replacing undef with missing yesterday. My primary motivation is to make the semantics of Mohamed’s syntax more explicit, i.e. these are missing values, not data. I can see @trappmartin's point for supporting missing values; I think it is important, but it should be done carefully in a separate effort. I lean towards @willtebbutt's idea of keeping this out of the compiler: mixing missing values and data at the compiler level might lead to unintended confusion.

@mohamed82008 perhaps let’s focus on the dimension issue in the compiler PR, i.e. replace undef with missing and always treat such variables as parameters.

@mohamed82008
Member

@yebai could you write down some test cases and the expected behavior? There are a few ideas thrown around in this thread, and some are incompatible with others, so I want to nail down what is actually required.

@mohamed82008
Member

I think the 2 main use cases I have in mind are:

@model gdemo(x) = begin
    s ~ InverseGamma(2,3)
    m ~ Normal(0,sqrt(s))
    x[1] ~ Normal(m, sqrt(s))
    x[2] ~ Normal(m, sqrt(s))
    return s, m
end
g = gdemo([missing, missing]) # treats x as a parameter of length 2 and eltype Real

and

g = gdemo(Union{Missing, Float64}[missing, 1.0]) # what should this be?

If we decide to treat x in the second case as a full parameter of length 2 and eltype Float64, then @trappmartin 's use case of partially missing data will be blocked AFAICT. If we introduce a new syntax for defining parameters of specific length and eltype like Param(2, Float64), then we can still accommodate Martin's use case. All of this is orthogonal to the default value of x which can be assigned using the syntax:

@model gdemo(x = default_value) = begin
    ....
end

This default value could be meaningful data, or one of the missing vectors from above, signalling that x is a parameter.

@mohamed82008
Member

Since the specifications of this change that everyone agrees on are still not clear to me, I will remove it from the compiler PR and do it in a separate PR.

@yebai
Member Author

yebai commented Dec 22, 2018

Hi @mohamed82008, sorry for the confusion. Let's only support the following syntax in the compiler PR (for now):

@model gdemo(x = default_value) = begin
    ....
end

to provide a way to initialize data variables, i.e.

  • variable x is always treated as a parameter in sample(gdemo()), regardless of whether default_value contains a missing value or a Float64.
  • variable x is always treated as observed data in sample(gdemo([0.0, 1.0])).

The following syntax should be forbidden until we explicitly support missing values during inference.

g = gdemo([missing, missing])                    # returns an error at runtime
g = gdemo(Union{Missing, Float64}[missing, 1.0]) # returns an error at runtime
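
For reference, a sketch of the agreed behaviour, reusing the default-value syntax from the earlier examples in this thread (illustrative only):

@model gdemo(x = Vector{Real}(undef, 2)) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    x[1] ~ Normal(m, sqrt(s))
    x[2] ~ Normal(m, sqrt(s))
    return s, m
end

g_prior = gdemo()             # x treated as a parameter; the default supplies length and eltype
g_data  = gdemo([0.0, 1.0])   # x treated as observed data
# gdemo([missing, missing])   # forbidden for now: errors at runtime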

mohamed82008 added a commit that referenced this issue Dec 28, 2018
yebai pushed a commit that referenced this issue Dec 28, 2018
* compiler refactor proof of concept

* rename compiler to model_info

* improve compiler readibility and make it more compact

* invokelatest -> calling model

* respond to Martin's comments

* fix ambiguity

* fix missing data case and make more compact

* fix broken compiler.jl/explicit_ret.jl test

* fix sampling from prior

* make compiler more compact

* fix update_vars!

* fix tests

* fix mh.jl tests

* fix Sampler API

* uncomment commented tests

* fix seed in test/mh.jl/mh_cons.jl

* make pvars and dvars type parameters in CallableModel

* make data a field in CallableModel

* fix #544 - allows default value of data vars when treated as params

* add support for passing data as kwargs to outer function

* model::Function -> model

* CallableModel -> Model

* compiler hygiene and organization

* docstrings and some renaming

* respond to Will's comments

* some more style issues

* add test for #544

* comment

* model -> model::Model