expand fields to allow multiple values #4

Open
pdowler opened this issue Jun 6, 2024 · 15 comments
Labels
2.5 enhancement New feature or request

Comments

@pdowler
Member

pdowler commented Jun 6, 2024

this comes from an archive partners slack discussion started by David Rodrigues

proposal.id
telescope.name
instrument.name
plane.energy.bandpassName

... maybe more

@pdowler pdowler added the enhancement New feature or request label Jun 6, 2024
@pdowler pdowler changed the title from "multiple proposal ID values in an observation" to "fields with multiple values in an observation" Jun 6, 2024
@pdowler pdowler transferred this issue from opencadc/caom2 Jul 5, 2024
@pdowler pdowler changed the title from "fields with multiple values in an observation" to "expand fields to allow multiple values" Jul 18, 2024
@dr-rodriguez

Some examples from MAST.
Looks like we haven't been consistent about whether to use _, ;, or | to split strings:

jw01282-c1010_t005_miri_f1550c-mask1550

<bandpassName>
	F1550C;4QPM_1550
</bandpassName>

hlsp_tasoc_tess_ffi_tic00166699853-s0004-cam2-ccd2-c1800_tess_v05

<proposal>
	<id>
		G011160_G011155_G011188
	</id>
</proposal>

hlsp_ullyses_hst-fuse_fuse-stis_sk-71d8_uv

<telescope>
	<name>
		FUSE;HST
	</name>
</telescope>
<instrument>
	<name>
		FUV | STIS/FUV-MAMA
	</name>
</instrument>
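
To make the inconsistency concrete, here is a minimal Python sketch that splits the serialized values shown above. The `SEPARATORS` mapping and the `split_values` helper are hypothetical, introduced only for illustration; the separators are simply the ones that appear in these three examples, not an agreed convention.

```python
# Hypothetical per-field separators, taken from the examples above.
# These are NOT an agreed convention; the choice is still under discussion.
SEPARATORS = {
    "proposal.id": "_",
    "telescope.name": ";",
    "instrument.name": "|",
    "plane.energy.bandpassName": ";",
}


def split_values(field: str, raw: str) -> list[str]:
    """Split a serialized multi-valued string using the separator assumed for that field."""
    sep = SEPARATORS[field]
    return [part.strip() for part in raw.split(sep) if part.strip()]


print(split_values("plane.energy.bandpassName", "F1550C;4QPM_1550"))
print(split_values("proposal.id", "G011160_G011155_G011188"))
print(split_values("instrument.name", "FUV | STIS/FUV-MAMA"))
```

Note that one separator could not be applied to every field here: splitting the bandpassName example on `_` would break `4QPM_1550` into two pieces, which is why the sketch keys the separator on the field.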

@pdowler
Member Author

pdowler commented Jul 30, 2024

What does it mean if a plane has multiple bandpassName(s)? I recognize the form of F1550C as a filter around 1550nm...

Does it mean multiple filters in the light path, so the Plane.energy.bounds would be the intersection of the two?

In a DerivedObservation, one could stack inputs in different filters to make a wider Plane.energy.bounds, possibly with meaningful sub-samples. I know there are also some "white light" images like that...

These two possibilities kind of reduce to "light in filter A and/or B" respectively (logical and/or)... would the model only need one interpretation or also a way to specify and vs or?

@pdowler
Member Author

pdowler commented Jul 30, 2024

For multiple proposals, I'm not really sure what the usage means. I can imagine a queue-scheduled observatory optimising by taking one observation that two different proposals requested and wanting to "assign" it to both of them... (see below)

If a proposal has an observing plan that includes making stacks, then DerivedObservation(s) with proposal information makes sense. That's why Proposal is common to all Observation, not just SimpleObservation.

However, if a 3rd party were to decide to stack SimpleObservation(s) from different proposals, I would not want to assign any proposal information to the resulting DerivedObservation: the proposal indicates "this thing exists because of proposal xyz" and that's just not true of all DerivedObservation(s). They are more likely to fit "this thing exists because of project abc" (which would be in Plane.provenance.project).

@pdowler
Member Author

pdowler commented Jul 30, 2024

below:

There was a radio use case where they collect data (SimpleObservation) of a large region of the sky and different parts of the data are destined for different proposals (and the implied different permissions). The solution there was that we renamed CompositeObservation to DerivedObservation and the plan would be to have the SimpleObservation (owned by the observatory) and create a DerivedObservation for each proposal. The extra catch there was that in radio one would actually want to extract different subsets of the data for each of those so it did not overlap with the "two observations include the same artifact issue".

So, that was one way that one data acquisition was "assigned" to two proposals, but it is a special case where there really are new data created by some processing.

@pdowler
Member Author

pdowler commented Jul 30, 2024

on multiple telescopes: I assume the MAST use case is "DerivedObservation made by combining data from multiple telescopes". In radio (interferometry) there are also VLBI observations that involve multiple telescopes and there is the "multiple dishes in a single facility" case, and there I don't know where the line between telescope -- collector -- detector -- correlator lies. I'm toying with the Telescope and Instrument classes and if that looks promising I might pull this out as a separate issue.

As it stands, we could not change cardinality of Telescope.name by itself because multiple telescopes are not at the same location. If we go ahead with removing WCS from CAOM for 2.5, the need for Telescope.geoLocation* would kind of go away (I think it's only useful for some spectral reference frame transforms).

@dr-rodriguez

What does it mean if a plane has multiple bandpassName(s)? I recognize the form of F1550C as a filter around 1550nm...

Does it mean multiple filters in the light path, so the Plane.energy.bounds would be the intersection of the two?

In a DerivedObservation, one could stack inputs in different filters to make a wider Plane.energy.bounds, possibly with meaningful sub-samples. I know there are also some "white light" images like that...

These two possibilities kind of reduce to "light in filter A and/or B" respectively (logical and/or)... would the model only need one interpretation or also a way to specify and vs or?

I've actually seen both situations (and vs or).

I think the most common are DerivedObservations that are stitched spectra across multiple bandpasses (eg, HASP, HLSP). So it's the wavelength coverage of each stacked together. Our implementation in Plane.energy.bounds is the UNION of all wavelength ranges covered in those cases.

There are also a few examples where there is more than one optical element in place, for example a filter plus a grating. I believe in that situation we've ignored or glossed over any impact of the grating and just captured the wavelength range of the filter. In those cases, I think we've used something like / to separate the grating from the filter.
In the example above, F1550C;4QPM_1550 refers to a JWST MIRI observation in the F1550C filter (about 15.5 microns) that utilizes a coronagraph mask optimized for that wavelength: https://jwst-docs.stsci.edu/jwst-mid-infrared-instrument/miri-observing-modes/miri-coronagraphic-imaging#gsc.tab=0
One could argue the grating/mask/coronagraph isn't really part of the bandpass, which is fair, but we're using this field to capture that information so users can search specifically for it. Perhaps this argues for a more generalized optical element section in the model?

I can imagine situations where observers may want to have two filters to get a much narrower bandpass (eg, suppressing red light from an inefficient blue filter). The bounds I would want to record for those are the INTERSECTION rather than the UNION of the filters. That isn't always possible though, and I don't have a good feel for how frequently it happens among the MAST datasets.

I think what I would like is to store multiple bandpasses to support the first use case: each element in the bandpass list is something that will be UNION-ed together to form the final bounds range.
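
The difference between the two interpretations can be sketched in a few lines of Python. The helper names and the numeric bounds below are made up for illustration (they are not real filter curves), and the union here is only the outer envelope; any gaps between stitched bandpasses would need sub-samples to be described.

```python
from typing import Optional

Interval = tuple[float, float]  # (lower, upper) wavelength bounds, any consistent unit


def union_bounds(intervals: list[Interval]) -> Interval:
    """Outer bounds covering every interval (stitched or stacked bandpasses)."""
    return (min(lo for lo, _ in intervals), max(hi for _, hi in intervals))


def intersection_bounds(intervals: list[Interval]) -> Optional[Interval]:
    """Overlap common to every interval (filters in series), or None if there is none."""
    lo = max(lo for lo, _ in intervals)
    hi = min(hi for _, hi in intervals)
    return (lo, hi) if lo < hi else None


# Made-up bounds for two hypothetical filters, for illustration only.
filter_a = (400.0, 600.0)
filter_b = (550.0, 700.0)

print(union_bounds([filter_a, filter_b]))         # (400.0, 700.0)
print(intersection_bounds([filter_a, filter_b]))  # (550.0, 600.0)
```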

@dr-rodriguez

dr-rodriguez commented Aug 7, 2024

For multiple proposals, I'm not really sure what the usage means. I can imagine a queue-scheduled observatory optimising by taking one observation that two different proposals requested and wanting to "assign" it to both of them... (see below)

If a proposal has an observing plan that includes making stacks, then DerivedObservation(s) with proposal information makes sense. That's why Proposal is common to all Observation, not just SimpleObservation.

However, if a 3rd party were to decide to stack SimpleObservation(s) from different proposals, I would not want to assign any proposal information to the resulting DerivedObservation: the proposal indicates "this thing exists because of proposal xyz" and that's just not true of all DerivedObservation(s). They are more likely to fit "this thing exists because of project abc" (which would be in Plane.provenance.project).

For the TESS mission, it's as you initially describe: it's a survey-like situation where users can propose to extract select stars at a higher cadence than the full-frame images. The mission office selects these targets and while they all have the same PI (George Ricker, the PI of the mission itself), the targets get associated with every proposal that submitted them.

Another example is the Hubble Advanced Products (HAP), specifically the multi-visit mosaics. These are DerivedObservations that are mosaics made from separate programs. The artifacts of these observations are new images of some area of the sky that are drizzled, rotated, and stacked together. The mission specifically asked us to capture every program that has been used in them so that users can search for multi-visit mosaics produced from specific programs. I believe for these we have Plane.provenance.project be HAP-MVM. That is, these observations exist because of the HAP-MVM project, but the original programs are indicated in Observation.prpID.

This same situation will happen with the spectroscopic analogue, HASP, once they start stacking spectra across visits.

@dr-rodriguez

on multiple telescopes: I assume the MAST use case is "DerivedObservation made by combining data from multiple telescopes". In radio (interferometry) there are also VLBI observations that involve multiple telescopes and there is the "multiple dishes in a single facility" case, and there I don't know where the line between telescope -- collector -- detector -- correlator lies. I'm toying with the Telescope and Instrument classes and if that looks promising I might pull this out as a separate issue.

As it stands, we could not change cardinality of Telescope.name by itself because multiple telescopes are not at the same location. If we go ahead with removing WCS from CAOM for 2.5, the need for Telescope.geoLocation* would kind of go away (I think it's only useful for some spectral reference frame transforms).

This has been used for High-Level Science Products (HLSPs), where things like spectra from more than one telescope/instrument have been stitched together to have broader spectral coverage. I think HASP (Hubble Advanced Spectroscopic Products) currently only uses HST data, but they do combine multiple instruments and create new products from that.

@pdowler
Member Author

pdowler commented Aug 7, 2024

bandpassName

I think I agree that multiple bandpassName(s) should be the wider union usage you describe. User queries could still use bandpassName = 'F555w' to find exactly that filter or they could use bandpassName LIKE '%F555w%' to find all data using that filter. That's a query style consistent with polarization (I/Q/U/V cube, for example). There would be a restricted character in bandpassName values that would be used to serialise them into a single column (for TAP queries); that separator in keywords is |, which is notably not consistent with the / usage in polarization states (from ObsCore).

The narrower multiple filter intersection thing belongs in a possible enhancement of the instrument model (eg to describe the path of the signal through components before reaching the detector). In that case, you could still construct a single bandpassName with two filter names in it using a different separator (eg F555w+F490). We could possibly reserve a character for this usage... TBD.

In either case, Plane.energy.bounds would give the correct/representative numeric band limits.
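
The two query styles can be tried out with a minimal sketch using Python's built-in `sqlite3` module. The table name, column layout, and plane IDs are hypothetical, and SQLite is only a stand-in for a TAP service; the `|` separator and the filter names `F555w` and `F490` are the ones mentioned above.

```python
import sqlite3

SEP = "|"  # separator mentioned above for serialising values into one column

# Hypothetical table and column names; a real TAP service exposes its own schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plane (plane_id TEXT, bandpassName TEXT)")
rows = [
    ("p1", ["F555w"]),
    ("p2", ["F555w", "F490"]),
    ("p3", ["F490"]),
]
conn.executemany(
    "INSERT INTO plane VALUES (?, ?)",
    [(pid, SEP.join(names)) for pid, names in rows],
)

# Exact match: only planes whose single bandpass is F555w.
exact = [r[0] for r in conn.execute(
    "SELECT plane_id FROM plane WHERE bandpassName = 'F555w' ORDER BY plane_id")]

# Substring match: any plane whose serialized list contains F555w.
contains = [r[0] for r in conn.execute(
    "SELECT plane_id FROM plane WHERE bandpassName LIKE '%F555w%' ORDER BY plane_id")]

print(exact)     # ['p1']
print(contains)  # ['p1', 'p2']
```

One caveat with the substring form: it would also match any longer name that happens to contain the searched string, so the reserved separator does not by itself make `LIKE` queries exact.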

@pdowler
Member Author

pdowler commented Aug 7, 2024

telescopes/instruments

I have increased the scope of #11 to cover this topic.

@pdowler
Member Author

pdowler commented Aug 7, 2024

proposalID

I feel like proposalID itself should be a reference into an (external) proposal system where one would also find metadata that was consistent with the limited subset that is part of the Proposal class itself. Simply allowing multiple proposalID values all with the same PI, title, keywords seems like it would be confusing.

This really is part of what provenance should cover and in principle one could write a query to extract the proposalID(s) of all the member observations (admittedly, it would be complicated because the join would bloat the query result and since 2.4 we allow members to be Observation so there's an arbitrary sequence of joins to collect all the initial proposalID values). Does that give essentially the same result?

I'm thinking more along the lines of an optional Proposal.memberProposals [0..*] : String but it would be more or less an optimization... thoughts?
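
The "arbitrary sequence of joins" over members amounts to a recursive walk. Here is a minimal Python sketch of that traversal; the `Obs` class, its field names, and the example URIs are hypothetical stand-ins, not the CAOM classes or any real identifiers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Obs:
    """Hypothetical stand-in for an Observation; not the CAOM class itself."""
    uri: str
    proposal_id: Optional[str] = None
    members: list["Obs"] = field(default_factory=list)


def member_proposal_ids(obs: Obs, _seen: Optional[set[str]] = None) -> set[str]:
    """Collect proposal IDs from all members, following nested members to any depth."""
    seen = set() if _seen is None else _seen
    found: set[str] = set()
    for member in obs.members:
        if member.uri in seen:
            continue  # guard against visiting the same member twice
        seen.add(member.uri)
        if member.proposal_id is not None:
            found.add(member.proposal_id)
        found |= member_proposal_ids(member, seen)
    return found


# Made-up example: a mosaic built from a stack (of two observations) plus one more.
a = Obs("example:a", "P1")
b = Obs("example:b", "P2")
stack = Obs("example:stack", None, [a, b])
mosaic = Obs("example:mosaic", None, [stack, Obs("example:c", "P1")])

print(sorted(member_proposal_ids(mosaic)))  # ['P1', 'P2']
```

A field like Proposal.memberProposals would in effect cache the result of this walk on the derived observation, which is why it reads as an optimization.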

@dr-rodriguez

proposalID

At MAST, our UI has a link to HST/JWST proposals for additional information. That's hard-coded into the UI (something along the lines of if HST, take observation.prpID and make it a link), I believe it might be broken for multi-proposal cases but that's a separate matter.

The situation about extracting from member observations is that some use-cases don't have members. For example, the TESS lightcurves can have multiple proposal IDs but they have no members.
For multi-proposal situations, I think we're either not capturing title/keywords/PI or using a global one for the entire collection. So the multiplicity of having ID associated to one PI but not another is not something we've come across.

As for an optional Proposal.memberProposals- that doesn't sound too bad to me in cases where it's appropriate.
Would DerivedObservations that are mosaics/stitched spectra not have Observation-level proposal information then?

My biggest worry is that it may make things a little more complicated- some observations would have proposal information in one place, others in another. We may be forced at MAST to consolidate them so even if we add a place for Proposal.memberProposals, we might be forced to also copy the same value to the Observation level which defeats the purpose of having it to begin with.

@pdowler
Member Author

pdowler commented Aug 8, 2024

I was thinking of Proposal.memberProposals as a mechanism to denote attribution: this DerivedObservation was made from data that originates from these member observations. It still seems like an optimization that more generically is already part of "members" and "inputs" (provenance).

It also does not solve the other issue where a SimpleObservation is part of (assigned to) multiple proposals; in this sense I'm thinking of proposalID as an indicator of access rights; that could be a PI querying to "find all my data"... those proposals that share an observation do have different PI, title, keywords and need to be distinct Proposal objects.

So it's really the composition Observation.proposal that would change from 0..1 to 0..* and that's pretty painful to make the database and querying sane... need to think about how to denormalize/flatten that into something useful.

@pdowler
Member Author

pdowler commented Aug 13, 2024

proposalID

To change the cardinality here, we would have to change the model so that an Observation simply had 0..* proposalID values and if present those could be used in a 1-n join to a set of Proposal(s) -- eg a separate table. This would essentially be undoing the denormalisation that CAOM has to make Observation the root class in every instance. That would make Proposal an entity on its own that would be harvested separately. It is a more normalised representation that ultimately makes Observation less stand-alone than it currently is. There are 3 major denormalisations in the CAOM design:

Collection-Observation -> Observation w/ collection field
Telescope-Instrument-Observation -> Observation w/ telescope and instrument fields
Proposal-Target-TargetPosition-Requirements -> Observation with those 4 separate fields

Normalising Proposal would allow an observation to belong to 2+ proposals, but it is definitely a slippery slope because those proposals don't necessarily have the same target (objects), specific target positions, and very likely not the same requirements (eg an observation might meet requirements for one proposal but fail wrt. the other). So this would be quite a mess.

There is a draft IVOA ProposalDM so any effort to normalise the Proposal class/model would likely have to take that into account in addition to the current MAST-ESAC proposal metadata details. The current denormalisation is clearly a "copy a few useful bits" and doesn't really conflict with whatever happens in that other work.


For the use case of commensal observing (assign one observation to 2 proposals for implied or explicit access rights) the currently recommended approach of creating 1 SimpleObservation (no proposal) and 2 DerivedObservation with one Proposal each works. The algorithm.name would indicate that the derived are copies or it could indicate they are subsets of the single member (that's one of the radio use cases). That does mean there are 3 observations for potentially the same science data (same Artifact.uri). It would be plausible to not create planes and/or artifacts for the base SimpleObservation and thus more or less "hide it" from normal queries. Unless there is some subset operation, the planes and artifacts of the two derived observations would be identical in one extreme (down to same Artifact.uri), but could in principle differ in processing and thus have different Plane metadata and different files (Artifact.uri). So, this approach allows for some additional redundancy and would likely create two paths to the same file (see #2) that would need to be allowed in the degenerate case.
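
As a rough illustration of that layout (one base observation without a proposal, one derived observation per proposal), here is a minimal Python sketch. The `Observation` dataclass is a heavily simplified, hypothetical stand-in for the CAOM classes, and every identifier, algorithm name, and URI in it is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Simplified, hypothetical stand-in for a CAOM Observation; most fields omitted."""
    observation_id: str
    algorithm_name: str
    proposal_id: Optional[str] = None
    artifact_uris: list[str] = field(default_factory=list)
    members: list[str] = field(default_factory=list)


# One acquisition owned by the observatory, with no proposal attached.
# "exposure" is used here as a placeholder algorithm name.
base = Observation(
    observation_id="obs-1",
    algorithm_name="exposure",
    artifact_uris=["example:data/file-1.fits"],
)

# One derived observation per proposal; "copy" is an illustrative algorithm name.
derived = [
    Observation(
        observation_id=f"obs-1-{pid}",
        algorithm_name="copy",
        proposal_id=pid,
        artifact_uris=list(base.artifact_uris),  # degenerate case: same Artifact.uri
        members=[base.observation_id],
    )
    for pid in ("propA", "propB")
]

for obs in derived:
    print(obs.observation_id, obs.proposal_id, obs.artifact_uris)
```

In the degenerate case shown, all three observations point at the same file, which is the "two paths to the same file" situation that would need to be allowed.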

Although re-use of Artifact.uri allows one to share a file between two observations, it would probably be confusing to just make two SimpleObservation (different Observation.observationID, same Artifact.uri) because it would not be easy to distinguish that from a mistake.

For the use case of tracking all uses of data from proposal X or all the proposals that contributed data to this (derived) observation, that is a provenance issue (navigating forward or backwards respectively). The model as it stands can support such queries but they are probably best tackled with navigation (drill down to details). There is an IVOA Provenance DM and the CAOM Provenance is a very simple one step provenance in that context.


At this point, I don't think changing cardinality of proposalID is feasible. Of course, the field is an opaque string (with collection-specific meaning) so there are no rules about values in place, but I think it would be confusing, dangerous, and maybe short sighted to abuse it with multiple values. I think it is fine for DerivedObservation(s) created outside the scope of a Proposal to not have any proposal info at all: essentially no proposal info could mean unknown or multiple.

@pdowler
Member Author

pdowler commented Aug 13, 2024

final result:

@pdowler pdowler added the 2.5 label Aug 13, 2024