Commit

raw bibtex strings, notebook
TristanCantatGaudin committed Jun 13, 2024
1 parent 82837af commit 55514d2
Showing 2 changed files with 105 additions and 55 deletions.
156 changes: 103 additions & 53 deletions docs/notebooks/SubsampleSF_HMLE_Tutorial.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,12 @@
"cells": [
{
"cell_type": "markdown",
"id": "08d746b7",
"id": "f3f35c70",
"metadata": {},
"source": [
"# 📊 Hierarchical selection function for a subset of the Gaia catalogue\n",
"# 🌳 Hierarchical selection function for a subset of the Gaia catalogue\n",
"\n",
"last updated: 2024-06-13\n",
"\n",
"The GaiaUnlimited module **subsample** provides a simple way of counting sources satisfying user-defined criteria, out of the whole Gaia catalogue. The result is given per bin of magnitude and/or colour, and per healpix region on the sky. The colour and magnitude binning can be chosen by the user, as well as the spatial resolution (order of the healpix tessellation).\n",
"\n",
Expand All @@ -16,20 +18,17 @@
},
{
"cell_type": "markdown",
"id": "629f90ec-029e-456f-9df3-81844f8a49e3",
"id": "ddd96a7b",
"metadata": {},
"source": [
"This algorithm is implemented in `SubsampleSelectionFunctionHMLE` class. Here are possible use cases:\n",
"This algorithm is implemented in the `SubsampleSelectionFunctionHMLE` class. This hierarchical MLE can be applied in two ways:\n",
"\n",
"1. Use the same way as the `SubsampleSelectionFunction` class: the `subsample_query`, `file_name` and `hplevel_and_binning` values are passed to the constructor of the class. The data will be collected through the Gaia TAP+ interface then processed.\n",
"2. No parameters are passed to the constructor, an empty class instance is created. The data should be provided later by user and processed with the `use` method.\n",
"3. An instance of the `SubsampleSelectionFunction` class is passed to the `use` method. This assumes that the data has already been collected using the `SubsampleSelectionFunction` class.\n",
"4. `pandas.DataFrame` and `hplevel_and_binning` are passed to the function `use`.\n",
"5. `xarray.Dataset` and `hplevel_and_binning` are passed to the function `use`.\n",
"1. One shot using only the `SubsampleSelectionFunctionHMLE` class: the `subsample_query`, `file_name` and `hplevel_and_binning` values are passed to the constructor of the class. The data will be collected through the Gaia TAP+ interface then processed.\n",
"2. If you have already collected the data (typically with `SubsampleSelectionFunction`, but also as your own `pandas.DataFrame` or `xarray.Dataset`), pass it to `SubsampleSelectionFunctionHMLE` through its `use` method.\n",
"\n",
"We will look at some of these cases below.\n",
"Both cases are covered in this notebook.\n",
"\n",
"In either case, the constructor or the `use` function may be informed with the confidence level `z` which is the **(1-α/2)** quantile of a standard normal distribution (i.e. probit, see [Wiki](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)), and **α** is the confidence level:\n",
"The constructor or the `use` method accepts a parameter `z`, the **(1-α/2)** quantile of a standard normal distribution (i.e. the probit, see [Wiki](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)), where **α** is the error rate (one minus the confidence level):\n",
"\n",
"- Confidence level = 95% => error rate = 0.05 => z = 1.96\n",
"- Confidence level = 68% => error rate = 0.32 => z = 0.99\n",
Expand All @@ -41,16 +40,16 @@
},
{
"cell_type": "markdown",
"id": "13f2c853-986d-49ab-8ca0-055333a40126",
"id": "9f939310",
"metadata": {},
"source": [
"## Use case 1"
"## Use case 1 - `SubsampleSelectionFunctionHMLE`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6fc778ba",
"id": "dbac36a3",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -71,7 +70,7 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "a05f7ef2",
"id": "95f3dbf8",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -84,7 +83,7 @@
},
{
"cell_type": "markdown",
"id": "54bb32f6",
"id": "178e75c0",
"metadata": {},
"source": [
"Define dependencies of the selection function:\n",
Expand All @@ -100,7 +99,7 @@
{
"cell_type": "code",
"execution_count": 3,
"id": "f439a7ab",
"id": "42b259d8",
"metadata": {},
"outputs": [],
"source": [
Expand All @@ -109,7 +108,7 @@
},
{
"cell_type": "markdown",
"id": "8759897c",
"id": "7d3a906a",
"metadata": {},
"source": [
"Launch the query to the Gaia DR3 catalogue to determine which fraction of sources have a continuous XP spectrum (**has_xp_continuous**). This operation can take ~40 minutes.\n",
Expand All @@ -120,20 +119,47 @@
{
"cell_type": "code",
"execution_count": 4,
"id": "17fa0029-198e-406a-ab95-2a7298dbe5d6",
"id": "1de5dffb",
"metadata": {},
"outputs": [],
"source": [
"# Just for example: store the collected data in the local directory\n",
"os.environ['GAIAUNLIMITED_DATADIR'] = './gaiaunlimited.data'"
"os.environ['GAIAUNLIMITED_DATADIR'] = './hmle_data'\n",
"os.makedirs('./hmle_data', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80b09818",
"execution_count": 5,
"id": "e629c06f",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO: Query finished. [astroquery.utils.tap.core]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:29: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" if ds.dims.keys() - set([\"ipix\"]) == {\"g\", \"c\"}:\n",
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:32: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" diff = set(ds[\"logitp\"].dims) - ds.dims.keys()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2min 2s, sys: 5.14 s, total: 2min 7s\n",
"Wall time: 34min 4s\n"
]
}
],
"source": [
"%%time\n",
"subsampleSF_HMLE \\\n",
Expand All @@ -144,7 +170,7 @@
},
{
"cell_type": "markdown",
"id": "afd1f19a",
"id": "519baa1c",
"metadata": {},
"source": [
"Now we want to visualise the results for the entire sky, so we generate a list of coordinates of the centers of all healpix regions of order 6.\n",
Expand All @@ -157,7 +183,7 @@
{
"cell_type": "code",
"execution_count": 6,
"id": "558ff246",
"id": "a17be925",
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -223,20 +249,39 @@
},
{
"cell_type": "markdown",
"id": "d9e24d1d-9261-4cad-a014-f6ccd61e2c12",
"id": "0ce932df",
"metadata": {},
"source": [
"## Use case 3\n",
"## Use case 2 - first `SubsampleSelectionFunction`, then apply HMLE\n",
"\n",
"Here, we fetch the same data with the `SubsampleSelectionFunction` class, then pass it to the new class. Estimate the confidence interval at the 68% C.L."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72fad95b-a0c6-45a8-a4c6-9d33d91f66ee",
"execution_count": 7,
"id": "6458bdfc",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:29: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" if ds.dims.keys() - set([\"ipix\"]) == {\"g\", \"c\"}:\n",
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:32: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" diff = set(ds[\"logitp\"].dims) - ds.dims.keys()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 610 ms, sys: 162 ms, total: 772 ms\n",
"Wall time: 771 ms\n"
]
}
],
"source": [
"%%time\n",
"subsampleSF \\\n",
Expand All @@ -248,7 +293,7 @@
},
{
"cell_type": "markdown",
"id": "d33320e6-60cb-4601-b332-588428f4b5c4",
"id": "54ad1b3f",
"metadata": {},
"source": [
"Let us dive into the internals of the `SubsampleSelectionFunctionHMLE` class. We can extract more interesting information using them.\n",
Expand All @@ -258,8 +303,8 @@
},
{
"cell_type": "code",
"execution_count": 9,
"id": "19759265-afe2-46f0-86fa-febb3f5bfdf5",
"execution_count": 15,
"id": "11cb82c8",
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -310,18 +355,18 @@
},
{
"cell_type": "markdown",
"id": "33dad019-dda3-4a62-866a-e9da427e25e3",
"id": "f059f2b4",
"metadata": {},
"source": [
"As seen, the integral distribution of the number of sources (i.e. summed up by the magnitudes) have no empty healpixels (but does at higher HEALPix levels), so using the MLE as completeness estimate is fine. On the contrary, the distribution over the magnitudes have some empty bins at the highest and lowest values.\n",
"The integral distribution of the number of sources (i.e. summed over the magnitudes) has no empty healpixels (though it does at higher HEALPix levels), so using the MLE as the completeness estimate is fine. In contrast, the distribution over the magnitudes has some empty bins at the highest and lowest values.\n",
"\n",
"To see this, let us pick two directions on the sky: the poor pixel and the reach pixel."
"To see this, let us pick two directions on the sky: a poor pixel and a rich pixel."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d0ee182c-1891-4e66-8b65-29affe9f77a5",
"execution_count": 16,
"id": "562e84e0",
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -369,7 +414,7 @@
"ds0 = subsampleSF.ds\n",
"# Augment the non-observed data with the non-informed estimate\n",
"p0 = expit(ds0['logitp'].fillna(logit(0.5)))\n",
"# NB: The MLD data is always at the highest healpix level,\n",
"# NB: The MLE data is always at the highest healpix level,\n",
"# the one that was requested with the constructor of the `SubsampleSelectionFunction` class\n",
"\n",
"# The HMLE Dataset at the required healpix level\n",
Expand Down Expand Up @@ -407,35 +452,40 @@
},
{
"cell_type": "markdown",
"id": "6649d6aa-7cf5-474e-8c9f-0721daab5cc8",
"id": "ae644978",
"metadata": {},
"source": [
"## Use case 5\n",
"## Use case 2B - apply HMLE to your own data\n",
"\n",
"Here, we fetch the same data with the `SubsampleSelectionFunction` class, then pass it to the new class as the `xarray.Dataset`. To inform the class with parameters of request, a dictionary of the parameters must be provided also.\n",
"This example is exactly the same as use case 2 above (we fetched the data with the `SubsampleSelectionFunction` class), but here we pass it to the HMLE class as an `xarray.Dataset`. When you do this, you also need to explicitly provide the dictionary describing the binning (in healpix, in magnitude, etc.).\n",
"\n",
"This possibility, together with the compatibility with the `pandas.DataFrame` (the use case 4) is implemented for cases, if the user have their own parent catalogue and subsample. Please, explore the sources in order to understand how these things work."
"The data can also be passed as a `pandas.DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1cd12f33-2010-4268-af9f-51a16d0d7a35",
"execution_count": 10,
"id": "d9d3f0c0",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:29: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" if ds.dims.keys() - set([\"ipix\"]) == {\"g\", \"c\"}:\n",
"/Users/cantat/gaiaunlimited/evgeny/gaiaunlimited/src/gaiaunlimited/selectionfunctions/subsample.py:32: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.\n",
" diff = set(ds[\"logitp\"].dims) - ds.dims.keys()\n"
]
}
],
"source": [
"subsampleSF \\\n",
" = subsample.SubsampleSelectionFunction(subsample_query='has_xp_continuous', \\\n",
" file_name='dr3_xp_hpx6', hplevel_and_binning=inDict)\n",
"\n",
"subsampleSF_HMLE = subsample.SubsampleSelectionFunctionHMLE().use(subsampleSF.ds, inDict, z=0.99)"
]
},
{
"cell_type": "markdown",
"id": "aa70a77f-2cfb-4c1c-aceb-ebe01c50ec20",
"metadata": {},
"source": []
}
],
"metadata": {
Expand All @@ -454,7 +504,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
"version": "3.12.3"
}
},
"nbformat": 4,
Expand Down
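Note on the `z` parameter described in the notebook text above: the mapping from a confidence level to `z` can be checked with the standard library alone. This is a minimal sketch, not part of the commit; the helper name `z_from_confidence` is hypothetical and is not a GaiaUnlimited function. It takes the error rate α as one minus the confidence level and returns the (1-α/2) quantile of the standard normal distribution.

```python
from statistics import NormalDist


def z_from_confidence(confidence_level: float) -> float:
    """Return the (1 - alpha/2) standard-normal quantile for a confidence level.

    Hypothetical helper for illustration only; alpha = 1 - confidence_level.
    """
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must lie strictly between 0 and 1")
    alpha = 1.0 - confidence_level  # error rate
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# The two confidence levels quoted in the notebook
for cl in (0.95, 0.68):
    print(f"confidence level = {cl:.0%} -> z = {z_from_confidence(cl):.2f}")
```

Rounded to two decimals this gives 1.96 and 0.99, the values listed in the notebook. The result would then be passed as the `z` argument, as in the `use(subsampleSF.ds, inDict, z=0.99)` call shown in the diff.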
4 changes: 2 additions & 2 deletions src/gaiaunlimited/selectionfunctions/survey.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
class DR2SelectionFunction(fetch_utils.DownloadMixin):
"""DR2 selection function developed by the Gaiaverse team."""

__bibtex__ = """
__bibtex__ = r"""
@ARTICLE{2020MNRAS.497.4246B,
author = {{Boubert}, Douglas and {Everall}, Andrew},
title = "{Completeness of the Gaia verse II: what are the odds that a star is missing from Gaia DR2?}",
Expand Down Expand Up @@ -136,7 +136,7 @@ class DR3SelectionFunction(DR2SelectionFunction):
(nside=1024).
"""

__bibtex__ = """
__bibtex__ = r"""
@ARTICLE{2022MNRAS.509.6205E,
author = {{Everall}, Andrew and {Boubert}, Douglas},
title = "{Completeness of the Gaia verse - V. Astrometry and radial velocity sample selection functions in Gaia EDR3}",
Expand Down
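Note on the `r"""` change: the diff does not state the motivation, but a likely one is that BibTeX entries contain backslash macros, which a normal Python string literal treats as escape sequences. The sketch below illustrates the difference under that assumption. The sample entry `journal = {\mnras}` and the helper `escape_warnings` are hypothetical; the truncated diff does not show whether the real strings contain this macro.

```python
import warnings


def escape_warnings(source: str) -> list[str]:
    """Compile a snippet and return the names of any warnings it triggers.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<bibtex-demo>", "exec")
    return [w.category.__name__ for w in caught]


# A raw string keeps the backslash as written; a normal string needs it doubled.
raw_entry = r"journal = {\mnras}"
escaped_entry = "journal = {\\mnras}"
print(raw_entry == escaped_entry)

# Source text of two assignments: one normal literal, one raw literal.
# Inside the normal literal, "\m" is not a recognised escape sequence.
plain_source = 'entry = "journal = {\\mnras}"'
raw_source = 'entry = r"journal = {\\mnras}"'

print(escape_warnings(plain_source))  # warning category depends on the Python version
print(escape_warnings(raw_source))
```

Recent Python versions flag unrecognised escapes in normal literals at compile time, while the raw literal compiles without warnings, so marking the `__bibtex__` strings as raw keeps their content byte-for-byte as written.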
