fix #125, #141, #142, #143, #144
cbyrohl committed Feb 1, 2024
1 parent 8750896 commit e9d813f
Showing 12 changed files with 77 additions and 53 deletions.
13 changes: 6 additions & 7 deletions docs/derived_fields.md
@@ -2,8 +2,7 @@

!!! info

-    If you want to run the code below, consider using the demo data
-    as described [here](supported_datasets/tng.md#demo-data).
+    If you want to run the code below, consider downloading the [demo data](supported_datasets/tng.md#demo-data) or using the [TNGLab](supported_datasets/tng.md#tnglab) online.

Commonly during analysis, new quantities/fields need to be synthesized from one or more snapshot fields. For example, while the temperature, pressure, or entropy of gas is not stored directly in the snapshots, they can be computed from fields which are present on disk.

@@ -14,14 +13,14 @@ There are two ways to create new derived fields. For quick analysis, we can simp

``` py
from scida import load
-ds = load("TNG50-4_snapshot") # (1)!
+ds = load("./snapdir_030") # (1)!
gas = ds.data['gas']
kineticenergy = 0.5*gas['Masses']*(gas['Velocities']**2).sum(axis=1)
```

-1. In this example, we assume a dataset, such as the 'TNG50\_snapshot' test data set, that has its fields (*Masses*, *Velocities*) nested by particle type (*gas*)
+1. In this example, we assume a dataset, such as the [demo data set](supported_datasets/tng.md#demo-data), that has its fields (*Masses*, *Velocities*) nested by particle type (*gas*)

-In the example above, we define a new dask array called kineticenergy. Note that just like all other dask arrays and dataset fields, these fields are "virtual", i.e. only the graph of their construction is held in memory, which can be instantiated by applying the *.compute()* method.
+In the example above, we define a new dask array called "kineticenergy". Note that just like all other dask arrays and dataset fields, these fields are "virtual", i.e. only the graph of their construction is held in memory, which can be instantiated by applying the *.compute()* method.

We can also add this field from the example above to the existing ones in the dataset.
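
The assignment itself is collapsed in the diff view below. Purely as an editor's sketch (the field name "KineticEnergy" is a hypothetical choice, following the container-assignment pattern shown in the faq.md hunk further down):

``` py
# attach the derived dask array as a new field on the gas container (sketch)
gas["KineticEnergy"] = kineticenergy
# fields stay "virtual" until evaluated explicitly
result = gas["KineticEnergy"].compute()
```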

@@ -41,7 +40,7 @@ For this purpose, **field recipes** are available. An example of such a recipe is
import numpy as np

from scida import load
-ds = load("TNG50-4_snapshot")
+ds = load("./snapdir_030")

@ds.register_field("stars") # (1)!
def VelMag(arrs, **kwargs):
@@ -109,7 +108,7 @@ def GroupDistance(arrs, snap=None):
Finally, we just need to import the *fielddefs* object (if we have defined it in another file) and merge it with a dataset that we loaded:

``` py
-ds = load("TNG50-4_snapshot")
+ds = load("./snapdir_030")
ds.data.merge(fielddefs)
```
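
The body of the *VelMag* recipe is collapsed in the diff view above. Purely as an editor's illustration, and assuming the recipe computes a velocity magnitude from the container's fields (the body below is an assumption, not the file's actual code), a complete recipe might look like:

``` py
import dask.array as da

@ds.register_field("stars")  # register the recipe on the "stars" container
def VelMag(arrs, **kwargs):
    # "arrs" exposes the fields of the container the recipe is registered on
    v = arrs["Velocities"]
    return da.sqrt((v ** 2).sum(axis=1))  # per-particle velocity magnitude

# the new field should then be accessible as ds.data["stars"]["VelMag"]
```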

4 changes: 2 additions & 2 deletions docs/faq.md
@@ -18,7 +18,7 @@ Please note that all fields within a container are expected to have the same sha
``` py
from scida import load
import dask.array as da
-ds = load('TNG50-4_snapshot')
+ds = load("./snapdir_030")
array = da.zeros_like(ds.data["PartType0"]["Density"])
ds.data['PartType0']["zerofield"] = array
```
@@ -27,7 +27,7 @@ As we operate with dask, make sure to cast your array accordingly. For example,
Alternatively, if you have another dataset loaded, you can assign fields from one to another:

``` py
-ds2 = load('TNG50-4_snapshot')
+ds2 = load("./snapdir_030")
ds.data['PartType0']["NewDensity"] = ds2.data['PartType0']["Density"]
```

7 changes: 3 additions & 4 deletions docs/halocatalogs.md
@@ -5,18 +5,17 @@ Cosmological simulations are often post-processed with a substructure identifica

!!! info

-    If you want to run the code below, consider using the demo data
-    as described [here](supported_datasets/tng.md#demo-data).
+    If you want to run the code below, consider downloading the [demo data](supported_datasets/tng.md#demo-data) or using the [TNGLab](supported_datasets/tng.md#tnglab) online.

## Adding and using halo/galaxy catalog information
Currently, we support the usual FOF/Subfind combination and format. Their presence will be automatically detected and the catalogs will be loaded into *ds.data* as shown below.

``` py
from scida import load
-ds = load("TNG50-4_snapshot") # (1)!
+ds = load("./snapdir_030") # (1)!
```

-1. In this example, we assume a dataset, such as the 'TNG50\_snapshot' test data set, that has its fields (*Masses*, *Velocities*) nested by particle type (*gas*)
+1. In this example, we assume a dataset, such as the [demo data set](supported_datasets/tng.md#demo-data), that has its fields (*Masses*, *Velocities*) nested by particle type (*gas*)

The dataset passed to *load()* does not itself contain information about the FoF/Subfind outputs, as they are commonly saved in a separate folder or hdf5 file. For typical folder structures of GADGET/AREPO style simulations, an attempt is made to automatically discover and add such information. The path to the catalog can otherwise be passed explicitly to *load()* via the *catalog=...* keyword.
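
As an editor's illustration of the *catalog* keyword mentioned above (the catalog path is a hypothetical example; adjust it to your directory layout):

``` py
from scida import load

# explicitly point load() at the group catalog (path below is an assumed example)
ds = load("./snapdir_030", catalog="./groups_030")
```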

11 changes: 7 additions & 4 deletions docs/largedatasets.md
@@ -3,7 +3,8 @@

!!! info

-    If you want to run the code below, you need access to the full [TNG](https://www.tng-project.org) simulation dataset.
+    If you want to run the code below, you need access to (or a download of) the full [TNG](https://www.tng-project.org) simulation dataset.
+    The easiest way to access all TNG data sets is to use the [TNGLab](https://www.tng-project.org/data/lab/), which supports [scida](https://www.tng-project.org/data/forum/topic/742/scida-analysis-toolkit-example-within-tng-lab/).

Until now, we have applied our framework to a very small simulation.
However, what if we are working with a very large data set
@@ -22,7 +23,8 @@ the `mass.sum().compute()` will chunk the operation up in a way that the task ca

```pycon
>>> from scida import load
->>> ds = load("TNG50_snapshot")
+>>> sim = load("TNG50-1")
+>>> ds = sim.get_dataset(99)
```

Before we start, let's enable a progress indicator from dask
@@ -33,7 +35,7 @@
>>> ProgressBar().register()
```

-Let's benchmark this operation on our location machine.
+Let's benchmark this operation on our local machine.

```pycon
>>> %time ds.data["PartType0"]["Masses"].sum().compute()
@@ -108,7 +110,8 @@ We configure the job and node resources before submitting the job via the `scale
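>>> # Editor's sketch, not part of the commit: the cluster object used below is
>>> # typically created with dask-jobqueue's SLURMCluster; queue and resource
>>> # values here are placeholders to adapt to your cluster.
>>> from dask.distributed import Client
>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(queue="somequeue", cores=32, memory="128GB")
>>> cluster.scale(jobs=4)  # submit the SLURM jobs that back the dask workers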
>>> client = Client(cluster)

>>> from scida import load
->>> ds = load("TNG50_snapshot")
+>>> sim = load("TNG50-1")
+>>> ds = sim.get_dataset(99)
>>> %time ds.data["PartType0"]["Masses"].sum().compute()
CPU times: user 1.27 s, sys: 152 ms, total: 1.43 s
Wall time: 21.4 s
10 changes: 7 additions & 3 deletions docs/series.md
@@ -2,16 +2,20 @@

!!! info

-    If you want to run the code below, you will need to have an AREPO simulation available.
-    Specify the path in load() to the base directory of the simulation, which contains the "output" sub directory.

+    If you want to run the code below, you need a folder containing multiple scida datasets as subfolders.
+    Specify the path in load() to the base directory of the series.
+    The example below uses an AREPO simulation, TNG50-4, as a series of snapshots.
+    This simulation can be downloaded from the [TNG website](https://www.tng-project.org/data/)
+    or directly accessed online in the [TNGLab](https://www.tng-project.org/data/lab/).

In the tutorial section, we have only considered individual data sets.
Often data sets are given in a series (e.g. multiple snapshots of a simulation, multiple exposures in a survey).
Loading this as a series provides convenient access to all contained objects.

``` pycon
>>> from scida import load
->>> series = load("TNGvariation_simulation") #(1)!
+>>> series = load("TNG50-4") #(1)!
```

1. Pass the base path of the simulation.
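
As an editor's illustration (not part of the commit), an individual member of the series can then be retrieved with *get_dataset*, the call the TNG documentation below also uses:

``` pycon
>>> ds = series.get_dataset(30)  # pick one snapshot out of the series (sketch)
```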
28 changes: 13 additions & 15 deletions docs/supported_data.md
@@ -3,19 +3,19 @@
The following table shows a selection of supported datasets. The table is not exhaustive, but should give an idea of the range of supported datasets.
If you want to use a dataset that is not listed here, read on [here](dataset_structure.md) and consider opening an issue or contact us directly.

-| Name | Support | Description |
-|------|---------|-------------|
-| [AURIGA](https://wwwmpa.mpa-garching.mpg.de/auriga/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |
-| [EAGLE](https://icc.dur.ac.uk/Eagle/) | :material-check-all: | Cosmological galaxy formation *simulations* |
-| [FIRE2](https://wetzel.ucdavis.edu/fire-simulations/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |
-| [FLAMINGO](https://flamingo.strw.leidenuniv.nl/) | :material-check-all: | Cosmological galaxy formation *simulations* |
-| [Gaia](https://www.cosmos.esa.int/web/gaia/dr3) | :material-database-check-outline:[^1] | *Observations* of a billion nearby stars |
-| [Illustris](https://www.illustris-project.org/) | :material-check-all: | Cosmological galaxy formation *simulations* |
-| [LGalaxies](customs/lgalaxies.md) | :material-check-all: | Semi-analytical model for [Millenium](https://wwwmpa.mpa-garching.mpg.de/galform/virgo/millennium/) simulations |
-| [SDSS DR16](https://www.sdss.org/dr16/) | :material-check: | *Observations* for millions of galaxies |
-| [SIMBA](http://simba.roe.ac.uk/) | :material-check-all: | Cosmological galaxy formation *simulations* |
-| [TNG](./supported_datasets/tng.md) | :material-check-all: | Cosmological galaxy formation *simulations* |
-| [TNG-Cluster](https://www.tng-project.org/cluster/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |
+| Name | Support | Description |
+|------|---------|-------------|
+| [AURIGA](https://wwwmpa.mpa-garching.mpg.de/auriga/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |
+| [EAGLE](https://icc.dur.ac.uk/Eagle/) | :material-check-all: | Cosmological galaxy formation *simulations* |
+| [FIRE2](https://wetzel.ucdavis.edu/fire-simulations/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |
+| [FLAMINGO](https://flamingo.strw.leidenuniv.nl/) | :material-check-all: | Cosmological galaxy formation *simulations* |
+| [Gaia](https://www.cosmos.esa.int/web/gaia/dr3) | :material-database-check-outline:<sup>[\[download\]](https://www.tng-project.org/data/obs/)</sup> | *Observations* of a billion nearby stars |
+| [Illustris](https://www.illustris-project.org/) | :material-check-all: | Cosmological galaxy formation *simulations* |
+| [LGalaxies](customs/lgalaxies.md) | :material-check-all: | Semi-analytical model for [Millennium](https://wwwmpa.mpa-garching.mpg.de/galform/virgo/millennium/) simulations |
+| [SDSS DR16](https://www.sdss.org/dr16/) | :material-check: | *Observations* for millions of galaxies |
+| [SIMBA](http://simba.roe.ac.uk/) | :material-check-all: | Cosmological galaxy formation *simulations* |
+| [TNG](./supported_datasets/tng.md) | :material-check-all: | Cosmological galaxy formation *simulations* |
+| [TNG-Cluster](https://www.tng-project.org/cluster/) | :material-check-all: | Cosmological zoom-in galaxy formation *simulations* |



@@ -28,5 +28,3 @@ A :material-database-check-outline: checkmark indicates support for converted HD
As of now, two underlying file formats are supported: hdf5 and zarr. Multi-file hdf5 is supported: a directory is passed as *path* that contains only hdf5 files of the pattern *prefix.XXX.hdf5*, where *prefix* is determined automatically and *XXX* is a contiguous list of integers indicating the order of the hdf5 files to be merged. The hdf5 files are expected to have the same structure, and all fields, i.e. hdf5 datasets, will be concatenated along their first axis.

Support for FITS is work in progress; also see [here](tutorial/observations.md#fits-files) for a proof-of-concept.
-
-[^1]: The HDF5 version of GAIA DR3 is available [here](https://www.tng-project.org/data/obs/).
29 changes: 27 additions & 2 deletions docs/supported_datasets/tng.md
@@ -10,7 +10,7 @@ available at [www.tng-project.org](https://www.tng-project.org/).
Many of the examples in this documentation use the TNG50-4 simulation.
In particular, we make a snapshot and group catalog available to run
these examples. You can download and extract the snapshot and its group
-catalog from the TNG50-4 test data:
+catalog from the TNG50-4 test data using the following commands:

``` bash
wget https://heibox.uni-heidelberg.de/f/dc65a8c75220477eb62d/?dl=1 -O snapshot.tar.gz
@@ -19,6 +19,31 @@ wget https://heibox.uni-heidelberg.de/f/ff27fb6975fb4dc391ef/?dl=1 -O catalog.ta
tar -xvf catalog.tar.gz
```

+These files are exactly [the same files](https://www.tng-project.org/api/TNG50-4/files/snapshot-30/)
+that can be downloaded from the official IllustrisTNG data release.

The snapshot and group catalog should be placed in the same folder.
-Then you can load the snapshot with `ds = load("./snapdir_030")`. The group catalog should automatically be detected,
+Then you can load the snapshot with `ds = load("./snapdir_030")`.
+If you are executing the code from a different folder, you need to adjust the path accordingly.
+The group catalog should automatically be detected when available in the same parent folder as the snapshot;
+otherwise, you can also pass the path to the catalog via the `catalog` keyword to `load()`.

+## TNGLab
+
+The [TNGLab](https://www.tng-project.org/data/lab/) is a web-based analysis platform running a [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) instance with dedicated computational resources and access to all TNG data sets, providing a convenient way to run analysis code on them. As TNGLab supports scida, it is a great way to get started and to run the examples.
+
+In order to run the examples which use the [demo data](#demo-data), replace
+
+``` py
+ds = load("./snapdir_030")
+```
+
+with
+
+``` py
+sim = load("TNG50-4")
+ds = sim.get_dataset(30)
+```
6 changes: 4 additions & 2 deletions docs/tutorial/observations.md
@@ -3,7 +3,7 @@
This package is designed to aid in the efficient analysis of large datasets, such as GAIA DR3.

!!! info "Tutorial dataset"
-    In the following, we will subset from the [GAIA data release 3](https://www.cosmos.esa.int/web/gaia/dr3). The reduced dataset contains 100000 randomly selected entries only. The reduced dataset can be downloaded [here](https://heibox.uni-heidelberg.de/f/3b05069b1b524c0fa57e/?dl=1).
+    In the following, we will subset from the [GAIA data release 3](https://www.cosmos.esa.int/web/gaia/dr3). The reduced dataset contains only 100000 randomly selected entries; it can be downloaded [here](https://www.tng-project.org/files/obs/GAIA/gaia_dr3_mini.hdf5).
Check [Supported Datasets](../supported_data.md) for an incomplete list of supported datasets
and requirements for support of new datasets.
A tutorial for a cosmological simulation can be found [here](simulations.md).
@@ -17,7 +17,7 @@ It uses the [dask](https://dask.org/) library to perform computations, which has

## Loading an individual dataset

-Here, we choose the [GAIA data release 3](https://www.cosmos.esa.int/web/gaia/dr3) as an example.
+Here we use the [GAIA data release 3](https://www.cosmos.esa.int/web/gaia/dr3) as an example.
+In particular, we support the [single HDF5 version of DR3](https://www.tng-project.org/data/obs/).

The dataset is obtained in HDF5 format as used at ITA Heidelberg. We intentionally select a small subset of the data, which keeps the data size small and easy to work with; we demonstrate how to handle larger data sets at a later stage.
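
As an editor's sketch (not part of the commit), loading the reduced dataset linked above could look as follows, assuming the local filename matches the download:

``` py
from scida import load

# load the reduced GAIA DR3 subset (filename assumed from the download link)
ds = load("gaia_dr3_mini.hdf5")
```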

2 changes: 1 addition & 1 deletion docs/tutorial/simulations.md
@@ -42,7 +42,7 @@ First, we load the dataset using the convenience function `load()` that will det

```pycon title="Loading a dataset"
>>> from scida import load
->>> ds = load("snapdir_030")
+>>> ds = load("./snapdir_030")
>>> ds.info() #(1)!
class: ArepoSnapshotWithUnitMixinAndCosmologyMixin
source: /vera/u/byrohlc/Downloads/snapdir_030
6 changes: 2 additions & 4 deletions docs/units.md
@@ -2,17 +2,15 @@

!!! info

-    If you want to run the code below, consider using the demo data
-    as described [here](supported_datasets/tng.md#demo-data).
-
+    If you want to run the code below, consider downloading the [demo data](supported_datasets/tng.md#demo-data) or using the [TNGLab](supported_datasets/tng.md#tnglab) online.

## Loading data with units

Loading data sets with

``` py
from scida import load
-ds = load("TNG50-4_snapshot")
+ds = load("./snapdir_030")
```

will automatically attach units to the data. This can be deactivated by passing "units=False" to the load function.
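
As an editor's sketch of the *units* keyword described above (the snapshot path follows the demo data convention used throughout these docs):

``` py
from scida import load

# deactivate automatic unit attachment
ds = load("./snapdir_030", units=False)
```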