This repository has been archived by the owner on Feb 14, 2024. It is now read-only.

Fix issue 3 #4

Open · wants to merge 11 commits into master
Conversation

SarahAlidoost

closes #3

@SarahAlidoost (Author)

Loading data from pyPhenology as below:

observations, predictors = utils.load_test_data(
    name="vaccinium", phenophase="budburst"
)
print(observations.head(5))
                species  site_id  year  doy  phenophase
0  vaccinium corymbosum        1  1991  100         371
1  vaccinium corymbosum        1  1991  100         371
2  vaccinium corymbosum        1  1991  104         371
3  vaccinium corymbosum        1  1998  106         371
4  vaccinium corymbosum        1  1998  106         371

print(observations.shape)
(48, 5)

print(predictors.head(5))
   site_id  temperature  year  doy  latitude  longitude  daylength
0        1        13.10  1990  -65   42.5429   -72.2011      10.24
1        1        13.26  1990  -64   42.5429   -72.2011      10.20
2        1        12.30  1990  -63   42.5429   -72.2011      10.16
3        1        12.15  1990  -62   42.5429   -72.2011      10.11
4        1        13.00  1990  -61   42.5429   -72.2011      10.07

print(predictors.shape)
(4356, 7)

The data is preprocessed internally as:

observed_doy, temperature_array, doy_series = mu.misc.temperature_only_data_prep(
    observations, predictors, for_prediction=False
)
print(observed_doy.shape)
(48,)
print(temperature_array.shape)
(363, 48)
print(doy_series.shape)
(363,)
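A minimal sketch of this layout with random stand-in arrays (only the shapes mirror the printouts above; the actual values come from the vaccinium dataset):

```python
import numpy as np

# Random stand-ins that only mirror the shapes printed above:
# temperature_array is laid out as (n_doys, n_samples), i.e. one row per
# day of year and one column per observation.
n_samples, n_doys = 48, 363
observed_doy = np.random.randint(90, 110, size=n_samples)  # (48,)
temperature_array = np.random.randn(n_doys, n_samples)     # (363, 48)
doy_series = np.arange(n_doys)                             # (363,); real values can start below zero

assert temperature_array.shape == (len(doy_series), len(observed_doy))
```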

@SarahAlidoost (Author) commented Oct 17, 2023

@Peter9192 I found that applying X.flatten() was not correct, thanks for the hints 👍 I fixed it in this commit. I also checked the shapes of the variables; see my comment above.

Now there are still two issues:
1. Creating 'doy_series' as np.arange(X.shape[1]) assumes there are no gaps in DOYs and that each column in X corresponds to a DOY.
2. Distinguishing between sklearn and pyphenology input is implemented with the if statement isinstance(predictors, np.ndarray) and isinstance(observations, np.ndarray). In pyphenology, the processed temperature_array has a shape of (features, samples), whereas sklearn uses (samples, features). This is the reason for the X.T in the code.
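For clarity, a minimal sketch of this logic (a simplification, not the actual code; prepare_input is a made-up name):

```python
import numpy as np

def prepare_input(observations, predictors):
    # Made-up name; simplified sketch of the dispatch described above.
    if isinstance(predictors, np.ndarray) and isinstance(observations, np.ndarray):
        X, y = predictors, observations
        # sklearn convention: X is (n_samples, n_features) and every column
        # is one DOY, so this assumes consecutive DOYs without gaps.
        doy_series = np.arange(X.shape[1])
        # pyphenology expects (features, samples), hence the transpose.
        return y, X.T, doy_series
    raise TypeError("expected numpy arrays for the sklearn-style path")
```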

@Peter9192

creating 'doy_series': np.arange(X.shape[1]) means no gap in doys and each column in X corresponds to a doy

That's okay, I think. If we document it properly.

distinguishing between sklearn and pyphenology is implemented with the if statement isinstance(predictors, np.ndarray) and isinstance(observations, np.ndarray)

Perhaps instead of checking the types, we can check the shapes? For pyphenology, y (or observations) is 2d, whereas in the case of sklearn it should be 1d (only the doy column). Additionally/alternatively, for the predictors (X), we could see if there are column names like "temperature" and "latitude". Then you would know it should be pyphenology-format, because with sklearn, as you said, all columns correspond to DOY.
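Something like this (just a sketch; looks_like_pyphenology is a made-up name, not an existing function):

```python
import numpy as np

def looks_like_pyphenology(observations, predictors):
    """Made-up sketch of a shape/column-based check instead of a type check.

    pyphenology-style observations are 2d tables; sklearn-style y is a 1d
    array of DOYs. Named columns such as "temperature" or "latitude" also
    point at pyphenology-format predictors.
    """
    if np.asarray(observations).ndim == 2:
        return True
    columns = getattr(predictors, "columns", None)
    if columns is not None and {"temperature", "latitude"} & set(columns):
        return True
    return False
```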

In pyphenology, the processed temperature_array has a shape of (features, samples), whereas sklearn uses (samples, features). This is the reason for X.T in the code

Okay, I suppose that should be fine. Well spotted. Maybe copy this note to an inline comment?

@Peter9192

Nice progress! I'm wondering if it also works for some of the other pyphenology models now?

@SarahAlidoost (Author)

Perhaps instead of checking the types, we can check the shapes?

Looking at the existing checks in the base function, I think checking types works better than the other options.

For pyphenology, y (or observations) is 2d, whereas in the case of sklearn it should be 1d (only the doy column).

In the predict function, the only input argument is X, so a check on the shape of y cannot be implemented there.

Additionally/alternatively, for the predictors (X), we could see if there are column names like "temperature" and "latitude". Then you know it should be pyphenology-format, because with sklearn, as you said, all columns correspond to DOY.

"latitude" is not in self._required_data except for the Naive model. In addition, a solution like checking whether "temperature" is in predictors is not a sufficient check. In the pyphenology case, there is already a check for the names of the required data in predictors that raises an error; see validation.validate_predictors.

@SarahAlidoost (Author) commented Oct 18, 2023

Nice progress! I'm wondering if it also works for some of the other pyphenology models now?

Not for M1 and Naive. Do you want to include those as well?
Implementation is straightforward for Naive, assuming that X is latitude instead of temperature. But M1 needs temperature, doy, and daylength as predictors. How would we introduce the last one into X for sklearn?
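One idea (just a sketch, not implemented in this PR): stack the predictor blocks column-wise into a single X and remember the split point, so fit() can unpack them again:

```python
import numpy as np

# Made-up sketch: pack two (n_samples, n_doys) predictor blocks, e.g.
# temperature and daylength, side by side into one sklearn-style X and
# keep the split point to unpack them again inside fit().
n_samples, n_doys = 10, 365
temperature = np.random.randn(n_samples, n_doys)
daylength = np.random.rand(n_samples, n_doys) * 24.0

X = np.hstack([temperature, daylength])            # shape (10, 730)
temp_block, daylen_block = X[:, :n_doys], X[:, n_doys:]

assert X.shape == (n_samples, 2 * n_doys)
assert np.array_equal(temp_block, temperature)
assert np.array_equal(daylen_block, daylength)
```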

@Peter9192

How to introduce the last one in X to sklearn?

Not sure about that. I'm already very happy that it works for all the others! Let's wrap it up without M1 for now.

@SarahAlidoost marked this pull request as ready for review on November 2, 2023, 14:25
@Peter9192

Tried using this with the following:

import pandas as pd
import numpy as np
import geopandas as gpd
from pyPhenology import models

# Generate test data
# random observations for 10 points:
obs = gpd.GeoDataFrame(
    data = {
        'year': np.arange(2000, 2010), 
        'DOY_firstbloom': np.random.randint(120, 180, size=10),
        'geometry': gpd.GeoSeries.from_xy(*np.random.randn(2, 10))
        },
)

# dummy temperature data for each of these years/locations, for each DOY
get_temperature = lambda year, geometry: pd.Series(np.random.randn(365), index=np.arange(1, 366), name='temperature')
weather = obs.apply(lambda row: get_temperature(row.year, row.geometry), axis=1)


# This works
model = models.ThermalTime()
model.fit(observations=obs.DOY_firstbloom.values, predictors=weather.values)
model.get_params()

# This doesn't
model = models.ThermalTime()
model.fit(observations=obs.DOY_firstbloom, predictors=weather)
model.get_params()

That's why I still think checking based on something different than types is useful.
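A possible workaround (just a sketch, not code from this PR): coerce pandas objects to plain arrays before the type check, so obs.DOY_firstbloom and weather take the same path as obs.DOY_firstbloom.values and weather.values:

```python
import numpy as np

def coerce(observations, predictors):
    # Made-up sketch: np.asarray converts pandas Series/DataFrame to plain
    # ndarrays (and leaves ndarrays untouched), so the subsequent
    # isinstance(..., np.ndarray) dispatch would accept both input styles.
    y = np.asarray(observations)
    X = np.asarray(predictors)
    return y, X
```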

@Peter9192

Testing for other pyphenology models:

model_list = [ 
    models.ThermalTime(),
    models.Alternating(),
    models.Uniforc(),
    models.Unichill(),
    models.Linear(),
    models.MSB(),
    models.Sequential(),
    # models.M1(),  # Fails
    models.FallCooling(),
    models.Naive(),
]
for model in model_list:
    model.fit(observations=obs.DOY_firstbloom.values, predictors=weather.values)
    print(model.__class__.__name__, model.get_params())

Interestingly, it also works for Naive. But the output might not make sense... We might need to disable that?

@SarahAlidoost (Author)

Interestingly, it also works for Naive. But the output might not make sense... We might need to disable that?

I have implemented it for Naive as well, see my commit.

@SarahAlidoost (Author)

@Peter9192 I had another look at the implementations and your examples here. Now SklearnThermalTime() works with both array data and dataframes if they are sklearn compatible, see validation. For now, I added a docstring to SklearnThermalTime() that explains the data structure for the thermal_time model; I couldn't find a good way to add more checks for the X and y values. If the data structure is pyphenology's, the class ThermalTime() should be used as intended.
For the other models, the same can be implemented when addressing issue #2.

Successfully merging this pull request may close these issues: Avoid duplicate data preparation