natalia-araujo edited this page Mar 14, 2024 · 7 revisions

Modelling Calls

Scale modeling performs an exhaustive search for the best models in time series data, providing information about the fit of the best models, their cross-validation accuracy measures, and many other outputs that are usually of interest. Using the API to send requests allows for multiple requests at once; however, all datasets must contain data with the same frequency.

faas.validate_models()

function validate_models(data_list, date_variable, date_format, model_spec, project_name)

Sends a request to 4intelligence's Forecast as a Service (FaaS) validation API.

Parameters:

  • data_list: Dict[str, pd.DataFrame]

    Dictionary of pandas dataframes and their respective keys to be sent to the API

  • date_variable: str

    Name of the variable to be considered as the timesteps

  • date_format: str

    Format of date_variable following datetime notation (See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

  • model_spec: dict

    Dictionary containing arguments required for modeling. The model specifications will be the same for all datasets in the same project. The model_spec expects the following specifications:

    • n_steps: forecast horizon that will be used in the cross-validation (if 3, 3 months ahead; if 12, 12 months ahead, etc.); It should be an integer greater than or equal to 1. Typically, 'n_steps+n_windows-1' should not exceed 30% of the length of your data.

    • n_windows: how many windows the size of ‘Forecast Horizon’ will be evaluated during cross-validation (CV); It should be an integer greater than or equal to 1. Typically, 'n_steps+n_windows-1' should not exceed 30% of the length of your data.

    • log (Optional): if True, applies a log transformation to the data (only variables with all values greater than 0 will be log-transformed); A logical parameter: True or False (Default: True).

    • seas.d (Optional): if True, it includes seasonal dummies in every estimation; A logical parameter: True or False (Default: True).

    • n_best (Optional): number of best models to be chosen for each feature selection method; Default is 20.

    • accuracy_crit (Optional): which criterion should be used to measure the accuracy of the forecast during the CV; Options: "MPE","MAPE", "WMAPE" or "RMSE" (Default: "MAPE").

    • exclusions (Optional): restrictions on features in the same model (which variables should not be included in the same model); Default is 'exclusions = []', otherwise it should receive a list of lists, each inner list containing the variables that must not appear together in a model.

    • golden_variables (Optional): features that must be included in, at least, one model (separate or together); Default is 'golden_variables = []', otherwise it should be a list with the golden variables.

    • fill_forecast (Optional): if True, it enables forecasting explanatory variables in order to avoid NAs in future values; A logical parameter: True or False (Default is False).

    • cv_summary (Optional): determines whether 'mean' or 'median' will be used to calculate the summary statistic of the accuracy measure over the CV windows; Options: "mean" or "median" (Default is "mean").

    • selection_methods (Optional): specifies which selection methods should be used for feature selection and whether explanatory variables should be chosen in order to avoid collinearity;

      • lasso: True if our method of feature selection using Lasso should be applied,
      • rf: True if our method of feature selection using Random Forest should be applied,
      • corr: True if our method of feature selection using Pearson correlation filter should be applied,
      • apply.collinear: True if you wish our feature selection to avoid collinearity among the explanatory variables in the models - this is equivalent to setting ["corr","rf","lasso","no_reduction"]. False or "" otherwise.
    • lags (Optional): defines a dictionary of lags of explanatory variables to be tested in the dataset. For example, if you wish to apply lags 1, 2 and 3 to the explanatory variables 'x1' and 'x2' from your dataset, this parameter should be specified as lags = {"x1": [1,2,3], "x2": [1,2,3]}. However, if you wish to test lags 1, 2 and 3 for all explanatory variables in the dataset(s), you can define lags = {"all": [1,2,3]}. If, for example, the user defines lags = {"all": [1,2,3], "x1": [1,2,3,4,5,6]}, lags 1, 2 and 3 will be applied to all explanatory variables except 'x1', for which lags 1 through 6 will be tested. The default is lags = {}.

    • allowdrift (Optional): if True, drift terms are considered in ARIMA models; A logical parameter: True or False (Default: True).

    • user_model (Optional): defines one or more models that should be included among the available models. In addition to the variables specified, any variable added during regular modeling will also appear in the models created from user_model. A lagged variable (if defined in lags) can also be included among the user_model variables.

  • project_name: str

    Name of the project defined by the user; it must be at most 50 characters long

Returns: API return code, and errors and/or warnings if any were found.
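As a sketch of how these parameters fit together, the example below builds a monthly dataset and a model_spec dictionary and prepares a validation request. The dataset key, variable names, and values are illustrative assumptions; the faas import and call are commented out because they require an installed and authenticated 4intelligence client.

```python
import pandas as pd
# import faas  # 4intelligence FaaS client; requires installation and authentication

# Monthly dataset with a date column, a dependent variable, and explanatory
# variables; the names "fiscal", "x1", "x2" are illustrative assumptions.
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=60, freq="MS").strftime("%Y-%m-%d"),
    "fiscal": range(60),
    "x1": range(60),
    "x2": range(60),
})

data_list = {"dataset_1": df}  # dataset key is an illustrative choice

model_spec = {
    "n_steps": 3,                  # forecast horizon used in cross-validation
    "n_windows": 6,                # number of CV windows
    "log": True,
    "seas.d": True,
    "n_best": 20,
    "accuracy_crit": "MAPE",
    "exclusions": [["x1", "x2"]],  # never combine x1 and x2 in one model
    "golden_variables": [],
    "fill_forecast": False,
    "cv_summary": "mean",
    "selection_methods": {
        "lasso": True,
        "rf": True,
        "corr": True,
        "apply.collinear": True,
    },
    "lags": {"all": [1, 2, 3]},    # test lags 1-3 for all explanatory variables
}

# Here n_steps + n_windows - 1 = 8, well under 30% of the 60 observations.
# faas.validate_models(
#     data_list=data_list,
#     date_variable="date",
#     date_format="%Y-%m-%d",
#     model_spec=model_spec,
#     project_name="example_project",
# )
```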

faas.run_models()

function run_models(data_list, date_variable, date_format, model_spec, project_name, skip_validation=False)

Sends a request to 4intelligence's Forecast as a Service (FaaS) for modeling.

Parameters

  • data_list: Dict[str, pd.DataFrame]

    Dictionary of pandas dataframes and their respective keys to be sent to the API

  • date_variable: str

    Name of the variable to be considered as the timesteps

  • date_format: str

    Format of date_variable following datetime notation (See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

  • model_spec: dict

    Dictionary containing arguments required for modeling. The model specifications will be the same for all datasets in the same project. The model_spec expects the following specifications:

    • n_steps: forecast horizon that will be used in the cross-validation (if 3, 3 months ahead; if 12, 12 months ahead, etc.); It should be an integer greater than or equal to 1. Typically, 'n_steps+n_windows-1' should not exceed 30% of the length of your data.

    • n_windows: how many windows the size of ‘Forecast Horizon’ will be evaluated during cross-validation (CV); It should be an integer greater than or equal to 1. Typically, 'n_steps+n_windows-1' should not exceed 30% of the length of your data.

    • log (Optional): if True, applies a log transformation to the data (only variables with all values greater than 0 will be log-transformed); A logical parameter: True or False (Default: True).

    • seas.d (Optional): if True, it includes seasonal dummies in every estimation; A logical parameter: True or False (Default: True).

    • n_best (Optional): number of best models to be chosen for each feature selection method; Default is 20.

    • accuracy_crit (Optional): which criterion should be used to measure the accuracy of the forecast during the CV; Options: "MPE","MAPE", "WMAPE" or "RMSE" (Default: "MAPE").

    • exclusions (Optional): restrictions on features in the same model (which variables should not be included in the same model); Default is 'exclusions = []', otherwise it should receive a list of lists, each inner list containing the variables that must not appear together in a model.

    • golden_variables (Optional): features that must be included in, at least, one model (separate or together); Default is 'golden_variables = []', otherwise it should be a list with the golden variables.

    • fill_forecast (Optional): if True, it enables forecasting explanatory variables in order to avoid NAs in future values; A logical parameter: True or False (Default is False).

    • cv_summary (Optional): determines whether 'mean' or 'median' will be used to calculate the summary statistic of the accuracy measure over the CV windows; Options: "mean" or "median" (Default is "mean").

    • selection_methods (Optional): specifies which selection methods should be used for feature selection and whether explanatory variables should be chosen in order to avoid collinearity;

      • lasso: True if our method of feature selection using Lasso should be applied,
      • rf: True if our method of feature selection using Random Forest should be applied,
      • corr: True if our method of feature selection using Pearson correlation filter should be applied,
      • apply.collinear: True if you wish our feature selection to avoid collinearity among the explanatory variables in the models - this is equivalent to setting ["corr","rf","lasso","no_reduction"]. False or "" otherwise.
    • lags (Optional): defines a dictionary of lags of explanatory variables to be tested in the dataset. For example, if you wish to apply lags 1, 2 and 3 to the explanatory variables 'x1' and 'x2' from your dataset, this parameter should be specified as lags = {"x1": [1,2,3], "x2": [1,2,3]}. However, if you wish to test lags 1, 2 and 3 for all explanatory variables in the dataset(s), you can define lags = {"all": [1,2,3]}. If, for example, the user defines lags = {"all": [1,2,3], "x1": [1,2,3,4,5,6]}, lags 1, 2 and 3 will be applied to all explanatory variables except 'x1', for which lags 1 through 6 will be tested. The default is lags = {}.

    • allowdrift (Optional): if True, drift terms are considered in ARIMA models; A logical parameter: True or False (Default: True).

  • project_name: str

    Name of the project defined by the user; it must be at most 50 characters long

  • skip_validation: bool

    Whether the validation step should be bypassed (Default: False)

Returns: API return code, and errors and/or warnings if any were found.
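A minimal sketch of a run_models call, assuming only the two required model_spec keys (the optional ones fall back to their defaults). The dataset key and variable names are illustrative, and the faas call is commented out since it requires an authenticated client.

```python
import pandas as pd
# import faas  # 4intelligence FaaS client; requires installation and authentication

# Monthly dataset; "gdp" and "x1" are illustrative variable names.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=48, freq="MS").strftime("%Y-%m-%d"),
    "gdp": range(48),
    "x1": range(48),
})

# Only the required keys; n_steps + n_windows - 1 = 14, within 30% of 48 rows.
model_spec = {"n_steps": 6, "n_windows": 9}

# Validation runs first by default; set skip_validation=True only after a
# successful faas.validate_models call with the same inputs.
# faas.run_models(
#     data_list={"dataset_gdp": df},
#     date_variable="date",
#     date_format="%Y-%m-%d",
#     model_spec=model_spec,
#     project_name="gdp_project",
#     skip_validation=False,
# )
```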

Validation Error Table

The following table provides the meaning of each error code returned when calling 4intelligence's validation API (through the functions validate_models or run_models with the recommended settings).

| status_code | error_message | valid_options |
| --- | --- | --- |
| 001 | You have inserted a non-supported date format | year/month/day: "%Y/%m/%d", "%y/%m/%d"; year/day/month: "%Y/%d/%m", "%y/%d/%m"; day/month/year: "%d/%m/%Y", "%d/%m/%y"; month/day/year: "%m/%d/%Y", "%m/%d/%y"; year-month-day: "%Y-%m-%d", "%y-%m-%d"; year-day-month: "%Y-%d-%m", "%y-%d-%m"; day-month-year: "%d-%m-%Y", "%d-%m-%y"; month-day-year: "%m-%d-%Y", "%m-%d-%y" |
| 002 | You have inserted a non-character object | A character object defining the variable/parameter of interest |
| 003 | Your dependent variable does not exist in dataset | A dependent variable name that exists in your dataset |
| 004 | You have inserted a variable name that is not in the dataset | The unique name of the date variable in your dataset(s) |
| 005 | You have inserted a variable that cannot be converted to date, maybe it contains footnotes? | The unique name of the date variable in your dataset(s) |
| 006 | Conversion of date_variable to 'data_tidy' failed | data_tidy |
| 007 | data_tidy was not converted to Date type in ALL datasets | Date object |
| 008 | data_tidy was not converted to Date type in SOME datasets | Date object |
| 009 | date_variable was not converted to %Y-%m-%d | Check https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. E.g.: "%m/%d/%Y" |
| 010 | You have inserted a non-logical variable | TRUE or FALSE |
| 011 | You have inserted a non-integer variable | Any integer number greater than zero |
| 012 | You have inserted a number smaller/equal to zero and/or non-integer | Any integer number greater than zero |
| 013 | You have inserted an invalid option | MAPE, MPE, RMSE, WMAPE, MASE |
| 014 | You have inserted an invalid option | AIC, BIC |
| 015 | You have inserted an invalid option | mean, median |
| 016 | You have inserted a non-list object | A list object |
| 017 | Some/all invalid variable(s) in exclusions, lags or user_model | Variables that exist in your dataset or list() |
| 018 | Variables inside exclusions must be unique | Unique names of variables that exist in your dataset |
| 019 | Some/all invalid variable(s) in golden_variables | Variables that exist in your dataset or c() |
| 020 | You have chosen an(some) invalid method(s) | c("","corr","rf","lasso","no_reduction") or simply TRUE/FALSE |
| 021 | You have inserted a dummy or categorical variable as dependent variable | Numeric non-dummy dependent variable |
| 022 | NA | Please report this problem to support@4intelligence.com.br |
| 023 | Please add more observations to your dataset | Number of observations should be greater than (according to frequency): "daily" -> 180, "weekly" -> 52, "fortnightly" -> 24, "monthly" -> 36, "bimonthly" -> 24, "quarterly" -> 24, "half-year" -> 24, "annual" -> 12 |
| 024 | There is more than one observation per frequency period, make sure that you do not have more than one | One observation per frequency period |
| 025 | There are too many missing values in every row | Data frames with fewer missing values per row |
| 026 | n_steps and n_windows cover more than 50% of the size of your data | [(n_steps + n_windows - 1) / nrows_training] < 0.5 |
| 027 | Select at least one method for feature selection (set it as TRUE) | corr = TRUE; lasso = TRUE; rf = TRUE |
| 028 | Lags defined in 'lags' must be numeric, greater than 0 and integers | Numeric values such as 1, 2, 3, ... |
| 029 | Invalid variable name | Variable name conflicts with lag variable (starts with 'l' and lag number) chosen by user |
| 030 | Multiple data frequency | Datasets in data_list contain more than 1 frequency |
| 031 | Exclusion with single element | At least one group of exclusion contains only one element |
| 032 | Invalid prefix for variable name ('d4i_' or 'do_') | At least one variable name in datasets of data_list starts with 'd4i_' or 'do_' |

Validation Warning Table

The following table provides the meaning of each warning code returned when calling 4intelligence's validation API (through the functions validate_models or run_models with the recommended settings).

| status_code | warning_message | valid_options |
| --- | --- | --- |
| 001 | One or more variables are dummies or categorical variables and will be disregarded in exclusions set | A list without dummy or categorical variables |
| 002 | One or more variables are dummies or categorical variables and will be disregarded as golden variables | A vector without dummy or categorical variables |
| 003 | One or more variables are dummies or categorical variables and will be disregarded as variables to apply lag | A list without dummy or categorical variables |
| 004 | One or more lag variables may not be included due to minimum data points requirement, linear dependency or being removed during pre-processing | Lag list with fewer lags or dataset with more observations |
| 005 | No forecast period provided | Additional dates in dataset to perform forecast |
| 006 | Missing values in forecast period lead to shorter or no projections | Explanatory variables with projections |

Utility Functions

faas.download_zip()

function download_zip(project_id, path, filename, verbose)

Makes a request and downloads all files from a project created in FaaS Modelling or Model Update.

Parameters

  • project_id: str

    ID of the project to be downloaded; the project must have been concluded

  • path: str

    Folder to which the files will be downloaded

  • filename: str

    Name of the zipped file (without the .zip extension)

  • verbose: bool

    If the message indicating the path of the downloaded file should be printed

Returns: The API response
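An illustrative sketch of a download_zip call. The project id is hypothetical, and the faas call is commented out because it requires an authenticated client and a concluded project; the doc states the .zip extension is omitted from filename, so it is presumably appended on save.

```python
import os
# import faas  # 4intelligence FaaS client; requires installation and authentication

path = "./forecasts"      # folder to which the files will be downloaded
filename = "faas_output"  # saved as faas_output.zip (extension presumably appended)

# "a1b2c3" is a hypothetical id of a concluded project.
# faas.download_zip(project_id="a1b2c3", path=path, filename=filename, verbose=True)

# With verbose=True, the printed message would point at a path like:
expected = os.path.join(path, filename + ".zip")
```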

faas.list_projects()

function list_projects(return_dict)

Retrieves a list of the user's projects previously sent to FaaS for modelling or updating.

Parameters

  • return_dict: bool

    If a dictionary should be returned instead of a dataframe

Returns: A dataframe or dictionary containing information about the user's projects
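An illustrative sketch of working with list_projects output. The call itself is commented out because it requires authentication, and the columns shown below ("id", "name", "status") are assumptions for illustration, not the API's documented schema.

```python
import pandas as pd
# import faas  # 4intelligence FaaS client; requires installation and authentication

# projects = faas.list_projects(return_dict=False)      # pandas dataframe
# projects_dict = faas.list_projects(return_dict=True)  # dictionary instead

# Stand-in dataframe with hypothetical columns, to show typical filtering:
projects = pd.DataFrame({
    "id": ["a1b2c3", "d4e5f6"],
    "name": ["example_project", "another_project"],
    "status": ["success", "running"],
})
concluded = projects[projects["status"] == "success"]
```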