Large datasets: support for Dask arrays? #249

@Hoeze

Hi, I tried training an ExplainableBoostingRegressor using Dask arrays, but I keep running into the following error:

ERROR:interpret.utils.all:Could not unify data of type: <class 'dask.array.core.Array'>

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-b5e1d33190e1> in <module>
      1 model = create_model()
      2 
----> 3 fold_models, train_preds, valid_preds = model_cv(model)

<ipython-input-48-178d0374bff6> in model_cv(model)
     17 
---> 18             fold_model = sklearn.clone(model).fit(x_train_fold, y_train_fold)
     19 

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y)
    744         # TODO: PK don't overwrite self.feature_names here (scikit-learn rules), and it's also confusing to
    745         #       user to have their fields overwritten.  Use feature_names_out_ or something similar
--> 746         X, y, self.feature_names, _ = unify_data(
    747             X, y, self.feature_names, self.feature_types, missing_data_allowed=False
    748         )

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
    325         msg = "Could not unify data of type: {0}".format(type(data))
    326         log.error(msg)
--> 327         raise ValueError(msg)
    328 
    329     new_labels = unify_vector(labels)

ValueError: Could not unify data of type: <class 'dask.array.core.Array'>

Each of my folds is a 2D array consisting of 56 features and occupying ~16GB of memory.
Calling model.fit(X.compute(), y.compute()) runs out of memory after some time, probably because Joblib copies the data around unnecessarily.
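
For illustration, here is a minimal sketch of one possible way to reduce the peak memory used when turning a lazy array into the NumPy array that fit accepts. It copies the data block by block into a single preallocated array with a smaller dtype. The helper name materialize_in_blocks, the block size, and the float32 choice are all assumptions made for this example, not part of interpret or Dask. It also assumes the array has a fully known shape, and it does not change whatever copies interpret or Joblib make internally.

```python
import numpy as np


def materialize_in_blocks(lazy_array, block_rows=100_000, dtype=np.float32):
    """Copy a 2D array-like into one preallocated NumPy array, block by block.

    Hypothetical helper: works for any object that exposes .shape, supports
    row slicing, and can be converted with np.asarray (NumPy arrays do, and
    Dask arrays with known chunk sizes should as well).
    """
    n_rows, n_cols = lazy_array.shape
    out = np.empty((n_rows, n_cols), dtype=dtype)
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        # For a lazy array, np.asarray triggers computation of this slice only.
        out[start:stop] = np.asarray(lazy_array[start:stop])
    return out


# Small NumPy stand-in for a large lazy array (made-up data).
demo = np.arange(12, dtype=np.float64).reshape(6, 2)
dense = materialize_in_blocks(demo, block_rows=4)
print(dense.dtype, dense.shape)
```

With the folds from the report, the call might look like model.fit(materialize_in_blocks(x_train_fold), y_train_fold.compute()). Whether this is enough to avoid the crash depends on how much additional memory the training step itself needs, which this sketch does not address.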

Labels: enhancement (New feature or request)
