If you run cross_validate(n_jobs > 1) with your own estimator, be aware that the estimator is copied for every job.
    scores = parallel(
        delayed(_fit_and_score)(
            clone(estimator), X, y, scorers, train, test, verbose, None,
            fit_params, return_train_score=return_train_score,
            return_times=True)
        for train, test in cv.split(X, y, groups))
    estimator_type = type(estimator)
    if estimator_type in (list, tuple, set, frozenset):
        return estimator_type([clone(e, safe=safe) for e in estimator])
    elif not hasattr(estimator, 'get_params'):
        if not safe:
            return copy.deepcopy(estimator)
    # Several lines omitted
    new_object_params = estimator.get_params(deep=False)
    for name, param in six.iteritems(new_object_params):
        new_object_params[name] = clone(param, safe=False)
get_params, in turn, takes the parameter names from the signature of __init__:

    init = getattr(cls.__init__, 'deprecated_original', cls.__init__)
So basically, whatever you list in your estimator’s __init__ signature will be DEEP copied. I once initialized my estimator with a 6GB matrix and ran cross_validate(n_jobs = 10)… My workstation exploded that day.
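To see the deep copy in action without a 6GB matrix, here is a minimal sketch. HeavyEstimator and its matrix parameter are made-up names for illustration, and the array is deliberately small; the only library call being demonstrated is sklearn.base.clone.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class HeavyEstimator(BaseEstimator):
    """Hypothetical estimator that (unwisely) takes its data in __init__."""

    def __init__(self, matrix=None):
        self.matrix = matrix

    def fit(self, X, y=None):
        return self


big = np.zeros((1000, 1000))  # stand-in for the heavy matrix
est = HeavyEstimator(matrix=big)

# This is what cross_validate does before handing the estimator to a job.
copied = clone(est)

is_same_object = copied.matrix is est.matrix
shares_memory = np.shares_memory(copied.matrix, est.matrix)
print(is_same_object, shares_memory)
```

Both checks should come out false: every clone carries its own full copy of the matrix, so memory grows with the number of clones alive at the same time.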
If possible, move all those heavy parameters to your fit method. After all, we can call cross_validate(n_jobs > 1, fit_params=kwargs).
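A rough sketch of that refactoring, looking only at the clone side of the story. LightEstimator, alpha and the matrix keyword are hypothetical names, and fit is called directly instead of through cross_validate to keep the example independent of the scikit-learn version.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class LightEstimator(BaseEstimator):
    """Hypothetical estimator: only cheap settings live in __init__."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y=None, matrix=None):
        # The heavy object arrives at fit time and is stored by reference.
        self.matrix_ = matrix
        return self


big = np.zeros((1000, 1000))  # stand-in for the heavy matrix
light = LightEstimator(alpha=0.5)

# clone() only sees what get_params() reports, i.e. the __init__ arguments.
param_names = list(light.get_params())
cloned = clone(light)

X = np.zeros((10, 2))
light.fit(X, matrix=big)
print(param_names, light.matrix_ is big)
```

Since the matrix is no longer an __init__ parameter, clone has nothing heavy to deep copy.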
It’s not enough. fit_params will be copied as well!
Consider numpy.memmap instead. See these discussions:
- scikit-learn joblib bug: multiprocessing pool self.value out of range for ‘i’ format code, only with large numpy arrays
- Working with numerical data in shared memory (memmaping)
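For reference, a minimal numpy.memmap sketch: write an array to a file once, then reopen it read-only so the data stays on disk until it is indexed. The file name, dtype and shape are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "heavy_matrix.dat")
shape = (1000, 100)

# Create the backing file and fill it once.
writer = np.memmap(path, dtype="float64", mode="w+", shape=shape)
writer[:] = np.arange(shape[0] * shape[1], dtype="float64").reshape(shape)
writer.flush()
del writer  # release the writable map

# Reopen read-only; a raw memmap file stores no dtype or shape,
# so both have to be supplied again.
reader = np.memmap(path, dtype="float64", mode="r", shape=shape)
first_row_sum = float(reader[0].sum())
print(type(reader).__name__, first_row_sum)
```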
Update: Objects in fit_params won’t be copied; only references to them will be. By default, the backend is MultiprocessingBackend, so there would be $n$ python3 processes in the background. However, my Ubuntu task manager showed that each python3 process had taken a big chunk of memory while the total memory usage had not boomed. It only looked as if each process had copied the parameter objects.
On the other hand, a memory map is like an in-memory index of its .joblib file (and it’s much smaller!). The memory map is read first to find the positions of the data in that .joblib file; then the corresponding positions are accessed.
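A small sketch of that round trip, assuming an uncompressed dump (joblib cannot memory-map compressed files). The file name and array are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np

data = np.arange(1_000_000, dtype="float64")
path = os.path.join(tempfile.mkdtemp(), "heavy.joblib")

# Dump without compression so the array can be memory-mapped later.
joblib.dump(data, path)

# Load as a read-only memory map instead of reading everything into RAM.
mapped = joblib.load(path, mmap_mode="r")
is_memmap = isinstance(mapped, np.memmap)

# Only the slice that is actually indexed needs to be read from disk.
chunk = np.asarray(mapped[10:13])
print(is_memmap, chunk)
```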
The mmap_mode parameter of joblib.load(filename, mmap_mode) actually means:
- If None, do not use memory mapping
- If not None, use memory mapping with the mode given by mmap_mode. Available modes are the same as those of numpy.memmap:
  - 'r': Open existing file for reading only.
  - 'r+': Open existing file for reading and writing.
  - 'w+': Create or overwrite existing file for reading and writing.
  - 'c': Copy-on-write: assignments affect data in memory, but changes are not saved to disk. The file on disk is read-only.
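To make the difference between 'r' and 'c' concrete, here is a hedged sketch using joblib.load on a tiny array. The file name is arbitrary, and the behaviour shown is what I would expect from the mode descriptions above.

```python
import os
import tempfile

import joblib
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "modes.joblib")
joblib.dump(np.zeros(4, dtype="int64"), path)

# 'r': the map is read-only, so assignment is rejected.
read_only = joblib.load(path, mmap_mode="r")
try:
    read_only[0] = 99
    write_failed = False
except ValueError:
    write_failed = True

# 'c': assignment works in memory but never reaches the file.
cow = joblib.load(path, mmap_mode="c")
cow[0] = 99

# A fresh, ordinary load still sees the original contents.
reloaded = joblib.load(path)
print(write_failed, int(cow[0]), int(reloaded[0]))
```

'c' is handy when a job wants to scribble on its input without disturbing the file that the other jobs are mapping.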