Dynamic option for blas_set_num_threads()

I have been using Python with MKL BLAS, which defaults to dynamically setting the number of threads used by each BLAS function.  See:
https://software.intel.com/en-us/node/528546
https://software.intel.com/en-us/node/528547

In Julia, I expected comparable speed from OpenBLAS after calling `blas_set_num_threads(CPU_CORES)`, but instead I got a cold slap in the face.  I learned that OpenBLAS automatically uses one thread for small arrays, but otherwise uses exactly the number of threads specified by `blas_set_num_threads()`.  In other words, my understanding is that OpenBLAS has no equivalent of MKL BLAS' dynamic option.

As an example, on my Haswell CPU, where Julia reports `CPU_CORES` as 8, OpenBLAS' `dgemv()` function (from the `develop` branch) runs fastest with `blas_set_num_threads(2)`.  It is not practical for me to put a `blas_set_num_threads()` call before each (hidden) call to a BLAS function.
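For anyone who wants to check this on their own machine, here is a rough sketch of how the per-thread-count timing could be measured. It assumes a recent Julia (1.x), where the setter is spelled `LinearAlgebra.BLAS.set_num_threads` rather than `blas_set_num_threads`; the helper name `time_gemv` and the matrix size are made up for illustration, and the numbers will vary with hardware and BLAS build.

```julia
using LinearAlgebra  # provides the BLAS submodule in Julia 1.x

# Hypothetical helper: time y = A*x through gemv! for each candidate thread count.
# Returns a vector of (threads, average seconds per call) pairs.
# Note: this changes the global BLAS thread count as a side effect.
function time_gemv(n::Integer, thread_counts; reps::Integer = 20)
    A = rand(n, n)
    x = rand(n)
    y = zeros(n)
    results = Tuple{Int,Float64}[]
    for t in thread_counts
        BLAS.set_num_threads(t)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y)  # warm-up call, excluded from the timing
        secs = @elapsed begin
            for _ in 1:reps
                BLAS.gemv!('N', 1.0, A, x, 0.0, y)
            end
        end
        push!(results, (t, secs / reps))
    end
    return results
end

timings = time_gemv(2000, 1:4)
```

Sorting `timings` by the second element shows which thread count was fastest for that matrix size on that machine.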

For an Intel CPU with hyperthreading, OpenBLAS would perform better using the number of physical cores instead of logical cores, for example `blas_set_num_threads(CPU_CORES >> 1)`.  For portability, I suggest that Julia include a new constant: `CPU_PHYSICAL_CORES`.
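Until something like `CPU_PHYSICAL_CORES` exists, the halving heuristic could be wrapped roughly as follows. This is only a sketch: `physical_cores_guess` is a hypothetical name, and it assumes exactly two hardware threads per core whenever hyperthreading is on, which the caller has to know and pass in.

```julia
# Hypothetical helper: estimate physical cores from the logical core count.
# Assumes 2 hardware threads per physical core when `hyperthreaded` is true.
# The result is clamped to at least 1 so a single-core machine still gets one thread.
physical_cores_guess(logical::Integer; hyperthreaded::Bool = true) =
    hyperthreaded ? max(1, logical >> 1) : max(1, logical)
```

With the value of 8 reported on my machine, `physical_cores_guess(8)` gives 4.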

I realize that there is already an effort to provide MKL BLAS as a build and shipping option for Julia (https://github.com/JuliaLang/julia/issues/10969).  Assuming that OpenBLAS won't go away, however, it would be helpful if Julia provided a layer of abstraction to make OpenBLAS as performant and easy to use as MKL BLAS, and to make Julia code portable between builds with either one.

Specifically, I suggest that `blas_set_num_threads(-1)` cause Julia to use a dynamic number of threads.  When built with MKL BLAS, this would make MKL BLAS behave as if `MKL_DYNAMIC` were true.  When built with OpenBLAS, a lookup table for each BLAS function would determine the maximum number of threads to use.  There would be a different lookup table for each CPU architecture, such as Intel Haswell or Intel SandyBridge.  Preferably, the lookup tables would live in source code so that each user could tune them.  I would expect a lot of pull requests from the community for several months on the lookup tables for various CPUs, but I believe users would appreciate the speedup.
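To make the lookup-table idea concrete, here is a minimal sketch of what the OpenBLAS side might look like. Everything in it is hypothetical: the names `MAX_THREADS_TABLE` and `dynamic_threads`, the use of symbols for architectures and routines, and the fallback rule are my assumptions, not existing Julia or OpenBLAS API. The only table entry is the Haswell `dgemv` measurement mentioned above; real tables would have to be filled in from benchmarks.

```julia
# Hypothetical table: CPU architecture => (BLAS routine => maximum thread count).
# The single entry comes from the dgemv observation in this issue.
const MAX_THREADS_TABLE = Dict(
    :haswell => Dict(:dgemv => 2),
)

# Hypothetical policy: decide how many threads a BLAS routine should use.
#   requested == -1  -> dynamic mode: consult the table, capped by `available`
#   anything else    -> the caller asked for an explicit count, so honor it
function dynamic_threads(arch::Symbol, routine::Symbol,
                         requested::Integer, available::Integer)
    requested == -1 || return requested
    per_arch = get(MAX_THREADS_TABLE, arch, Dict{Symbol,Int}())
    # Routines or architectures missing from the table fall back to `available`.
    return min(available, get(per_arch, routine, available))
end
```

For example, `dynamic_threads(:haswell, :dgemv, -1, 8)` returns 2, while a routine with no entry, such as `dynamic_threads(:haswell, :dgemm, -1, 8)`, falls back to 8.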

