Dynamic option for blas_set_num_threads() #213

Closed
@hiccup7

I have been using Python with MKL BLAS, which defaults to dynamically setting the number of threads used by each BLAS function. See:
https://software.intel.com/en-us/node/528546
https://software.intel.com/en-us/node/528547

In Julia, I expected comparable speed from OpenBLAS after calling blas_set_num_threads(CPU_CORES), but I was disappointed. I learned that OpenBLAS automatically uses one thread for small arrays, but otherwise uses exactly the number of threads specified by blas_set_num_threads(). In other words, my understanding is that OpenBLAS has no equivalent of MKL BLAS' dynamic option.

As an example, on my Haswell CPU, where Julia reports CPU_CORES as 8, OpenBLAS' dgemv() function (from the develop branch) runs fastest with blas_set_num_threads(2). It is not practical or realistic for me to put blas_set_num_threads() before each (hidden) call to a BLAS function.

For an Intel CPU with hyperthreading, OpenBLAS would perform better with the number of physical cores rather than the number of logical cores, for example blas_set_num_threads(CPU_CORES >> 1). For portability, I suggest that Julia include a new constant: CPU_PHYSICAL_CORES.
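To make the CPU_CORES >> 1 idea concrete, here is a minimal sketch in Python rather than Julia. The helper name physical_core_guess and its threads_per_core parameter are hypothetical, not part of Julia or OpenBLAS, and the halving is only right under the assumption that every core exposes the same number of hardware threads.

```python
import os


def physical_core_guess(logical_cores, threads_per_core=2):
    """Estimate the physical core count from a logical core count.

    Assumes uniform SMT: every physical core exposes `threads_per_core`
    hardware threads (2 for Intel hyperthreading). This is a heuristic,
    not a query of the real CPU topology.
    """
    if logical_cores < 1 or threads_per_core < 1:
        raise ValueError("core counts must be positive")
    # Never report fewer than one core, even for odd or tiny inputs.
    return max(1, logical_cores // threads_per_core)


# os.cpu_count() reports logical cores and may return None.
logical = os.cpu_count() or 1
suggested_threads = physical_core_guess(logical)
```

On a machine without hyperthreading, passing threads_per_core=1 leaves the count unchanged, which is why a real CPU_PHYSICAL_CORES constant would be more portable than a hard-coded shift.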

I realize that there is already an effort to provide MKL BLAS as a build and shipping option for Julia (JuliaLang/julia#10969). Assuming that OpenBLAS won't go away, however, it would be helpful if Julia provided a layer of abstraction to make OpenBLAS as performant and easy to use as MKL BLAS, and to make Julia code portable between builds with either one.

Specifically, I suggest that blas_set_num_threads(-1) cause Julia to use a dynamic number of threads. When built with MKL BLAS, this would make MKL BLAS behave as if MKL_DYNAMIC were true. When built with OpenBLAS, a lookup table for each BLAS function would determine the maximum number of threads to use. There would be a different lookup table for each CPU architecture, such as Intel Haswell or Intel SandyBridge. Preferably, the lookup table would be in source code so that each user could tune it. I would expect a lot of pull requests from the community over several months to tune the lookup tables for various CPUs, but I believe users would appreciate the speedup.
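The lookup-table idea could look roughly like the following sketch, written in Python for brevity. Everything here is hypothetical: the table name MAX_THREADS, the function resolve_threads, and the architecture keys are illustrative, and the only entry taken from this issue is the Haswell dgemv measurement of 2 threads. Unlisted architectures or routines are assumed to fall back to the physical core count.

```python
# Hypothetical table: architecture -> BLAS routine -> maximum thread count.
# Only the Haswell dgemv entry comes from the measurement reported above;
# anything not listed falls back to the physical core count.
MAX_THREADS = {
    "haswell": {"dgemv": 2},
}

DYNAMIC = -1  # proposed sentinel meaning "choose the thread count for me"


def resolve_threads(requested, arch, routine, physical_cores):
    """Return the thread count to use for one BLAS call."""
    if requested == DYNAMIC:
        cap = MAX_THREADS.get(arch, {}).get(routine, physical_cores)
        # Never exceed the physical cores, never drop below one thread.
        return max(1, min(cap, physical_cores))
    if requested < 1:
        raise ValueError("thread count must be positive or -1")
    # An explicit request is honored as given, as happens today.
    return requested
```

For example, resolve_threads(-1, "haswell", "dgemv", 4) yields 2, while an untuned routine such as dgemm on the same machine yields 4. A fuller table would probably also key on problem size, since the best thread count depends on array dimensions.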

Labels: performance (must go faster), speculative (whether the change will be implemented is speculative), upstream (the issue is with an upstream dependency, e.g. LLVM)
