Understanding symbol conflicts between dynamic and static libraries
Like most posts in this blog, I wrote this one to help me understand a topic better. It might be wrong or incomplete, so please let me know if you find any mistakes. I might have completely misunderstood things too.
Here’s the problem I wanted to investigate:
LibTorch (the C++ library behind torch) is compiled and distributed as a shared library by the PyTorch team. The official Linux x86_64 distributions are statically linked against Intel MKL, a high-performance BLAS implementation. This can be verified by listing the symbols in the library and noticing that the mkl_blas symbols are there.
(venv) rstudio@3a53a82b44f1:~/data/torch/build-lantern/libtorch$ nm lib/libtorch_cpu.so | grep mkl_blas | head
0000000017d2e660 B .gomp_critical_user_mkl_blas_cgemm_omp_acopy_la_cs
0000000017d2e658 B .gomp_critical_user_mkl_blas_dgemm_omp_acopy_la_cs
0000000017d2e648 B .gomp_critical_user_mkl_blas_sgemm_omp_acopy_la_cs
0000000017d2e650 B .gomp_critical_user_mkl_blas_zgemm_omp_acopy_la_cs
000000000a1f1f60 T mkl_blas_avx2_cgemm_api_support
000000000a1dc080 T mkl_blas_avx2_cgemm_blk_info_bdz
000000000e662600 T mkl_blas_avx2_cgemm_cccopy_down2_ea
000000000e65ee00 T mkl_blas_avx2_cgemm_cccopy_right12_ea
000000000c70e800 T mkl_blas_avx2_cgemm_ccopy_down12_ea
000000000c70dc00 T mkl_blas_avx2_cgemm_ccopy_down2_ea
The official distributions of R, though, are dynamically linked against the reference BLAS for those platforms. sessionInfo() reports the BLAS library that R is using.
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
This means that when we load the torch package in R, we have two BLAS libraries loaded in memory: the reference BLAS and the MKL BLAS. Now, which one will be used when we multiply tensors in torch?
The order of initialization is:
- R is initialized and loads the reference BLAS
- We load the torch package, which loads the LibTorch shared library, which contains the MKL BLAS symbols
I’d expect that LibTorch calls would continue to use their statically linked MKL BLAS symbols and R calls would continue to use the reference BLAS. But that’s not what really happens.
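One way to check which copy of a symbol the process-wide lookup picks is to ask the dynamic loader directly. Below is a minimal sketch, assuming a POSIX system with dlsym() and dladdr(); the helper name library_providing is hypothetical, and the BLAS symbol name used in the usage note (dgemm_, the usual Fortran-style name) is an assumption about what the loaded libraries export.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns the path of the loaded object that provides
// `symbol` in the process-wide (global) lookup scope, or an empty string if
// the symbol cannot be resolved or attributed to a loaded object.
// Note: on glibc, RTLD_DEFAULT and dladdr() need _GNU_SOURCE, which g++
// defines by default.
std::string library_providing(const char* symbol) {
    // RTLD_DEFAULT uses the default search order, i.e. the same order the
    // loader uses to bind undefined references in the executable.
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr) return "";

    // dladdr() maps an address back to the object that contains it.
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) return "";
    return info.dli_fname;
}
```

Calling library_providing("dgemm_") before and after loading LibTorch would show which shared object the global lookup resolves that name to at each point. It says nothing about calls that were already bound inside a library at link time.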
Given this problem, it’s easier to understand what happens if we take R, LibTorch and BLAS out of the equation and create a simple example with two shared libraries that have the same symbol and are loaded in the same process in the order that we described above.
Simple experiment
So I built the following experiment:
- Create static library libA that implements print() (representing MKL BLAS).
- Create shared library libB that links statically to libA and calls print() (representing LibTorch).
- Create shared library libAShared that implements print() (representing the reference BLAS).
- Create an executable that links to libAShared dynamically and that loads libB dynamically at runtime (representing R).
I won’t go into too much detail about the code. Essentially we build libA, libB and libAShared with the CMake definitions below:
libs/CMakeLists.txt
add_library(A STATIC libA.cpp)
add_library(AShared libAShared.cpp)
add_library(B)
target_sources(B PUBLIC libB.cpp)
target_link_libraries(B PUBLIC A)
The code for libA, libB and libAShared is very simple:
// libs/libA.cpp
#include <iostream>
extern "C" void print() {
  std::cout << "Hello from libA!" << std::endl;
}
// libs/libB.cpp
extern "C" void print();
// only needed so the linker really includes the function
// print() from the statically linked libA.
extern "C" void print2() {
  print();
}
// libs/libAShared.cpp
#include <iostream>
extern "C" void print() {
  std::cout << "Hello from libAShared!" << std::endl;
}
The executable is built with the following cmake code:
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(TwoLibs)
# Option to build shared or static libraries
option(BUILD_SHARED_LIBS "Build shared libraries" ON)
# Add subdirectories
add_subdirectory(libs)
add_executable(binary main.cpp)
target_link_libraries(binary AShared)
And its code is:
#include <dlfcn.h>
#include <iostream>
extern "C" void print();
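int main() {
  // RTLD_FIRST is macOS-specific: dlsym() on this handle searches only libB.
  void *handle = dlopen("libs/libB.dylib", RTLD_FIRST);
  if (!handle) {
    std::cerr << "dlopen failed: " << dlerror() << std::endl;
    return 1;
  }
  typedef void (*print_t)();
  print_t b_print = (print_t)dlsym(handle, "print");
  if (!b_print) {
    std::cerr << "dlsym failed: " << dlerror() << std::endl;
    return 1;
  }
  std::cout << "Calling print() from libB.dylib" << std::endl;
  b_print();
  std::cout << "Calling print() from main" << std::endl;
  print();
  return 0;
}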
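The loading code above can be wrapped so that the handle is closed automatically and failures are reported. This is only a sketch under my own assumptions: the names DlCloser, LibraryHandle, open_library and lookup are hypothetical, and it uses the portable RTLD_NOW | RTLD_LOCAL flags rather than the macOS-only RTLD_FIRST from the experiment, so lookups through the handle may also search the library's dependencies.

```cpp
#include <dlfcn.h>
#include <iostream>
#include <memory>

// Deleter so that std::unique_ptr closes the library handle automatically.
// unique_ptr only invokes the deleter for non-null handles.
struct DlCloser {
    void operator()(void* handle) const { dlclose(handle); }
};
using LibraryHandle = std::unique_ptr<void, DlCloser>;

// Opens a shared library. On failure, returns an empty handle and prints
// the loader's error message to stderr.
LibraryHandle open_library(const char* path) {
    LibraryHandle handle(dlopen(path, RTLD_NOW | RTLD_LOCAL));
    if (!handle) {
        const char* err = dlerror();
        std::cerr << "dlopen failed: " << (err ? err : "unknown error") << '\n';
    }
    return handle;
}

// Looks up `name` through `handle` and casts it to a pointer to the
// requested function type; returns nullptr if the symbol is not found.
template <typename Fn>
Fn* lookup(const LibraryHandle& handle, const char* name) {
    return reinterpret_cast<Fn*>(dlsym(handle.get(), name));
}
```

With this, the body of main could load libB with open_library("libs/libB.dylib") and fetch the function with lookup<void()>(handle, "print"), checking both results for null before calling.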
I compiled this and … the expected happened, i.e. the globally defined print is the one from libAShared, since that library is dynamically linked to the executable and loaded along with it. When calling print2 from libB, which itself calls print, the print from the statically linked libA is called.
Calling print() from libB.dylib
Hello from libA!
Calling print() from main
Hello from libAShared!
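To make the "two definitions of the same name" situation visible without calling anything, one can compare the address that the global lookup returns with the address found through a specific library handle. A minimal sketch, assuming POSIX dlsym(); the helper name resolves_differently is hypothetical. The answer may differ between platforms, since the macOS and Linux dynamic loaders do not handle duplicate symbols in the same way.

```cpp
#include <dlfcn.h>

// Hypothetical helper: returns true when `name` is found both in the
// process-wide scope and through `handle`, but at different addresses,
// i.e. when two distinct definitions of the same symbol are loaded.
bool resolves_differently(void* handle, const char* name) {
    void* global_addr = dlsym(RTLD_DEFAULT, name);  // default search order
    void* local_addr = dlsym(handle, name);         // search via the handle
    return global_addr != nullptr && local_addr != nullptr &&
           global_addr != local_addr;
}
```

In the experiment above, I would expect resolves_differently(handle, "print") to return true for the libB handle, matching the two different messages printed.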
So what’s special about torch and R? Well I don’t know yet :S