Arrow backends
Conecta currently supports three Arrow backends: pyarrow, arro3 and nanoarrow.
The default one is pyarrow; you can change it with return_backend:
from conecta import read_sql
table = read_sql(
"postgres://user:password@localhost:5432",
queries=['select l_orderkey from lineitem'],
partition_on='l_orderkey',
partition_num=4,
return_backend='arro3' # <--------- change here
)
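If the set of installed Arrow libraries varies between environments, the value passed to return_backend could be chosen at runtime. The following is only a sketch: the helper names (`is_installed`, `pick_backend`), the module-name mapping, and the assumption that the selected backend's package has to be importable are illustrative and not part of Conecta's API.

```python
from importlib.util import find_spec

# Assumed mapping from a return_backend value to the module it relies on.
BACKEND_MODULES = {
    "pyarrow": "pyarrow",
    "arro3": "arro3",
    "nanoarrow": "nanoarrow",
}


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def pick_backend(preferred=("arro3", "pyarrow"), installed=is_installed) -> str:
    """Return the first preferred backend that is installed.

    Falls back to 'pyarrow', the documented default, if none is found.
    """
    for name in preferred:
        if installed(BACKEND_MODULES[name]):
            return name
    return "pyarrow"


backend = pick_backend()
print(backend)
```

The resulting string can then be passed as `return_backend=backend` in the `read_sql` call.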
Why different arrow backends?
Libraries like pyarrow or nanoarrow are implementations of the Arrow specification, which is not tied to any programming language. Each has different properties and advantages that you can benefit from. If you come across an advantage that is not documented here, please contribute it!
arro3
The package size is significantly smaller (2.2-2.9MB) than pyarrow's (26-45MB), depending on the system.
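To get a rough idea of the footprint on your own system, the installed size of each distribution can be summed from its package metadata. This is a sketch using only the standard library; `installed_size_mb` is a hypothetical helper, the distribution name `arro3-core` is an assumption about how arro3 is installed, and the installed size will generally differ from the download sizes quoted above.

```python
from importlib.metadata import PackageNotFoundError, distribution
from pathlib import Path


def installed_size_mb(dist_name: str):
    """Return the on-disk size of an installed distribution in MB, or None if absent."""
    try:
        dist = distribution(dist_name)
    except PackageNotFoundError:
        return None
    total_bytes = 0
    # dist.files is None when the distribution has no file listing metadata.
    for entry in dist.files or []:
        path = Path(entry.locate())
        if path.is_file():
            total_bytes += path.stat().st_size
    return total_bytes / 1_000_000


# Distribution names are assumptions; adjust them to what is installed.
for name in ("pyarrow", "arro3-core", "nanoarrow"):
    size = installed_size_mb(name)
    print(name, "not installed" if size is None else f"{size:.1f} MB")
```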
It is less feature-complete, but is perfectly fine if you are just loading data into e.g. Polars. Creating a Polars dataframe from an arro3 table can be measurably faster than from a pyarrow table on some datasets.
Additionally, new releases typically arrive sooner and are paired with the latest Arrow version.
import timeit
import polars  # must be in globals() so the timed snippet can call polars.from_arrow
from conecta import read_sql
times = 10
result = timeit.timeit(
"""
t = read_sql(
"postgres://postgres:[email protected]:5400",
queries=['select l_orderkey from lineitem10x'],
partition_on='l_orderkey',
partition_num=4,
return_backend='arro3' # or 'pyarrow'
)
df = polars.from_arrow(t)
""", globals=globals(), number=times,
)
print(result / times)
0.4460104392997891s for pyarrow
0.33837970568998571s for arro3
In this benchmark, arro3 takes ~24% less time than pyarrow.
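The ~24% figure is the relative reduction in wall-clock time. The arithmetic can be reproduced from the two timings above; the variable names are illustrative only.

```python
# Mean seconds per run, copied from the benchmark results above.
pyarrow_s = 0.4460104392997891
arro3_s = 0.33837970568998571

time_saved = 1 - arro3_s / pyarrow_s  # fraction of wall-clock time saved
speedup = pyarrow_s / arro3_s         # how many times faster arro3 ran

print(f"time saved: {time_saved:.1%}")  # about 24%
print(f"speedup: {speedup:.2f}x")       # about 1.32x
```

Expressed as a speedup ratio, the same measurement means arro3 ran roughly 1.32 times as fast as pyarrow in this single benchmark.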
pyarrow
It is the most popular backend, feature-rich and available in many environments; it is the default backend for compatibility reasons. The main disadvantage is that it is a very heavy dependency, and you might not need all of its features.
nanoarrow
Returns an ArrayStream, which your application might benefit from.