Arrow backends

Conecta currently supports three arrow backends:

pyarrow, arro3 and nanoarrow

The default is pyarrow; you can change it with the return_backend parameter:

from conecta import read_sql

table = read_sql(
    "postgres://user:password@localhost:5432",
    queries=['select l_orderkey from lineitem'],
    partition_on='l_orderkey',
    partition_num=4,
    return_backend='arro3' # <--------- change here
)
Each return backend lives in its own Python package, which must be installed in your environment; otherwise an exception is raised.
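To see which backend packages are importable before calling read_sql, a small check along these lines may help. This is a minimal sketch, not part of Conecta: the helper name installed_backends is hypothetical, and the module names in BACKEND_MODULES (in particular the top-level arro3 namespace) are assumptions about how each package is imported.

```python
from importlib.util import find_spec

# Assumed mapping from return_backend value to an importable module name.
# These names are assumptions, not taken from the Conecta documentation.
BACKEND_MODULES = {
    "pyarrow": "pyarrow",
    "arro3": "arro3",
    "nanoarrow": "nanoarrow",
}


def installed_backends():
    """Return {backend: bool} telling whether each backend module can be found."""
    found = {}
    for backend, module in BACKEND_MODULES.items():
        try:
            # find_spec returns None when a top-level module is not installed.
            found[backend] = find_spec(module) is not None
        except (ImportError, ValueError):
            found[backend] = False
    return found


if __name__ == "__main__":
    for backend, ok in installed_backends().items():
        print(f"{backend}: {'installed' if ok else 'missing'}")
```

The check only looks the module up without importing it, so it stays cheap even for a heavy package like pyarrow.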

Why different arrow backends?

Libraries like pyarrow and nanoarrow are implementations of the Arrow specification, which is not tied to any programming language. Each has different properties and advantages that you can benefit from. If you come across an advantage that is not documented here, please contribute it!

arro3

The package size is significantly smaller (2.2-2.9MB) than pyarrow's (26-45MB), depending on the system.

It is less feature-complete, but perfectly fine if you are just loading data into, for example, Polars. Creating a Polars dataframe from an arro3 table can be measurably faster than from a pyarrow table on some datasets.

Additionally, releases typically come out faster and track the latest Arrow version.

import timeit

import polars  # must be in globals() for the timed statement below
from conecta import read_sql

times = 10
result = timeit.timeit(
    """
t = read_sql(
    "postgres://postgres:[email protected]:5400",
    queries=['select l_orderkey from lineitem10x'],
    partition_on='l_orderkey',
    partition_num=4,
    return_backend='arro3' # pyarrow
)

df = polars.from_arrow(t)
    """, globals=globals(), number=times,
)

print(result / times)

0.4460104392997891s for pyarrow

0.33837970568998571s for arro3

In this benchmark, arro3 takes ~24% less time than pyarrow.
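The ~24% figure can be reproduced from the two timings above. The sketch below shows the arithmetic; the helper names time_reduction and speedup are illustrative only. Note that "24% less time" and the speedup ratio are two different ways of expressing the same measurement.

```python
# Average seconds per run, copied from the results above.
pyarrow_s = 0.4460104392997891
arro3_s = 0.33837970568998571


def time_reduction(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


reduction = time_reduction(pyarrow_s, arro3_s)
ratio = speedup(pyarrow_s, arro3_s)
print(f"arro3 takes {reduction:.1%} less time ({ratio:.2f}x speedup)")
```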

pyarrow

It is the most popular backend, feature-rich and available in many environments; it is the default for compatibility reasons. Its main disadvantage is that it is a very heavy dependency, and you might not need all of its features.

nanoarrow

Returns an ArrayStream which your application might benefit from.