Reticulate handling of NA’s

Author

Daniel Falbel

Published

July 26, 2023

There’s room for improving how reticulate handles missing values when converting between R and Python (pandas) data.frames. This documents highlights the current behavior and takes inspiration on Pandas <-> Arrow casts to propose improvements.

Current behavior

The following data.frame contains NA’s in the 4 most used data types in R. We use tibble::as_tibble() to show the data.frame as it better displays the data types.

df <- data.frame(
  bool = c(NA, TRUE, FALSE),
  int = c(NA, 1L, 2L),
  num = c(NA, 0.1, 0.2),
  char = c(NA, "a", "b")
)
tibble::as_tibble(df)

# A tibble: 3 × 4
  bool    int   num char 
  <lgl> <int> <dbl> <chr>
1 NA       NA  NA   <NA> 
2 TRUE      1   0.1 a    
3 FALSE     2   0.2 b

Now casting into a pandas data frame. We see the following:

p_df <- reticulate::r_to_py(df)

For bool: NA became TRUE which is quite unexpected.
For int: NA became the largest possible integer. This is also unexpected. However pandas default integer data type does not support NA’s and in theory one must cast them to float.
For num: we got a NaN, which is the default missing value in Pandas, even though an experimental pd.NA value.
For char, we get NA as a character, which is also very unlikely to be the best way to handle the conversion.

Getting values back into R

Casting back the pandas data.frame into R, we get:

reticulate::py_to_r(p_df)

   bool int num char
1  TRUE  NA  NA   NA
2  TRUE   1 0.1    a
3 FALSE   2 0.2    b

It mostly works, besides that the missing boolean value is lost.

How others do it

reticulate is not the only library that needs to convert tables with such types that happen to contain missing values into pandas data frames. We looked into how Arrow and Polars work in such cases. Both libraries support missing values via an explicit NULL value. Polars is based on Arrow, so there shouldn’t exist many differences compared to Arrow besides some additional handling of NaNs.

import pyarrow as pa

b = pa.array([None, True, False])
i = pa.array([None, 1, 2])
n = pa.array([None, 0.1, 0.2])
c = pa.array([None, "a", "b"])

at = pa.table([b, i, n, c], names=["bool", "int", "num", "char"])

And this is the result of the cast:

p_df = at.to_pandas()
p_df

    bool  int  num  char
0   None  NaN  NaN  None
1   True  1.0  0.1     a
2  False  2.0  0.2     b

p_df.dtypes

bool     object
int     float64
num     float64
char     object
dtype: object

Note that:

bool and char were cast into object types. The object data type in Pandas is used for columns with mixed types. One possible downside of this approach is that NAs become bool after any comparison. It also cast to False (or 0) when you sum a column containing a None.
```
p_df['bool'] & True
```
```
0    False
1     True
2    False
Name: bool, dtype: bool
```
```
p_df['bool'] | True
```
```
0    False
1     True
2     True
Name: bool, dtype: bool
```
```
p_df['bool'].sum()
```
```
1
```
```
p_df['bool'].isna()
```
```
0     True
1    False
2    False
Name: bool, dtype: bool
```
int has been cast into a ‘float64’ type. This reflects Pandas default approach too, since integer values can’t represent NaNs (the default missing value). This approach seems reasonable for reticulate to consider - specially considering how R is flexible in general with numerics and integers.
num: the default missing value (NaN) is used.

What happens with round trip casts

It seems that from the above it’s hard to get back the same arrow table that was first converted. Let’s try:

pa.Table.from_pandas(p_df)

pyarrow.Table
bool: bool
int: double
num: double
char: string
----
bool: [[null,true,false]]
int: [[null,1,2]]
num: [[null,0.1,0.2]]
char: [[null,"a","b"]]

This works quite fine. The only information that has been lost is the data type of the integer column, that has been transformed into a float.

TODO: figure out how arrow does this. Does it walk trough pandas object columns, and if it’s constant besides the NULL, it uses that data type?

Using Pandas nullable data types

Arrow supports using Pandas nullable data types with:

import pandas as pd
dtype_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.uint8(): pd.UInt8Dtype(),
    pa.uint16(): pd.UInt16Dtype(),
    pa.uint32(): pd.UInt32Dtype(),
    pa.uint64(): pd.UInt64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.float32(): pd.Float32Dtype(),
    pa.float64(): pd.Float64Dtype(),
    pa.string(): pd.StringDtype(),
}

p_df_nullable = at.to_pandas(types_mapper=dtype_mapping.get)
p_df_nullable

    bool   int   num  char
0   <NA>  <NA>  <NA>  <NA>
1   True     1   0.1     a
2  False     2   0.2     b

p_df_nullable.dtypes

bool    boolean
int       Int64
num     Float64
char     string
dtype: object

In such cases, the round trip cast also works perfectly:

pa.Table.from_pandas(p_df_nullable)

pyarrow.Table
bool: bool
int: int64
num: double
char: string
----
bool: [[null,true,false]]
int: [[null,1,2]]
num: [[null,0.1,0.2]]
char: [[null,"a","b"]]

Arrow -> Pandas -> R?

One question that came up is what happens if we take the Arrow table (that natively) containing missing values, cast into pandas and then into R. Can reticulate correctly infer data types?

tibble::as_tibble(reticulate::py$p_df)

# A tibble: 3 × 4
  bool        int   num char     
  <list>    <dbl> <dbl> <list>   
1 <NULL>      NaN NaN   <NULL>   
2 <lgl [1]>     1   0.1 <chr [1]>
3 <lgl [1]>     2   0.2 <chr [1]>

It seems reasonable, but we don’t simplify the columns types which Arrow does nicely.

What if we get the Pandas table that uses the nullable data types:

reticulate::py$p_df_nullable

Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs

                                                            bool
1 <BooleanArray>\n[<NA>, True, False]\nLength: 3, dtype: boolean
2                                                           <NA>
3                                                           <NA>
                                                    int
1 <IntegerArray>\n[<NA>, 1, 2]\nLength: 3, dtype: Int64
2                                                  <NA>
3                                                  <NA>
                                                           num
1 <FloatingArray>\n[<NA>, 0.1, 0.2]\nLength: 3, dtype: Float64
2                                                         <NA>
3                                                         <NA>
                                                       char
1 <StringArray>\n[<NA>, 'a', 'b']\nLength: 3, dtype: string
2                                                      <NA>
3                                                      <NA>

Doesn’t work.

Polars

We don’t expect many differences in behavior between Arrow and Polars, so we just quickly print the conversion results:

import polars as pl
pl_df = pl.DataFrame({
  'bool': [None, True, False],
  'int':  [None, 1, 2],
  'num':  [None, 0.1, 0.2],
  'char': [None, "a", "b"],
})
print(pl_df)

shape: (3, 4)
┌───────┬──────┬──────┬──────┐
│ bool  ┆ int  ┆ num  ┆ char │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ bool  ┆ i64  ┆ f64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ true  ┆ 1    ┆ 0.1  ┆ a    │
│ false ┆ 2    ┆ 0.2  ┆ b    │
└───────┴──────┴──────┴──────┘

Converting into pandas

p_df = pl_df.to_pandas()
p_df

    bool  int  num  char
0   None  NaN  NaN  None
1   True  1.0  0.1     a
2  False  2.0  0.2     b

p_df.dtypes

bool     object
int     float64
num     float64
char     object
dtype: object

And getting back into polars:

print(pl.DataFrame(p_df))

shape: (3, 4)
┌───────┬──────┬──────┬──────┐
│ bool  ┆ int  ┆ num  ┆ char │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ bool  ┆ f64  ┆ f64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ true  ┆ 1.0  ┆ 0.1  ┆ a    │
│ false ┆ 2.0  ┆ 0.2  ┆ b    │
└───────┴──────┴──────┴──────┘

Same as with Arrow.

PySpark

We also looked at how Spark casts its DataFrames, that supports nullable data types into Pandas data frames and back.

import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

s_df = spark.createDataFrame([
  (None, None, None, None),
  (True, 1, 0.1, "a"),
  (False, 2, 0.2, "b")], 
  ["bool", "int", "num", "char"]
  )
s_df

s_df.head(3)

[Row(bool=None, int=None, num=None, char=None), Row(bool=True, int=1, num=0.1, char='a'), Row(bool=False, int=2, num=0.2, char='b')]

Now collect into a pandas data frame:

p_df = s_df.toPandas()
p_df

    bool  int  num  char
0   None  NaN  NaN  None
1   True  1.0  0.1     a
2  False  2.0  0.2     b

p_df.dtypes

bool     object
int     float64
num     float64
char     object
dtype: object

It looks like PySpark is using the same behavior as Arrow (and Polars) which is to use objects for booleans and chars, representing the null with Python None. Integers are converted into floats and then use NaN for representing the missing value.

Now going back to spark:

s_df = spark.createDataFrame(p_df)
s_df

s_df.head(3)

[Row(bool=None, int=nan, num=nan, char=None), Row(bool=True, int=1.0, num=0.1, char='a'), Row(bool=False, int=2.0, num=0.2, char='b')]

When going back, the types are simplifed (object -> boolean, object -> string). Differently from Arrow, NaN are kept and not converted into None or the Spark null value.

Can we use the Pandas’s nullable datatypes when casting from Spark? I couldn’t find it, although you can:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.createDataFrame(s_df.toPandas()).head(3)

[Row(bool=None, int=None, num=None, char=None), Row(bool=True, int=1.0, num=0.1, char='a'), Row(bool=False, int=2.0, num=0.2, char='b')]

And now, you get None instead of NaNs to represent missing values for integers and numeric. Although this is tricky because it would wrongly convert real NaNs. into nulls.

By setting this feature to True, you also get the casts from nullable pandas data types:

spark.createDataFrame(p_df_nullable).head(3)

[Row(bool=None, int=None, num=None, char=None), Row(bool=True, int=1, num=0.1, char='a'), Row(bool=False, int=2, num=0.2, char='b')]

Future actions

Given this analysis, I think it make sense to make the following changes to reticulate:

We should provide an option to use Pandas nullable data types when casting from R to Pandas. And maybe - maybe - this should be the default. Given the current behavior, it seems that this is much safer.
We should support casting from Pandas nullable data types to R.
Similar to how Arrow and Spark works, when converting Pandas object columns to R, we should simply their data type if it only contains None and another scalar type.

Session Info

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11          rstudioapi_0.14      knitr_1.43          
 [4] magrittr_2.0.3       lattice_0.20-45      rlang_1.1.1         
 [7] fastmap_1.1.1        fansi_1.0.4          tools_4.2.2         
[10] grid_4.2.2           xfun_0.39            png_0.1-8           
[13] utf8_1.2.3           cli_3.6.1            htmltools_0.5.4     
[16] yaml_2.3.7           digest_0.6.31        tibble_3.2.1        
[19] lifecycle_1.0.3      Matrix_1.5-1         htmlwidgets_1.6.0   
[22] vctrs_0.6.2          glue_1.6.2           evaluate_0.19       
[25] rmarkdown_2.19       compiler_4.2.2       pillar_1.9.0        
[28] reticulate_1.30-9000 jsonlite_1.8.7       pkgconfig_2.0.3

reticulate::py_config()

python:         /Users/dfalbel/Documents/venv/reticulate/bin/python
libpython:      /Users/dfalbel/.pyenv/versions/3.9.16/lib/libpython3.9.dylib
pythonhome:     /Users/dfalbel/Documents/venv/reticulate:/Users/dfalbel/Documents/venv/reticulate
version:        3.9.16 (main, Dec 20 2022, 15:32:28)  [Clang 14.0.0 (clang-1400.0.29.202)]
numpy:          /Users/dfalbel/Documents/venv/reticulate/lib/python3.9/site-packages/numpy
numpy_version:  1.24.0

NOTE: Python version was forced by RETICULATE_PYTHON

pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.9.16.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.5.0
Version          : Darwin Kernel Version 22.5.0: Thu Jun  8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.2
numpy            : 1.24.0
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 58.1.0
pip              : 22.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.14.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.4.0
gcsfs            : None
matplotlib       : 3.7.2
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 12.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

pl.show_versions()

--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            macOS-13.4.1-arm64-arm-64bit
Python:              3.9.16 (main, Dec 20 2022, 15:43:04) 
[Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.4.0
matplotlib:          3.7.2
numpy:               1.24.0
pandas:              1.5.2
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

ps.__version__

'3.4.1'