There’s room for improving how reticulate handles missing values when converting between R and Python (pandas) data frames. This document highlights the current behavior and takes inspiration from Pandas <-> Arrow casts to propose improvements.
Current behavior
The following data.frame contains NAs in the four most used data types in R. We use tibble::as_tibble() to print it, as tibbles display the column types more clearly.
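For reference, a data.frame along these lines (assumed from the printed tibble below; the exact construction isn’t shown here) produces the output that follows:
df <- data.frame(
  bool = c(NA, TRUE, FALSE),
  int  = c(NA_integer_, 1L, 2L),
  num  = c(NA_real_, 0.1, 0.2),
  char = c(NA_character_, "a", "b")
)
tibble::as_tibble(df)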
# A tibble: 3 × 4
bool int num char
<lgl> <int> <dbl> <chr>
1 NA NA NA <NA>
2 TRUE 1 0.1 a
3 FALSE 2 0.2 b
Now, casting it into a pandas data frame, we see the following:
p_df <- reticulate::r_to_py(df)
For bool: NA became True, which is quite unexpected.
For int: NA became the largest possible integer. This is also unexpected, although pandas’ default integer data type does not support missing values, so in principle one must cast to float to represent them.
For num: we got NaN, which is the default missing value in Pandas, even though an experimental pd.NA value also exists.
For char: NA became the character string "NA", which is also unlikely to be the best way to handle the conversion.
Getting values back into R
Casting the pandas data frame back into R, we get:
reticulate::py_to_r(p_df)
bool int num char
1 TRUE NA NA NA
2 TRUE 1 0.1 a
3 FALSE 2 0.2 b
It mostly works, except that the missing boolean value is lost.
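To make the loss explicit, a quick check on the round-tripped data (using the same objects as above) could look like this:
back <- reticulate::py_to_r(p_df)
is.na(back$bool)
# FALSE FALSE FALSE -- the original NA is no longer detectable; it came back as TRUE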
How others do it
reticulate is not the only library that needs to convert tables of these types, possibly containing missing values, into pandas data frames. We looked into how Arrow and Polars behave in such cases. Both libraries support missing values via an explicit null value. Polars is based on Arrow, so there shouldn’t be many differences from Arrow besides some additional handling of NaNs.
import pyarrow as pa
b = pa.array([None, True, False])
i = pa.array([None, 1, 2])
n = pa.array([None, 0.1, 0.2])
c = pa.array([None, "a", "b"])
at = pa.table([b, i, n, c], names=["bool", "int", "num", "char"])
And this is the result of the cast:
p_df = at.to_pandas()
p_df
bool int num char
0 None NaN NaN None
1 True 1.0 0.1 a
2 False 2.0 0.2 b
p_df.dtypes
bool object
int float64
num float64
char object
dtype: object
Note that:
bool and char were cast into object columns. The object data type in Pandas is used for columns with mixed types. One possible downside of this approach is that the missing values become bool after any comparison, and None counts as False (or 0) when you sum a column containing it:
p_df['bool'] & True
0 False
1 True
2 False
Name: bool, dtype: bool
p_df['bool'] | True
0 False
1 True
2 True
Name: bool, dtype: bool
p_df['bool'].sum()
1
p_df['bool'].isna()
0 True
1 False
2 False
Name: bool, dtype: bool
int has been cast into a ‘float64’ type. This reflects Pandas’ default approach too, since integer columns can’t represent NaN (the default missing value). This approach seems reasonable for reticulate to consider, especially since R is generally flexible about converting between integers and doubles (see the short sketch after this list).
num: the default missing value (NaN) is used.
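On the int point above: R coerces integers to doubles without losing the missing value, so a float64 cast is relatively benign. A minimal illustration:
x <- c(NA, 1L, 2L)
typeof(x)        # "integer"
as.double(x)     # NA 1 2 -- the NA survives the cast to double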
What happens with round trip casts
One question that came up is what happens if we take the Arrow table (which natively supports missing values), cast it into pandas, and then bring it into R. Can reticulate correctly infer the data types? From the above, it seems hard to get back the same Arrow table we started with, but let’s try:
tibble::as_tibble(reticulate::py$p_df)
# A tibble: 3 × 4
bool int num char
<list> <dbl> <dbl> <list>
1 <NULL> NaN NaN <NULL>
2 <lgl [1]> 1 0.1 <chr [1]>
3 <lgl [1]> 2 0.2 <chr [1]>
It seems reasonable, but we don’t simplify the column types, which Arrow does nicely.
What if we get the Pandas table that uses the nullable data types?
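The construction of p_df_nullable isn’t shown in this excerpt; one plausible way to build such a table (an assumption on my part) is pandas’ convert_dtypes(), run from R via reticulate:
# Convert p_df's columns to the pandas nullable extension dtypes
# (typically boolean, Int64, Float64 and string); convert_dtypes() is one
# way to do this -- the original table may have been built differently.
reticulate::py_run_string("p_df_nullable = p_df.convert_dtypes()")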
reticulate::py$p_df_nullable
Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
reticulate currently fails to convert these nullable columns cleanly, as the warning shows. We also looked at how PySpark handles the same conversions. It looks like PySpark uses the same behavior as Arrow (and Polars): objects for booleans and chars, representing nulls with Python None, while integers are converted into floats with NaN representing the missing value.
When going back, the types are simplified (object -> boolean, object -> string). Unlike Arrow, NaNs are kept and not converted into None or the Spark null value.
Can we use Pandas’ nullable data types when casting from Spark? I couldn’t find a built-in option for it, although you can:
And now you get None instead of NaN to represent missing values for integers and numerics. This is tricky, though, because it would wrongly convert real NaNs into nulls.
By setting this feature to True, you also get the casts from nullable pandas data types:
Given this analysis, I think it makes sense to make the following changes to reticulate:
We should provide an option to use Pandas’ nullable data types when casting from R to Pandas. And maybe (just maybe) this should be the default: given the current behavior, it seems much safer.
We should support casting from Pandas nullable data types to R.
Similar to how Arrow and Spark work, when converting Pandas object columns to R, we should simplify their data type when a column only contains None plus values of a single scalar type (a rough sketch of this follows below).
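For illustration, here is a rough sketch (not reticulate’s actual implementation) of what that simplification could look like on the R side, assuming missing values come through as NULL elements of a list column:
simplify_column <- function(x) {
  # Only list columns (produced from pandas object columns) are candidates.
  if (!is.list(x)) return(x)
  is_missing <- vapply(x, is.null, logical(1))
  types <- unique(vapply(x[!is_missing], typeof, character(1)))
  # Keep the list column as-is when it holds genuinely mixed types.
  if (length(types) != 1) return(x)
  x[is_missing] <- NA
  unlist(x)
}
# e.g. simplify_column(list(NULL, TRUE, FALSE))  ->  c(NA, TRUE, FALSE)
# e.g. simplify_column(list(NULL, "a", "b"))     ->  c(NA, "a", "b")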