Structs #
The data type `Struct` is a composite data type that can store multiple fields in a single column.
!!! tip "Python analogy"
    For Python users, the data type `Struct` is similar to a Python dictionary. Even better, if you are familiar with Python typing, you can think of the data type `Struct` as `typing.TypedDict`.
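To make the analogy concrete, here is a minimal sketch that uses only the Python standard library. The class name `MovieRating` and its fields are hypothetical and chosen purely for illustration:

```python
from typing import TypedDict


class MovieRating(TypedDict):
    """A fixed set of named, typed fields, much like the fields of a Struct."""

    Movie: str
    Theatre: str
    Avg_Rating: float


# At runtime, a TypedDict value is a plain dictionary with the declared keys.
rating: MovieRating = {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5}
print(rating["Movie"])
```

Just as every `MovieRating` has the same keys with the same types, every value in a `Struct` column has the same fields with the same data types.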
On this page of the user guide, we will see situations in which the data type `Struct` arises, understand why it arises, and learn how to work with `Struct` values.
Let’s start with a dataframe that captures the average rating of a few movies across some states in the US:
{{code_block(‘user-guide/expressions/structs’,‘ratings_df’,[‘DataFrame’])}}
--8<-- "python/user-guide/expressions/structs.py:ratings_df"
Encountering the data type `Struct` #
A common operation that produces a `Struct` column is the popular function `value_counts`, which is often used in exploratory data analysis. To check the number of times a state appears in the data, we can do the following:
{{code_block(‘user-guide/expressions/structs’,‘state_value_counts’,[‘value_counts’])}}
--8<-- "python/user-guide/expressions/structs.py:state_value_counts"
The output may be unexpected, especially if you are coming from tools that do not have such a data type. There is no cause for alarm, though. To get back to a more familiar output, all we need to do is use the function `unnest` on the `Struct` column:
{{code_block(‘user-guide/expressions/structs’,‘struct_unnest’,[‘unnest’])}}
--8<-- "python/user-guide/expressions/structs.py:struct_unnest"
The function `unnest` will turn each field of the `Struct` into its own column.
!!! note "Why `value_counts` returns a `Struct`"
    Polars expressions always operate on a single series and return another series.
    `Struct` is the data type that allows us to provide multiple columns as input to an expression, or to output multiple columns from an expression.
    Thus, we can use the data type `Struct` to specify each value and its count when we use `value_counts`.
Inferring the data type `Struct` from dictionaries #
When building series or dataframes, Polars will convert dictionaries to the data type `Struct`:
{{code_block(‘user-guide/expressions/structs’,‘series_struct’,[‘Series’])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct"
The number of fields, their names, and their types are inferred from the first dictionary seen. Subsequent dictionaries that do not match this schema can result in `null` values or in errors:
{{code_block(‘user-guide/expressions/structs’,‘series_struct_error’,[‘Series’])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct_error"
Extracting individual values of a `Struct` #
Let’s say that we needed to obtain just the field `"Movie"` from the `Struct` in the series that we created above. We can use the function `field` to do so:
{{code_block(‘user-guide/expressions/structs’,‘series_struct_extract’,[‘struct.field’])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct_extract"
Renaming individual fields of a `Struct` #
What if we need to rename individual fields of a `Struct` column? We use the function `rename_fields`:
{{code_block(‘user-guide/expressions/structs’,‘series_struct_rename’,[‘struct.rename_fields’])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct_rename"
To actually see that the field names were changed, we will create a dataframe where the only column is the result, and then use the function `unnest` so that each field becomes its own column. The column names will reflect the renaming operation we just did:
{{code_block(‘user-guide/expressions/structs’,‘struct-rename-check’,[‘struct.rename_fields’])}}
--8<-- "python/user-guide/expressions/structs.py:struct-rename-check"
Practical use-cases of `Struct` columns #
Identifying duplicate rows #
Let’s get back to the `ratings` data. We want to identify cases where there are duplicates at a “Movie” and “Theatre” level. This is where the data type `Struct` shines:
{{code_block(‘user-guide/expressions/structs’,‘struct_duplicates’,[‘is_duplicated’, ‘struct’])}}
--8<-- "python/user-guide/expressions/structs.py:struct_duplicates"
We can also identify the unique cases at this level with `is_unique`!
Multi-column ranking #
Suppose, given that we know there are duplicates, we want to choose which rating gets a higher priority. We can say that the column “Count” is the most important, and if there is a tie in the column “Count” then we consider the column “Avg_Rating”.
We can then do:
{{code_block(‘user-guide/expressions/structs’,‘struct_ranking’,[‘is_duplicated’, ‘struct’])}}
--8<-- "python/user-guide/expressions/structs.py:struct_ranking"
That’s a pretty complex set of requirements done very elegantly in Polars! To learn more about the function `over`, used above, see the user guide section on window functions.
Using multiple columns in a single expression #
As mentioned earlier, the data type `Struct` is also useful if you need to pass multiple columns as input to an expression. As an example, suppose we want to compute the Ackermann function on two columns of a dataframe. There is no way of composing Polars expressions to compute the Ackermann function[^1], so we define a custom function:
{{code_block(‘user-guide/expressions/structs’, ‘ack’, [])}}
--8<-- "python/user-guide/expressions/structs.py:ack"
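One straightforward way to write such a function is the direct recursive definition, sketched below in plain Python. It is only practical for small arguments, because the recursion depth grows very quickly:

```python
def ack(m: int, n: int) -> int:
    """Compute the Ackermann function for non-negative integers m and n."""
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))


print(ack(2, 3))
```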
Now, to compute the values of the Ackermann function on those arguments, we start by creating a `Struct` with fields `m` and `n` and then use the function `map_elements` to apply the function `ack` to each value:
{{code_block(‘user-guide/expressions/structs’,‘struct-ack’,[], [‘map_elements’], [])}}
--8<-- "python/user-guide/expressions/structs.py:struct-ack"
[^1]: To say that something cannot be done is quite a bold claim. If you prove us wrong, please let us know!