Examples

This notebook contains a few examples of how to use pelage. The idea is to illustrate the main features with a succession of checks and transformations. We use a simple example here: the MPG dataset, loaded with the seaborn utility function.

Imports

import polars as pl
import seaborn as sns

import pelage as plg

data = pl.DataFrame(sns.load_dataset("mpg"))
data.head()

shape: (5, 9)

mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
f64	i64	f64	f64	i64	f64	i64	str	str
18.0	8	307.0	130.0	3504	12.0	70	"usa"	"chevrolet chev…
15.0	8	350.0	165.0	3693	11.5	70	"usa"	"buick skylark …
18.0	8	318.0	150.0	3436	11.0	70	"usa"	"plymouth satel…
16.0	8	304.0	150.0	3433	12.0	70	"usa"	"amc rebel sst"
17.0	8	302.0	140.0	3449	10.5	70	"usa"	"ford torino"

Basic data transformations

In the following example, we perform some basic checks, followed by a simple data transformation, and finally check for the presence of outliers.

average_mileage_per_zone = (
    data.pipe(plg.has_no_nulls, ["origin", "cylinders", "model_year"])
    .pipe(plg.accepted_range, {"cylinders": (3, 8)})
    .pipe(plg.accepted_values, {"origin": ["usa", "europe", "japan"]})
    .filter(pl.col("model_year") >= 80)
    .group_by("origin", "cylinders", "model_year")
    .agg(
        n_distinct_models=pl.col("name").n_unique(),
        avg_mpg=pl.col("mpg").mean(),
    )
    .filter(pl.col("n_distinct_models") >= 3)
    .pipe(plg.column_is_within_n_std, ("avg_mpg", 3))
)
average_mileage_per_zone

shape: (10, 5)

origin	cylinders	model_year	n_distinct_models	avg_mpg
str	i64	i64	u32	f64
"usa"	6	81	4	20.925
"usa"	4	80	6	27.05
"usa"	4	82	17	29.647059
"japan"	4	81	10	34.59
"japan"	4	80	11	36.709091
"europe"	4	81	3	31.866667
"usa"	6	82	3	28.333333
"usa"	4	81	7	30.95
"europe"	4	80	8	37.4
"japan"	4	82	9	34.888889

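To make the final outlier check more concrete, here is a rough sketch of what "within 3 standard deviations" means for the avg_mpg values shown in the result table. It uses only the standard library and assumes the check compares each value to the column mean plus or minus n sample standard deviations. That is an assumption about the semantics, not pelage's actual implementation, and the helper name values_outside_n_std is hypothetical.

```python
import statistics

# avg_mpg values copied from the result table above
avg_mpg = [20.925, 27.05, 29.647059, 34.59, 36.709091,
           31.866667, 28.333333, 30.95, 37.4, 34.888889]


def values_outside_n_std(values, n):
    """Return the values farther than n sample standard deviations from the mean.

    Hypothetical helper: assumes a symmetric band around the mean.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    return [v for v in values if abs(v - mean) > n * std]


outliers = values_outside_n_std(avg_mpg, 3)
print(outliers)  # an empty list means every value stays inside the band
```

A tighter band (for example n=1) would start flagging the extreme groups, which is a quick way to see which rows sit closest to the limit.
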
Focus on the errors

Error message

When a check fails, a PolarsAssertError exception is raised. The error message provides a summarized view of the problem that occurred during the check.

(
    data.pipe(
        plg.accepted_range,
        {"displacement": (50, 300), "horsepower": (50, 200)},
    )
)
# Generate a PolarsAssertError

---------------------------------------------------------------------------
PolarsAssertError                         Traceback (most recent call last)
Cell In[13], line 2
      1 (
----> 2     data.pipe(
      3         plg.accepted_range,
      4         {"displacement": (50, 300), "horsepower": (50, 200)},
      5     )
      6 )
      7 # Generate a PolarsAssertError

File ~/.pyenv/versions/3.10.13/envs/FC3.10/lib/python3.10/site-packages/polars/dataframe/frame.py:5128, in DataFrame.pipe(self, function, *args, **kwargs)
   5063 def pipe(
   5064     self,
   5065     function: Callable[Concatenate[DataFrame, P], T],
   5066     *args: P.args,
   5067     **kwargs: P.kwargs,
   5068 ) -> T:
   5069     """
   5070     Offers a structured way to apply a sequence of user-defined functions (UDFs).
   5071 
   (...)
   5126     └─────┴─────┘
   5127     """
-> 5128     return function(self, *args, **kwargs)

File ~/code/alixtc/pelage/pelage/checks.py:987, in accepted_range(data, items)
    985 out_of_range = data.filter(pl.Expr.or_(*forbidden_ranges))
    986 if not out_of_range.is_empty():
--> 987     raise PolarsAssertError(
    988         out_of_range, "Some values are beyond the acceptable ranges defined"
    989     )
    990 return data

PolarsAssertError: Details
shape: (104, 9)
┌──────┬───────────┬─────────────┬────────────┬───┬─────────────┬────────────┬────────┬────────────┐
│ mpg  ┆ cylinders ┆ displacemen ┆ horsepower ┆ … ┆ acceleratio ┆ model_year ┆ origin ┆ name       │
│ ---  ┆ ---       ┆ t           ┆ ---        ┆   ┆ n           ┆ ---        ┆ ---    ┆ ---        │
│ f64  ┆ i64       ┆ ---         ┆ f64        ┆   ┆ ---         ┆ i64        ┆ str    ┆ str        │
│      ┆           ┆ f64         ┆            ┆   ┆ f64         ┆            ┆        ┆            │
╞══════╪═══════════╪═════════════╪════════════╪═══╪═════════════╪════════════╪════════╪════════════╡
│ 18.0 ┆ 8         ┆ 307.0       ┆ 130.0      ┆ … ┆ 12.0        ┆ 70         ┆ usa    ┆ chevrolet  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ chevelle   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ malibu     │
│ 15.0 ┆ 8         ┆ 350.0       ┆ 165.0      ┆ … ┆ 11.5        ┆ 70         ┆ usa    ┆ buick      │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ skylark    │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ 320        │
│ 18.0 ┆ 8         ┆ 318.0       ┆ 150.0      ┆ … ┆ 11.0        ┆ 70         ┆ usa    ┆ plymouth   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ satellite  │
│ 16.0 ┆ 8         ┆ 304.0       ┆ 150.0      ┆ … ┆ 12.0        ┆ 70         ┆ usa    ┆ amc rebel  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ sst        │
│ 17.0 ┆ 8         ┆ 302.0       ┆ 140.0      ┆ … ┆ 10.5        ┆ 70         ┆ usa    ┆ ford       │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ torino     │
│ …    ┆ …         ┆ …           ┆ …          ┆ … ┆ …           ┆ …          ┆ …      ┆ …          │
│ 18.5 ┆ 8         ┆ 360.0       ┆ 150.0      ┆ … ┆ 13.0        ┆ 79         ┆ usa    ┆ chrysler   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ lebaron    │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ town @     │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ country …  │
│ 23.0 ┆ 8         ┆ 350.0       ┆ 125.0      ┆ … ┆ 17.4        ┆ 79         ┆ usa    ┆ cadillac   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ eldorado   │
│ 44.3 ┆ 4         ┆ 90.0        ┆ 48.0       ┆ … ┆ 21.7        ┆ 80         ┆ europe ┆ vw rabbit  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ c (diesel) │
│ 43.4 ┆ 4         ┆ 90.0        ┆ 48.0       ┆ … ┆ 23.7        ┆ 80         ┆ europe ┆ vw dasher  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ (diesel)   │
│ 26.6 ┆ 8         ┆ 350.0       ┆ 105.0      ┆ … ┆ 19.0        ┆ 81         ┆ usa    ┆ oldsmobile │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ cutlass ls │
└──────┴───────────┴─────────────┴────────────┴───┴─────────────┴────────────┴────────┴────────────┘
Error with the DataFrame passed to the check function:
-->Some values are beyond the acceptable ranges defined
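
The traceback shows that accepted_range keeps every row matching at least one "forbidden range" condition and raises if any such row exists. As a rough stdlib-only sketch of that logic, assuming both bounds are inclusive (the notebook does not state this explicitly): the helper rows_out_of_range is hypothetical, the first two rows are taken from the error output, and the third is invented for illustration.

```python
# Ranges from the failing check above: column -> (lower, upper)
ranges = {"displacement": (50, 300), "horsepower": (50, 200)}

rows = [
    {"displacement": 307.0, "horsepower": 130.0},  # from the error output
    {"displacement": 90.0, "horsepower": 48.0},    # from the error output
    {"displacement": 120.0, "horsepower": 90.0},   # invented, in-range row
]


def rows_out_of_range(rows, ranges):
    """Keep rows where at least one column falls outside its range.

    Hypothetical helper; bounds are assumed to be inclusive.
    """
    return [
        row
        for row in rows
        if any(not (low <= row[col] <= high) for col, (low, high) in ranges.items())
    ]


failing = rows_out_of_range(rows, ranges)
print(failing)  # a non-empty result is what would trigger the error
```

Note that a row is reported as soon as one column is out of range: the first row fails on displacement only, the second on horsepower only.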

Investigating the cause of the failure

In addition to helping the user better understand the root cause of the check failure, the error object also has a df attribute that contains the values identified as causing the check to fail.

Here is a simple way to retrieve the error without adding a try/except block: in an interactive session, sys.last_value holds the last uncaught exception. We can then print the error as a string.

import sys

error = sys.last_value

print(error)
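
Outside an interactive session, an explicit try/except does the same job. The sketch below shows the pattern with a stand-in exception class (FakeCheckError and failing_check are hypothetical, so that the snippet runs without pelage). With pelage you would catch PolarsAssertError instead and read its df attribute in the same way.

```python
class FakeCheckError(Exception):
    """Stand-in for a check error carrying the offending rows in `df`."""

    def __init__(self, df, message):
        super().__init__(message)
        self.df = df


def failing_check(rows):
    """Hypothetical check: reject rows with horsepower above 200."""
    bad = [row for row in rows if row["horsepower"] > 200]
    if bad:
        raise FakeCheckError(bad, "Some values are beyond the acceptable ranges defined")
    return rows


error = None
try:
    failing_check([{"horsepower": 130.0}, {"horsepower": 230.0}])
except FakeCheckError as exc:
    error = exc  # keep a reference: `exc` is cleared when the except block ends

print(error)     # the summary message
print(error.df)  # the rows that triggered the failure
```

Assigning the exception to a separate name matters because Python deletes the `as exc` variable at the end of the except block.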

You can then manipulate the subset dataframe containing the rows that triggered the exception. Here we do a few manipulations to determine which values fall outside the specified boundaries, as well as their relative importance within the dataset.

(
    pl.DataFrame(error.df)  # Wrapping in pl.DataFrame is only here to obtain syntax highlighting
    .select(pl.col("displacement", "horsepower"))
    .describe()
)

shape: (9, 3)

statistic	displacement	horsepower
str	f64	f64
"count"	104.0	104.0
"null_count"	0.0	0.0
"mean"	334.221154	154.278846
"std"	74.472899	37.102968
"min"	68.0	46.0
"25%"	305.0	140.0
"50%"	350.0	150.0
"75%"	360.0	175.0
"max"	455.0	230.0