Pelage: Defensive analysis for Polars

Examples

This notebook contains a few examples of how to use pelage. The idea is to illustrate the main features through a succession of checks and transformations. We use a simple example here: the MPG dataset, loaded with the seaborn utility function.

Imports

import polars as pl
import seaborn as sns

import pelage as plg

data = pl.DataFrame(sns.load_dataset("mpg"))
data.head()
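shape: (5, 9)
┌──────┬───────────┬──────────────┬────────────┬────────┬──────────────┬────────────┬────────┬──────────────────┐
│ mpg  ┆ cylinders ┆ displacement ┆ horsepower ┆ weight ┆ acceleration ┆ model_year ┆ origin ┆ name             │
│ ---  ┆ ---       ┆ ---          ┆ ---        ┆ ---    ┆ ---          ┆ ---        ┆ ---    ┆ ---              │
│ f64  ┆ i64       ┆ f64          ┆ f64        ┆ i64    ┆ f64          ┆ i64        ┆ str    ┆ str              │
╞══════╪═══════════╪══════════════╪════════════╪════════╪══════════════╪════════════╪════════╪══════════════════╡
│ 18.0 ┆ 8         ┆ 307.0        ┆ 130.0      ┆ 3504   ┆ 12.0         ┆ 70         ┆ "usa"  ┆ "chevrolet chev… │
│ 15.0 ┆ 8         ┆ 350.0        ┆ 165.0      ┆ 3693   ┆ 11.5         ┆ 70         ┆ "usa"  ┆ "buick skylark … │
│ 18.0 ┆ 8         ┆ 318.0        ┆ 150.0      ┆ 3436   ┆ 11.0         ┆ 70         ┆ "usa"  ┆ "plymouth satel… │
│ 16.0 ┆ 8         ┆ 304.0        ┆ 150.0      ┆ 3433   ┆ 12.0         ┆ 70         ┆ "usa"  ┆ "amc rebel sst"  │
│ 17.0 ┆ 8         ┆ 302.0        ┆ 140.0      ┆ 3449   ┆ 10.5         ┆ 70         ┆ "usa"  ┆ "ford torino"    │
└──────┴───────────┴──────────────┴────────────┴────────┴──────────────┴────────────┴────────┴──────────────────┘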

Basic data transformations

In the following example, we perform some basic checks, followed by a simple data transformation, and finally check for the presence of outliers.

average_mileage_per_zone = (
    data.pipe(plg.has_no_nulls, ["origin", "cylinders", "model_year"])
    .pipe(plg.accepted_range, {"cylinders": (3, 8)})
    .pipe(plg.accepted_values, {"origin": ["usa", "europe", "japan"]})
    .filter(pl.col("model_year") >= 80)
    .group_by("origin", "cylinders", "model_year")
    .agg(
        n_distinct_models=pl.col("name").n_unique(),
        avg_mpg=pl.col("mpg").mean(),
    )
    .filter(pl.col("n_distinct_models") >= 3)
    .pipe(plg.column_is_within_n_std, ("avg_mpg", 3))
)
average_mileage_per_zone
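shape: (10, 5)
┌──────────┬───────────┬────────────┬───────────────────┬───────────┐
│ origin   ┆ cylinders ┆ model_year ┆ n_distinct_models ┆ avg_mpg   │
│ ---      ┆ ---       ┆ ---        ┆ ---               ┆ ---       │
│ str      ┆ i64       ┆ i64        ┆ u32               ┆ f64       │
╞══════════╪═══════════╪════════════╪═══════════════════╪═══════════╡
│ "usa"    ┆ 6         ┆ 81         ┆ 4                 ┆ 20.925    │
│ "usa"    ┆ 4         ┆ 80         ┆ 6                 ┆ 27.05     │
│ "usa"    ┆ 4         ┆ 82         ┆ 17                ┆ 29.647059 │
│ "japan"  ┆ 4         ┆ 81         ┆ 10                ┆ 34.59     │
│ "japan"  ┆ 4         ┆ 80         ┆ 11                ┆ 36.709091 │
│ "europe" ┆ 4         ┆ 81         ┆ 3                 ┆ 31.866667 │
│ "usa"    ┆ 6         ┆ 82         ┆ 3                 ┆ 28.333333 │
│ "usa"    ┆ 4         ┆ 81         ┆ 7                 ┆ 30.95     │
│ "europe" ┆ 4         ┆ 80         ┆ 8                 ┆ 37.4      │
│ "japan"  ┆ 4         ┆ 82         ┆ 9                 ┆ 34.888889 │
└──────────┴───────────┴────────────┴───────────────────┴───────────┘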

Focus on the errors

Error message

When a check fails, a PolarsAssertError exception is raised. The error message gives a summarized view of the problem found during the check: the offending rows, followed by a short description.

(
    data.pipe(
        plg.accepted_range,
        {"displacement": (50, 300), "horsepower": (50, 200)},
    )
)
# Generate a PolarsAssertError
---------------------------------------------------------------------------
PolarsAssertError                         Traceback (most recent call last)
Cell In[13], line 2
      1 (
----> 2     data.pipe(
      3         plg.accepted_range,
      4         {"displacement": (50, 300), "horsepower": (50, 200)},
      5     )
      6 )
      7 # Generate a PolarsAssertError

File ~/.pyenv/versions/3.10.13/envs/FC3.10/lib/python3.10/site-packages/polars/dataframe/frame.py:5128, in DataFrame.pipe(self, function, *args, **kwargs)
   5063 def pipe(
   5064     self,
   5065     function: Callable[Concatenate[DataFrame, P], T],
   5066     *args: P.args,
   5067     **kwargs: P.kwargs,
   5068 ) -> T:
   5069     """
   5070     Offers a structured way to apply a sequence of user-defined functions (UDFs).
   5071 
   (...)
   5126     └─────┴─────┘
   5127     """
-> 5128     return function(self, *args, **kwargs)

File ~/code/alixtc/pelage/pelage/checks.py:987, in accepted_range(data, items)
    985 out_of_range = data.filter(pl.Expr.or_(*forbidden_ranges))
    986 if not out_of_range.is_empty():
--> 987     raise PolarsAssertError(
    988         out_of_range, "Some values are beyond the acceptable ranges defined"
    989     )
    990 return data

PolarsAssertError: Details
shape: (104, 9)
┌──────┬───────────┬─────────────┬────────────┬───┬─────────────┬────────────┬────────┬────────────┐
│ mpg  ┆ cylinders ┆ displacemen ┆ horsepower ┆ … ┆ acceleratio ┆ model_year ┆ origin ┆ name       │
│ ---  ┆ ---       ┆ t           ┆ ---        ┆   ┆ n           ┆ ---        ┆ ---    ┆ ---        │
│ f64  ┆ i64       ┆ ---         ┆ f64        ┆   ┆ ---         ┆ i64        ┆ str    ┆ str        │
│      ┆           ┆ f64         ┆            ┆   ┆ f64         ┆            ┆        ┆            │
╞══════╪═══════════╪═════════════╪════════════╪═══╪═════════════╪════════════╪════════╪════════════╡
│ 18.0 ┆ 8         ┆ 307.0       ┆ 130.0      ┆ … ┆ 12.0        ┆ 70         ┆ usa    ┆ chevrolet  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ chevelle   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ malibu     │
│ 15.0 ┆ 8         ┆ 350.0       ┆ 165.0      ┆ … ┆ 11.5        ┆ 70         ┆ usa    ┆ buick      │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ skylark    │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ 320        │
│ 18.0 ┆ 8         ┆ 318.0       ┆ 150.0      ┆ … ┆ 11.0        ┆ 70         ┆ usa    ┆ plymouth   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ satellite  │
│ 16.0 ┆ 8         ┆ 304.0       ┆ 150.0      ┆ … ┆ 12.0        ┆ 70         ┆ usa    ┆ amc rebel  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ sst        │
│ 17.0 ┆ 8         ┆ 302.0       ┆ 140.0      ┆ … ┆ 10.5        ┆ 70         ┆ usa    ┆ ford       │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ torino     │
│ …    ┆ …         ┆ …           ┆ …          ┆ … ┆ …           ┆ …          ┆ …      ┆ …          │
│ 18.5 ┆ 8         ┆ 360.0       ┆ 150.0      ┆ … ┆ 13.0        ┆ 79         ┆ usa    ┆ chrysler   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ lebaron    │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ town @     │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ country …  │
│ 23.0 ┆ 8         ┆ 350.0       ┆ 125.0      ┆ … ┆ 17.4        ┆ 79         ┆ usa    ┆ cadillac   │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ eldorado   │
│ 44.3 ┆ 4         ┆ 90.0        ┆ 48.0       ┆ … ┆ 21.7        ┆ 80         ┆ europe ┆ vw rabbit  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ c (diesel) │
│ 43.4 ┆ 4         ┆ 90.0        ┆ 48.0       ┆ … ┆ 23.7        ┆ 80         ┆ europe ┆ vw dasher  │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ (diesel)   │
│ 26.6 ┆ 8         ┆ 350.0       ┆ 105.0      ┆ … ┆ 19.0        ┆ 81         ┆ usa    ┆ oldsmobile │
│      ┆           ┆             ┆            ┆   ┆             ┆            ┆        ┆ cutlass ls │
└──────┴───────────┴─────────────┴────────────┴───┴─────────────┴────────────┴────────┴────────────┘
Error with the DataFrame passed to the check function:
-->Some values are beyond the acceptable ranges defined

Investigating the cause of the failure

Beyond helping the user understand the root cause of the check failure, the error object also has a df attribute that contains the values that caused the check to fail.

Here is how to retrieve the error without adding a try/except block, which also lets us print it as a string. Note that sys.last_value is only set after an unhandled exception in an interactive session (such as a notebook or REPL); in a script, catch the exception with try/except instead.

import sys

error = sys.last_value

print(error)

You can then manipulate the subset dataframe containing the rows that triggered the exception. Here we run a few manipulations to determine which values fall outside the specified boundaries, as well as their relative importance within the dataset.

(
    pl.DataFrame(error.df)  # This is only here to obtain syntax highlighting
    .select(pl.col("displacement", "horsepower"))
    .describe()
)
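shape: (9, 3)
┌──────────────┬──────────────┬────────────┐
│ statistic    ┆ displacement ┆ horsepower │
│ ---          ┆ ---          ┆ ---        │
│ str          ┆ f64          ┆ f64        │
╞══════════════╪══════════════╪════════════╡
│ "count"      ┆ 104.0        ┆ 104.0      │
│ "null_count" ┆ 0.0          ┆ 0.0        │
│ "mean"       ┆ 334.221154   ┆ 154.278846 │
│ "std"        ┆ 74.472899    ┆ 37.102968  │
│ "min"        ┆ 68.0         ┆ 46.0       │
│ "25%"        ┆ 305.0        ┆ 140.0      │
│ "50%"        ┆ 350.0        ┆ 150.0      │
│ "75%"        ┆ 360.0        ┆ 175.0      │
│ "max"        ┆ 455.0        ┆ 230.0      │
└──────────────┴──────────────┴────────────┘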