4  Polars for Pandas users



# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
import numpy as np

import polars as pl
import polars.selectors as cs

import pandas as pd

As of June 2025, Pandas is far and away the most widely used data frame package for Python. We are using Polars primarily because, in my opinion, its API is more intuitive and therefore easier for beginners and experts alike to use. It is also faster, sometimes much faster. It is nonetheless important to know about Pandas and how to use it, because many of your colleagues use it, and many packages you may use do, too.

Therefore, in this part of the lesson, I discuss how to convert a Polars data frame to Pandas, and vice versa. I also provide syntax for doing common tasks in Polars and Pandas. It is also worth reading the section of the Polars user guide comparing Pandas and Polars.

4.1 A sample data frame

For ease of discussion and comparison, we will use a simple data frame that has two categorical columns, 'c1' and 'c2', two quantitative columns of floats, 'q1' and 'q2', and a column 'i1' of integer values. It also has an id column, a unique identifier for each row that is useful when converting the data frame to tall format. Note that 'q1' has a null value and 'q2' has a NaN value.

data = dict(
    id=list(range(1, 9)),
    c1=['a']*4 + ['b']*4,
    c2=['c', 'd'] * 4,
    q1=[1.1, 2.2, 3.1, None, 2.9, 1.7, 3.0, 7.3],
    q2=[4.5, 2.3, np.nan, 1.1, 7.8, 2.3, 1.1, 0.8],
    i1=[5, 3, 0, 2, 4, 3, 4, 1],
)

df = pl.DataFrame(data)

# Take a look
df
shape: (8, 6)
id c1 c2 q1 q2 i1
i64 str str f64 f64 i64
1 "a" "c" 1.1 4.5 5
2 "a" "d" 2.2 2.3 3
3 "a" "c" 3.1 NaN 0
4 "a" "d" null 1.1 2
5 "b" "c" 2.9 7.8 4
6 "b" "d" 1.7 2.3 3
7 "b" "c" 3.0 1.1 4
8 "b" "d" 7.3 0.8 1

4.2 From Polars to Pandas and from Pandas to Polars

If you have a Polars data frame, you can directly convert it to a Pandas data frame using the to_pandas() method. Let’s do that for our data frame.

df.to_pandas()
id c1 c2 q1 q2 i1
0 1 a c 1.1 4.5 5
1 2 a d 2.2 2.3 3
2 3 a c 3.1 NaN 0
3 4 a d NaN 1.1 2
4 5 b c 2.9 7.8 4
5 6 b d 1.7 2.3 3
6 7 b c 3.0 1.1 4
7 8 b d 7.3 0.8 1

Note that the null value became a NaN. All missing data in Pandas are NaN. (Well, not really. You can have an object data type for a column that permits None values. However, when Pandas reads in data with missing entries, it assigns them to be NaN by default.)

Note also that Pandas has an index displayed on the left side of the data frame. In general, we will not use Pandas indexes.

Similarly, if you have a data frame in Pandas, you can convert it to a Polars data frame using the pl.from_pandas() function. Note that by default (the nan_to_null=True keyword argument), NaN values in the Pandas data frame are converted to null values in Polars.

pl.from_pandas(pd.DataFrame(data))
shape: (8, 6)
id c1 c2 q1 q2 i1
i64 str str f64 f64 i64
1 "a" "c" 1.1 4.5 5
2 "a" "d" 2.2 2.3 3
3 "a" "c" 3.1 null 0
4 "a" "d" null 1.1 2
5 "b" "c" 2.9 7.8 4
6 "b" "d" 1.7 2.3 3
7 "b" "c" 3.0 1.1 4
8 "b" "d" 7.3 0.8 1

4.3 Pandas and Polars for common tasks

Below is a table listing common data frame tasks done using Pandas and Polars.

| Description | Pandas | Polars |
| --- | --- | --- |
| Convert dictionary to df | pd.DataFrame(data) | pl.DataFrame(data) |
| Make 2D Numpy array into df | pd.DataFrame(my_ar, columns=['col1', 'col2', 'col3']) | pl.DataFrame(my_ar, schema=['col1', 'col2', 'col3'], orient='row') |
| Read from CSV | pd.read_csv(file_name) | pl.read_csv(file_name) |
| Lazily read CSV | — | pl.scan_csv(file_name) |
| Read from Excel | pd.read_excel(file_name) | pl.read_excel(file_name) |
| Read from JSON | pd.read_json(file_name) | pl.read_json(file_name) |
| Read from HDF5 | pd.read_hdf(file_name) | — |
| Write CSV | df.to_csv(fname, index=False) | df.write_csv(fname) |
| Rename columns | df.rename(columns={'c1': 'cat1', 'c2': 'cat2'}) | df.rename({'c1': 'cat1', 'c2': 'cat2'}) |
| Get column 'q1' as series | df['q1'] or df.loc[:, 'q1'] | df['q1'] or df.get_column('q1') |
| Get column 'q1' as data frame | df[['q1']] or df.loc[:, ['q1']] | df.select('q1') |
| Get columns 'c1' and 'q2' | df[['c1', 'q2']] or df.loc[:, ['c1', 'q2']] | df.select('c1', 'q2') |
| Get columns containing floats | df.select_dtypes(float) | df.select(cs.float()) |
| Get row 4 | df.loc[4, :] | df.row(4) or df.row(4, named=True) |
| Get row 4 as data frame | df.loc[[4], :] | df[4] |
| Get row where i1 is 2 | df.loc[df['i1']==2, :] | df.row(by_predicate=pl.col('i1')==2) or df.filter(pl.col('i1')==2) |
| Sub df with rows where c1 is 'a' | df.loc[df['c1']=='a', :] | df.filter(pl.col('c1')=='a') |
| Sub df where c1 is 'a' and c2 is 'd' | df.loc[(df['c1']=='a') & (df['c2']=='d'), :] | df.filter((pl.col('c1')=='a') & (pl.col('c2')=='d')) |
| Iterate over columns of df | for col, s in df.items() | for s in df |
| Iterate over rows of df | for row_ind, row in df.iterrows() | for r in df.iter_rows(named=True) |
| Group by single column | df.groupby('c1') | df.group_by('c1') |
| Group by maintaining order | df.groupby('c1', sort=False) | df.group_by('c1', maintain_order=True) |
| Group by multiple columns | df.groupby(['c1', 'c2']) | df.group_by(['c1', 'c2']) |
| Iterate over groups | for group, subdf in df.groupby('c1') | for (group,), subdf in df.group_by('c1') |
| Iterate over nested groups | for (g1, g2), subdf in df.groupby(['c1', 'c2']) | for (g1, g2), subdf in df.group_by(['c1', 'c2']) |
| Group by and apply mean¹ | df.groupby('c1').mean(numeric_only=True) | df.group_by('c1').mean() |
| Group by and apply median to one column | df.groupby('c1')['q1'].median() | df.group_by('c1').agg(pl.col('q1').median()) |
| Group by and apply mean to two columns | df.groupby('c1')[['q1', 'q2']].mean() | df.group_by('c1').agg(pl.col('q1', 'q2').mean()) |
| Group by and apply custom func to col² | df.groupby('c1')['q1'].apply(my_fun) | df.group_by('c1').agg(pl.col('q1').map_elements(my_fun, return_dtype=float)) |
| Group by and apply custom func to 2 cols³ | df.groupby('c1')[['q1', 'q2']].apply(my_fun) | df.group_by('c1').agg(pl.struct(['q1', 'q2']).map_elements(my_fun, return_dtype=float)) |
| Group by and rank within each group | df.groupby('c1')['q1'].rank() | df.select(pl.col('q1').rank().over('c1')) |
| Convert to tall format | df.melt(value_name='value', var_name='var', id_vars='id') | df.unpivot(value_name='value', variable_name='var', index='id') |
| Pivot tall result above | df_tall.pivot(columns='var', index='id').reset_index() | df_tall.pivot(on='var', index='id') |
| Select columns with string in name | df.filter(regex='q') or df[df.columns[df.columns.str.contains('q')]] | df.select(cs.contains('q')) |
| Add column of zeros to data frame | df['new_col'] = 0 or df.assign(new_col=0) | df.with_columns(pl.lit(0).alias('new_col')) |
| Add a Numpy array as column | df['new_col'] = my_array or df.assign(new_col=my_array) | df.with_columns(pl.Series(my_array).alias('new_col')) |
| Multiply two columns; make new column | df['q1q2'] = df['q1'] * df['q2'] or df.assign(q1q2=df['q1'] * df['q2']) | df.with_columns((pl.col('q1') * pl.col('q2')).alias('q1q2')) |
| Apply a function to each row making new col⁴ | df.assign(new_col=df.apply(my_fun, axis=1)) | df.with_columns(pl.struct(pl.all()).map_elements(my_fun, return_dtype=float).alias('new_col')) |
| Drop rows with missing data | df.dropna() | df.drop_nulls() |
| Sort according to a column | df.sort_values(by='i1') | df.sort(by='i1') |
| Inner join two data frames⁵ | pd.merge(df, df2, on=shared_columns) | df.join(df2, on=shared_columns) |
| Concatenate data frames vertically | pd.concat((df, df2)) | pl.concat((df, df2), how='diagonal') |
| Concatenate data frames horizontally | pd.concat((df, df2), axis=1) | pl.concat((df, df2), how='horizontal') |

Footnotes

  1. Note that in Pandas, NaNs are omitted from calculations like means. In Polars, NaNs are included, and the result will be NaN. However, nulls are not included.
  2. For Pandas, the function my_fun must take an array_like data type (list, Numpy array, Pandas Series, etc.) as input. For Polars, the function my_fun must take a Polars Series as input. It is wise to specify the data type of the output of the function (shown as float in the above example, but can be whatever type my_fun returns). A Pandas example: my_fun = lambda x: np.sum(np.sin(x)). A Polars example: my_fun = lambda s: s.exp().sum().
  3. For Pandas, the function must take a Pandas DataFrame as an argument. For Polars, it must take a Polars Series with a struct data type. A Pandas example: my_fun = lambda df: (np.sin(df['q1']) * df['q2']).sum(). A Polars example: my_fun = lambda s: (s.struct.field('q1').sin() * s.struct.field('q2')).sum()
  4. For Pandas, my_fun must take as its argument a Pandas Series with an index containing the names of the columns of the original data frame. For Polars, my_fun must take as its argument a dictionary with keys given by the names of the columns of the original data frame. The functions may then have the same syntax (though possibly with different type hints). An example: my_fun = lambda r: r['i1'] * np.sin(r['q2']). However, note that in Polars, a null value is treated as None, which means you cannot apply a function to it, multiply by it, etc.
  5. For Polars, the on kwarg for df.join() is required. With Pandas, which columns to join on are inferred based on like-names of columns.

4.4 Hierarchical indexes

Pandas supports hierarchical indexes, called MultiIndexes. These are not supported by Polars; Polars will not read a CSV file with hierarchical indexes. If you have a data set in a CSV file with hierarchical indexes, you can convert it to a CSV file in tall format, in which the MultiIndex has been converted to columns, using the bebi103.utils.unpivot_csv() function. This operation is akin to a df.melt() operation on a data frame with a hierarchical index. You can then read the converted CSV file into Polars and begin working with it.

4.5 Computing environment

%load_ext watermark
%watermark -v -p numpy,pandas,polars,jupyterlab
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.1.0

numpy     : 2.1.3
pandas    : 2.2.3
polars    : 1.27.1
jupyterlab: 4.3.6