Installing Polars

Polars is a library, so installing it is as simple as invoking the package manager of the corresponding programming language.

pip install polars

# Or, for older CPUs that do not support AVX2
pip install polars-lts-cpu

cargo add polars -F lazy

# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["lazy", ...]}

Big index

By default, Polars dataframes are limited to 2^32 (~4.3 billion) rows. Enable the big index extension to increase this limit to 2^64 (~18 quintillion) rows:

pip install polars-u64-idx

cargo add polars -F bigidx

# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["bigidx", ...] }

Legacy CPU

To install Polars for Python on an old CPU that does not support AVX, run:

pip install polars-lts-cpu

Importing Polars

To use the library, import it into your project:

import polars as pl

use polars::prelude::*;
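
As a quick sanity check after installing (a minimal sketch; the data is illustrative), you can build a small dataframe right away:

import polars as pl

# Construct a tiny dataframe to verify that the installation works
df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(df)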

Feature flags

By using the commands above, you install the core of Polars onto your system. However, depending on your use case, you might want to install optional dependencies as well. These are made optional to minimize the footprint. The flags differ per programming language. Throughout the user guide it will be mentioned when a functionality you are about to use requires an additional dependency.

Python

# Example
pip install 'polars[numpy,fsspec]'

All

Flag  Description
all   Install all optional dependencies.

GPU

Flag  Description
gpu   Run queries on NVIDIA GPUs.

Note

For more detailed instructions and prerequisites, see the GPU support page.
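
As a hedged sketch of what this flag enables (assuming the gpu extra is installed and a supported NVIDIA GPU is present), a query can be routed to the GPU engine when it is collected:

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3]})

# Request the GPU engine at collection time; by default Polars falls
# back to the CPU engine for queries the GPU engine does not support.
df = lf.select(pl.col("a").sum()).collect(engine="gpu")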

Interop

Flag      Description
pandas    Convert data to and from pandas dataframes/series.
numpy     Convert data to and from NumPy arrays.
pyarrow   Convert data to and from PyArrow tables/arrays.
pydantic  Convert data from Pydantic models to Polars.
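
For instance, with the pandas and numpy extras installed, round-tripping data looks like this (a minimal sketch):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

pdf = df.to_pandas()        # Polars -> pandas (pandas flag)
back = pl.from_pandas(pdf)  # pandas -> Polars
arr = df.to_numpy()         # Polars -> NumPy (numpy flag)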

Excel

Flag        Description
calamine    Read from Excel files with the calamine engine.
openpyxl    Read from Excel files with the openpyxl engine.
xlsx2csv    Read from Excel files with the xlsx2csv engine.
xlsxwriter  Write to Excel files with the XlsxWriter engine.
excel       Install all supported Excel engines.
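
As a small sketch (the file paths are placeholders), the engine argument of read_excel selects one of the installed backends:

import polars as pl

# Read with the calamine engine; writing uses XlsxWriter
df = pl.read_excel("data.xlsx", engine="calamine")
df.write_excel("out.xlsx")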

Database

Flag        Description
adbc        Read from and write to databases with the Arrow Database Connectivity (ADBC) engine.
connectorx  Read from databases with the ConnectorX engine.
sqlalchemy  Write to databases with the SQLAlchemy engine.
database    Install all supported database engines.
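
A hedged sketch of how these engines are selected (the connection URI, query, and table names are hypothetical):

import polars as pl

uri = "postgresql://user:password@localhost:5432/mydb"

# Read via ConnectorX; writing defaults to the SQLAlchemy engine
df = pl.read_database_uri("SELECT * FROM my_table", uri, engine="connectorx")
df.write_database("my_table_copy", connection=uri)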

Cloud

Flag    Description
fsspec  Read from and write to remote file systems.
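
For example (a minimal sketch; the bucket path is hypothetical), with fsspec installed the eager readers accept remote paths directly:

import polars as pl

# fsspec resolves the s3:// filesystem behind the scenes
df = pl.read_csv("s3://my-bucket/data.csv")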

Other I/O

Flag       Description
deltalake  Read from and write to Delta tables.
iceberg    Read from Apache Iceberg tables.
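
A brief sketch (the table paths are placeholders) of reading both formats:

import polars as pl

df = pl.read_delta("path/to/delta_table")                 # deltalake flag
df2 = pl.scan_iceberg("path/to/iceberg_table").collect()  # iceberg flag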

Other

Flag         Description
async        Collect LazyFrames asynchronously.
cloudpickle  Serialize user-defined functions.
graph        Visualize LazyFrames as a graph.
plot         Plot dataframes through the plot namespace.
style        Style dataframes through the style namespace.
timezone     Timezone support; only needed on Windows.
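
As one hedged example from this table, the async flag allows a LazyFrame to be collected without blocking the event loop (a minimal sketch):

import asyncio
import polars as pl

async def main() -> None:
    lf = pl.LazyFrame({"a": [1, 2, 3]})
    df = await lf.collect_async()  # requires the async flag
    print(df)

asyncio.run(main())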

Rust

# Cargo.toml
[dependencies]
polars = { version = "0.26.1", features = ["lazy", "temporal", "describe", "json", "parquet", "dtype-datetime"] }

The opt-in features are:

  • Extra data types:
    • dtype-date
    • dtype-datetime
    • dtype-time
    • dtype-duration
    • dtype-i8
    • dtype-i16
    • dtype-u8
    • dtype-u16
    • dtype-categorical
    • dtype-struct
  • lazy - Lazy API:
    • regex - Use regexes in column selection.
    • dot_diagram - Create dot diagrams from lazy logical plans.
  • sql - Pass SQL queries to Polars.
  • streaming - Be able to process datasets that are larger than RAM.
  • random - Generate arrays with randomly sampled values.
  • ndarray - Convert from DataFrame to ndarray.
  • temporal - Automatic conversion between Chrono and Polars for temporal data types.
  • timezones - Activate timezone support.
  • strings - Extra string utilities for StringChunked:
    • string_pad - for pad_start, pad_end, zfill.
    • string_to_integer - for parse_int.
  • object - Support for generic ChunkedArrays called ObjectChunked<T> (generic over T). These are downcastable from Series through the Any trait.
  • Performance related:
    • nightly - Several nightly only features such as SIMD and specialization.
    • performant - more fast paths, slower compile times.
    • bigidx - Activate this feature if you expect more than 2^32 rows. This allows Polars to scale up way beyond that by using u64 as an index. Polars will be a bit slower with this feature activated as many data structures are less cache efficient.
    • cse - Activate common subplan elimination optimization.
  • IO related:
    • serde - Support for serde serialization and deserialization. Can be used for JSON and more serde supported serialization formats.
    • serde-lazy - Support for serde serialization and deserialization. Can be used for JSON and more serde supported serialization formats.
    • parquet - Read Apache Parquet format.
    • json - JSON serialization.
    • ipc - Arrow’s IPC format serialization.
    • decompress - Automatically infer compression of csvs and decompress them. Supported compressions:
      • gzip
      • zlib
      • zstd
  • Dataframe operations:
    • dynamic_group_by - Group by based on a time window instead of predefined keys. Also activates rolling window group by operations.
    • sort_multiple - Allow sorting a dataframe on multiple columns.
    • rows - Create dataframe from rows and extract rows from dataframes. Also activates pivot and transpose operations.
    • join_asof - Join ASOF, to join on nearest keys instead of exact equality match.
    • cross_join - Create the Cartesian product of two dataframes.
    • semi_anti_join - SEMI and ANTI joins.
    • row_hash - Utility to hash dataframe rows to UInt64Chunked.
    • diagonal_concat - Diagonal concatenation thereby combining different schemas.
    • dataframe_arithmetic - Arithmetic between dataframes and other dataframes or series.
    • partition_by - Split into multiple dataframes partitioned by groups.
  • Series/expression operations:
    • is_in - Check for membership in Series.
    • zip_with - Zip two Series / ChunkedArrays.
    • round_series - round underlying float types of series.
    • repeat_by - Repeat element in an array a number of times specified by another array.
    • is_first_distinct - Check if element is first unique value.
    • is_last_distinct - Check if element is last unique value.
    • checked_arithmetic - checked arithmetic returning None on invalid operations.
    • dot_product - Dot/inner product on series and expressions.
    • concat_str - Concatenate string data in linear time.
    • reinterpret - Utility to reinterpret bits to signed/unsigned.
    • take_opt_iter - Take from a series with Iterator<Item=Option<usize>>.
    • mode - Return the most frequently occurring value(s).
    • cum_agg - cum_sum, cum_min, and cum_max, aggregations.
    • rolling_window - rolling window functions, like rolling_mean.
    • interpolate - Interpolate None values.
    • extract_jsonpath - Run jsonpath queries on StringChunked.
    • list - List utils:
      • list_gather - take sublist by multiple indices.
    • rank - Ranking algorithms.
    • moment - Kurtosis and skew statistics.
    • ewma - Exponential moving average windows.
    • abs - Get absolute values of series.
    • arange - Range operation on series.
    • product - Compute the product of a series.
    • diff - diff operation.
    • pct_change - Compute change percentages.
    • unique_counts - Count unique values in expressions.
    • log - Logarithms for series.
    • list_to_struct - Convert List to Struct data types.
    • list_count - Count elements in lists.
    • list_eval - Apply expressions over list elements.
    • cumulative_eval - Apply expressions over cumulatively increasing windows.
    • arg_where - Get indices where condition holds.
    • search_sorted - Find indices where elements should be inserted to maintain order.
    • offset_by - Add an offset to dates that take months and leap years into account.
    • trigonometry - Trigonometric functions.
    • sign - Compute the sign (positive, negative, or zero) of each element in a Series.
    • propagate_nans - NaN-propagating min/max aggregations.
  • Dataframe pretty printing:
    • fmt - Activate dataframe formatting.