DataSketches experimental adapters for Apache Spark by tomerbd

This repository provides Apache DataSketches experimental adapters for Apache Spark.
Please visit the main website for more DataSketches information.

Quantile Sketches

Like the built-in percentile estimation function (approx_percentile),
this plugin enables you to use an alternative function (approx_percentile_ex) to estimate percentiles
in a theoretically mergeable and very compact way:

>>> df.show(5)
+----------+-------------------+
|      Date|Global_active_power|
+----------+-------------------+
|2006-12-16|              4.216|
|2006-12-16|               5.36|
|2006-12-16|              5.374|
|2006-12-16|              5.388|
|2006-12-16|              3.666|
+----------+-------------------+
only showing top 5 rows

>>> df.describe().show(5, False)
+-------+-------------------+
|summary|Global_active_power|
+-------+-------------------+
|count  |2049280            |
|mean   |1.0916150365005453 |
|stddev |1.0572941610939872 |
|min    |0.076              |
|max    |11.122             |
+-------+-------------------+

>>> df.selectExpr("percentile(Global_active_power, 0.95) percentile", "approx_percentile(Global_active_power, 0.95) approx_percentile", "approx_percentile_ex(Global_active_power, 0.95) approx_percentile_ex").show()
+----------+-----------------+--------------------+
|percentile|approx_percentile|approx_percentile_ex|
+----------+-----------------+--------------------+
|     3.264|            3.264|                3.25|
+----------+-----------------+--------------------+
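
To put the difference between the exact and estimated values in perspective, the relative error of the approx_percentile_ex result can be worked out from the numbers printed above. This is plain arithmetic on the reported output, not a statement about the sketch's guaranteed error bound; the variable names are illustrative only.

```python
# Values copied from the example output above.
exact = 3.264      # percentile(Global_active_power, 0.95)
approx_ex = 3.25   # approx_percentile_ex(Global_active_power, 0.95)

# Absolute and relative error of the sketch-based estimate.
abs_err = abs(approx_ex - exact)
rel_err = abs_err / exact

print(f"absolute error: {abs_err:.3f}")
print(f"relative error: {rel_err:.2%}")  # about 0.43%
```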

Moreover, this plugin provides functionality to accumulate quantile summaries for each time interval and
estimate quantile values over specific intervals later, just like the Snowflake built-in functions:

>>> import pyspark.sql.functions as f
>>> summaries = df.groupBy(f.window("Date", "1 week")).agg(f.expr("approx_percentile_accumulate(Global_active_power) AS summaries"))
>>> summaries.show(3, 50)
+------------------------------------------+--------------------------------------------------+
|                                    window|                                         summaries|
+------------------------------------------+--------------------------------------------------+
|{2006-12-14 09:00:00, 2006-12-21 09:00:00}|[04 01 11 28 0C 00 07 00 AA 1D 00 00 00 00 00 0...|
|{2009-12-03 09:00:00, 2009-12-10 09:00:00}|[04 01 11 28 0C 00 05 00 9E 05 00 00 00 00 00 0...|
|{2009-10-22 09:00:00, 2009-10-29 09:00:00}|[04 01 11 28 0C 00 07 00 60 27 00 00 00 00 00 0...|
+------------------------------------------+--------------------------------------------------+
only showing top 3 rows

# Correct percentile of the `Global_active_power` column
>>> df.where("Date between '2007-06-01' and '2010-01-01'").selectExpr("percentile(Global_active_power, 0.95) correct").show()
+-------+
|correct|
+-------+
|  3.236|
+-------+

# Estimated percentile of the `Global_active_power` column
>>> df = summaries.where("window.start > '2007-06-01' and window.end < '2010-01-01'").selectExpr("approx_percentile_combine(summaries) merged")
>>> df.selectExpr("approx_percentile_estimate(merged, 0.95) percentile").show()
+----------+
|percentile|
+----------+
|      3.25|
+----------+

>>> df.selectExpr("approx_pmf_estimate(merged, 4) pmf").show(1, False)
+--------------------------------------------------------------------------------------+
|pmf                                                                                   |
+--------------------------------------------------------------------------------------+
|[0.9250280810398008, 0.07003322180158443, 0.004825778691690984, 1.1291846692380381E-4]|
+--------------------------------------------------------------------------------------+
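
The accumulate / combine / estimate flow above can be illustrated without Spark. The following is a minimal sketch under loud assumptions: `ToySummary` is a hypothetical stand-in that keeps every value (so it is exact and not compact), whereas the plugin stores real DataSketches summaries as bytes. It only shows why mergeable summaries let you answer quantile queries over arbitrary sets of intervals after the fact.

```python
import math
from collections import defaultdict


class ToySummary:
    """Hypothetical stand-in for a quantile sketch: stores all values, sorted."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def merge(self, other):
        # Merging two summaries yields a summary of the union of their inputs.
        return ToySummary(self.values + other.values)

    def quantile(self, q):
        # Nearest-rank quantile over the stored values.
        idx = max(0, math.ceil(q * len(self.values)) - 1)
        return self.values[idx]


# (week, value) rows; made-up data for illustration.
rows = [(1, 1.0), (1, 2.0), (1, 3.0), (2, 4.0), (2, 5.0),
        (3, 6.0), (3, 7.0), (3, 8.0), (3, 9.0), (3, 10.0)]

# Step 1 (accumulate): one summary per time interval.
grouped = defaultdict(list)
for week, value in rows:
    grouped[week].append(value)
summaries = {week: ToySummary(vals) for week, vals in grouped.items()}

# Step 2 (combine): merge only the intervals of interest, here weeks 2 and 3.
merged = summaries[2].merge(summaries[3])

# Step 3 (estimate): query the merged summary.
median_weeks_2_3 = merged.quantile(0.5)
print(median_weeks_2_3)
```

Because the per-interval summaries are stored once, a different range of weeks can be queried later by merging a different subset, without rescanning the raw rows.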

Configurations

Property Name	Default	Meaning
spark.sql.dataSketches.quantiles.sketchImpl	REQ	A sketch implementation used in quantile estimation functions.
spark.sql.dataSketches.quantiles.kll.k	200	Specifies the parameter `k` for the quantile sketch implementation named `KLL`, `KllFloatsSketch`.
spark.sql.dataSketches.quantiles.req.k	12	Specifies the parameter `k` for the quantile sketch implementation na
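
As a rough illustration of how these properties might be supplied, the snippet below turns a dictionary of settings into `--conf key=value` arguments of the kind `spark-submit` accepts. The property names come from the table above; using `KLL` as a value for `sketchImpl` is an assumption based on the implementation name mentioned in the table, and `to_submit_flags` is a hypothetical helper, not part of the plugin.

```python
# Property names from the table above. "KLL" as a valid sketchImpl value
# is an assumption; check the plugin's documentation before relying on it.
sketch_conf = {
    "spark.sql.dataSketches.quantiles.sketchImpl": "KLL",
    "spark.sql.dataSketches.quantiles.kll.k": "200",
}


def to_submit_flags(conf):
    """Hypothetical helper: render settings as spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_submit_flags(sketch_conf)
print(" ".join(flags))
```

The same key/value pairs could presumably also be passed through `SparkSession.builder.config(key, value)` when building a session.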
