This repository provides experimental Apache DataSketches adapters for Apache Spark.
Please visit the main website for more DataSketches information.
Quantile Sketches
Like the built-in percentile estimation function (`approx_percentile`),
this plugin enables you to use an alternative function (`approx_percentile_ex`) to estimate percentiles
in a theoretically mergeable and very compact way.
Moreover, this plugin provides functions to accumulate quantile summaries for each time interval and
later estimate quantile values over specific intervals, just like the Snowflake built-in functions:
>>> import pyspark.sql.functions as f
>>> summaries = df.groupBy(f.window("Date", "1 week")).agg(f.expr("approx_percentile_accumulate(Global_active_power) AS summaries"))
>>> summaries.show(3, 50)
+------------------------------------------+--------------------------------------------------+
| window| summaries|
+------------------------------------------+--------------------------------------------------+
|{2006-12-14 09:00:00, 2006-12-21 09:00:00}|[04 01 11 28 0C 00 07 00 AA 1D 00 00 00 00 00 0...|
|{2009-12-03 09:00:00, 2009-12-10 09:00:00}|[04 01 11 28 0C 00 05 00 9E 05 00 00 00 00 00 0...|
|{2009-10-22 09:00:00, 2009-10-29 09:00:00}|[04 01 11 28 0C 00 07 00 60 27 00 00 00 00 00 0...|
+------------------------------------------+--------------------------------------------------+
only showing top 3 rows
# Correct percentile of the `Global_active_power` column
>>> df.where("Date between '2007-06-01' and '2010-01-01'").selectExpr("percentile(Global_active_power, 0.95) correct").show()
+-------+
|correct|
+-------+
| 3.236|
+-------+
# Estimated percentile of the `Global_active_power` column
>>> df = summaries.where("window.start > '2007-06-01' and window.end < '2010-01-01'").selectExpr("approx_percentile_combine(summaries) merged")
>>> df.selectExpr("approx_percentile_estimate(merged, 0.95) estimated").show()
+----------+
| estimated|
+----------+
| 3.25|
+----------+
>>> df.selectExpr("approx_pmf_estimate(merged, 4) pmf").show(1, False)
+--------------------------------------------------------------------------------------+
|pmf |
+--------------------------------------------------------------------------------------+
|[0.9250280810398008, 0.07003322180158443, 0.004825778691690984, 1.1291846692380381E-4]|
+--------------------------------------------------------------------------------------+
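
The accumulate, combine, and estimate steps above rely on the summaries being mergeable. The following is a minimal pure-Python sketch of that flow, not the plugin's implementation: the helper names `accumulate`, `combine`, and `estimate` are hypothetical, and each toy summary keeps every value in sorted order, whereas a real quantile sketch stores a compact, approximate summary.

```python
import math

def accumulate(values):
    """Toy stand-in for a per-window summary: here, simply the sorted values."""
    return sorted(values)

def combine(summaries):
    """Merge several per-window summaries into one summary."""
    merged = []
    for summary in summaries:
        merged.extend(summary)
    merged.sort()
    return merged

def estimate(summary, q):
    """Nearest-rank quantile of a summary for 0 < q <= 1."""
    if not summary:
        raise ValueError("empty summary")
    rank = max(1, math.ceil(q * len(summary)))
    return summary[rank - 1]

# Hypothetical readings grouped by time window.
weekly = {
    "week1": [1.0, 2.0, 3.0, 4.0],
    "week2": [5.0, 6.0],
    "week3": [7.0, 8.0, 9.0, 10.0],
}

# Step 1: summarize each window once and store the result.
summaries = {week: accumulate(vals) for week, vals in weekly.items()}

# Step 2: later, merge only the windows of interest.
merged = combine([summaries["week1"], summaries["week3"]])

# Step 3: query the merged summary without rescanning the raw data.
median = estimate(merged, 0.5)
print(median)
```

The point of the design is that the stored per-window summaries can be recombined for any later choice of interval, so the raw rows only need to be read once.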
Configurations
Property Name | Default | Meaning |
---|---|---|
spark.sql.dataSketches.quantiles.sketchImpl | REQ | The sketch implementation used by the quantile estimation functions. |
spark.sql.dataSketches.quantiles.kll.k | 200 | Specifies the parameter k for the quantile sketch implementation named KLL, KllFloatsSketch. |
spark.sql.dataSketches.quantiles.req.k | 12 | Specifies the parameter k for the quantile sketch implementation na |
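
As a small illustration of how these properties might be supplied, the snippet below turns a dictionary of settings into `--conf key=value` arguments of the kind `spark-submit` accepts. The property names come from the table above; the helper `to_conf_args` is hypothetical, and using `KLL` as the value of `sketchImpl` is an assumption based on the implementation name mentioned in the table.

```python
# Property names are taken from the table above; the values are examples only.
settings = {
    "spark.sql.dataSketches.quantiles.sketchImpl": "KLL",  # assumed to be an accepted value
    "spark.sql.dataSketches.quantiles.kll.k": 200,
}

def to_conf_args(props):
    """Build a flat list of '--conf key=value' arguments, sorted by key."""
    args = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"{key}={value}"]
    return args

conf_args = to_conf_args(settings)
print(" ".join(conf_args))
```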