Data Engineering

Forum Posts

by chhavibansal • New Contributor III

01-17-2023 1:22:22 AM

1473 Views
1 replies
0 kudos

What is the upper bound for dataSkippingNumIndexedCols, to keep stats in the Delta log file?

Is there an upper bound on the number that I can assign to delta.dataSkippingNumIndexedCols for computing statistics? Is there a trade-off benchmark available for increasing this number beyond 32?

Latest Reply

Anonymous
Not applicable

03-08-2023 8:21:43 PM

0 kudos

@Chhavi Bansal: The delta.dataSkippingNumIndexedCols configuration property controls the maximum number of columns for which Delta Lake collects statistics for data skipping. By default, this value is set to 32. There is no hard upper bound on th...
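
As a minimal sketch of how this property is commonly set on an existing table: the table name `events` and the value 40 below are placeholders chosen for illustration, not recommendations from the thread.

```sql
-- Hypothetical table name `events`; 40 is an illustrative value only.
-- Raises the number of leading columns for which Delta collects file-level statistics.
ALTER TABLE events
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');

-- Check that the property is now set on the table.
SHOW TBLPROPERTIES events;
```

Note that changing the property affects statistics collected for files written afterwards; statistics for existing files are not recomputed automatically.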

by chhavibansal • New Contributor III

11-18-2022 11:08:00 AM

5514 Views
4 replies
1 kudos

ANALYZE TABLE showing NULLs for all statistics in Spark

var df2 = spark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("src/main/resources/datasets/titanic.csv")
df2.createOrReplaceTempView("titanic")
spark.table("titanic").cach...

Latest Reply

chhavibansal
New Contributor III

12-03-2022 11:12:25 PM

1 kudos

Can you share what *newtitanic* is? I think you would have done something similar to spark.sql("create table newtitanic as select * from titanic"). Something like this works for me, but the issue is that I first make a temp view and then again create a tab...

by aladda • Databricks Employee

06-23-2021 9:26:12 PM

3514 Views
1 replies
0 kudos

Resolved! How are stats collected on a Delta column utilized?

Latest Reply

aladda
Databricks Employee

06-23-2021 9:28:13 PM

0 kudos

Stats collected on a Delta column are used for partition pruning and data skipping. See https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-data-skipping for details. In addition, stats are also used for metadata-only q...

by aladda • Databricks Employee

06-23-2021 9:25:48 PM

1841 Views
0 replies
0 kudos

What are the recommendations around collecting stats on long strings in a Delta table?

It is best to avoid collecting stats on long strings. You typically want to collect stats on columns that are used in filters, WHERE clauses, and joins, and on which you tend to perform aggregations, typically numerical values. You can avoid collecting s...

by aladda • Databricks Employee

06-23-2021 9:19:52 PM

6649 Views
1 replies
0 kudos

Resolved! How many columns does Delta Engine collect stats on for a Delta table?

Latest Reply

aladda
Databricks Employee

06-23-2021 9:22:03 PM

0 kudos

By default, a Delta table has stats collected on the first 32 columns. This setting can be configured using the following:

set spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

However, there's a time trade-off to having a large n...

by dshosseinyousef • New Contributor II

09-22-2016 1:29:26 AM

10175 Views
2 replies
0 kudos

How to calculate a quantile on grouped data in a Spark DataFrame

I have the following Spark DataFrame:

agent_id / payment_amount
a / 1000
b / 1100
a / 1100
a / 1200
b / 1200
b / 1250
a / 10000
b / 9000

My desired output would be something like:

agent_id 95_quantile
a whatever is 95 quantile for a...

Latest Reply

Weiluo__David_R
New Contributor II

12-30-2016 10:17:54 AM

0 kudos

For those of you who haven't run into this SO thread, http://stackoverflow.com/questions/39633614/calculate-quantile-on-grouped-data-in-spark-dataframe, it's pointed out there that one workaround is to use the Hive UDF "percentile_approx". Please see th...
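
A minimal sketch of that workaround, assuming the DataFrame from the question has been registered as a temporary view named `payments` (a hypothetical name) with the columns agent_id and payment_amount:

```sql
-- Assumes a temp view `payments` (hypothetical name) holding the question's data.
-- percentile_approx returns an approximate percentile; 0.95 requests the 95th.
SELECT
  agent_id,
  percentile_approx(payment_amount, 0.95) AS quantile_95
FROM payments
GROUP BY agent_id;
```

The result is approximate by design, so for small groups it may differ from an exact quantile computed over the sorted values.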

Databricks Community
