I have a parquet file with a column g1 with schema StructField(g1,IntegerType,true). Now I have a query with a filter on g1. What's weird in the SQL viewer is that Spark is loading all the rows from that file, even though in the physical plan I can see th...
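For anyone checking the same thing: whether the predicate actually reaches the parquet reader shows up as PushedFilters on the FileScan node of the physical plan. A minimal sketch (the path is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical location; substitute the real parquet path.
df = spark.read.parquet("/tmp/example.parquet")

# If pushdown works, the formatted plan's FileScan node lists the
# predicate under PushedFilters, e.g. [IsNotNull(g1), GreaterThan(g1,100)].
df.filter(col("g1") > 100).explain(mode="formatted")
```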
Data Engineering - CTAS - External Tables. Can someone help me understand why, in chapter 3.3, we cannot directly use CTAS with OPTIONS and LOCATION to specify the delimiter and location of a CSV? Or have I misunderstood? Details: In Data Engineering with Databri...
The 2nd statement, the CTAS, will not be able to parse the CSV in any manner, because it's just the FROM clause that points to a file. It's more of a traditional SQL statement with SELECT and FROM, and it will create a Delta table. This just happens to b...
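For reference, the usual pattern is two steps: first declare an external table over the CSV so the reader options apply, then CTAS from it, which materializes a Delta table. A sketch run from a notebook (table names and paths are hypothetical):

```python
# 1) Declare an external table over the CSV so OPTIONS are honored.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_csv
  (order_id INT, amount DOUBLE)
  USING CSV
  OPTIONS (header = 'true', delimiter = '|')
  LOCATION '/mnt/raw/sales'
""")

# 2) CTAS from it; the result is a managed Delta table.
spark.sql("""
  CREATE TABLE sales_delta AS
  SELECT * FROM sales_csv
""")
```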
When I click on the header "STEP 3" in the table of contents, it takes me to the correct section. However, when I click on "STEP 2", the table of contents stays on "STEP 3". This sometimes causes confusion. For consistency, is there any way to highligh...
That happens because your driver is not able to talk to your nodes. To address this you can add configuration to increase the Databricks heartbeat interval, and you can also raise the RPC max message size, which will also help. You can explore cluster configuration from here - htt...
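For reference, the relevant Spark settings are sketched below. They are static settings, so on Databricks they belong in the cluster's Spark config (Compute > cluster > Advanced options > Spark), not in a running notebook; the values are illustrative only:

```python
# Illustrative values; tune for your workload.
spark_conf = {
    "spark.executor.heartbeatInterval": "60s",  # default is 10s
    "spark.network.timeout": "600s",            # should exceed the heartbeat interval
    "spark.rpc.message.maxSize": "512",         # in MiB; default 128
}
```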
I'm trying to run a single job multiple times with different parameters, where the number of concurrent jobs is less than the number of parameter sets. I have a job (or task...) J that takes a parameter set p. I have 100 p values I want to run, however I onl...
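One way to throttle this from the client side is to wait on each run and cap the worker count; a sketch using the databricks-sdk (the job id and the parameter name "p" are placeholders, and the job's max_concurrent_runs setting must be at least the worker count):

```python
from concurrent.futures import ThreadPoolExecutor
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
JOB_ID = 123  # hypothetical job id
params = [{"p": str(i)} for i in range(100)]

def run_one(p):
    # run_now returns a waiter; .result() blocks until the run finishes,
    # so the executor's worker count caps how many runs are in flight.
    return w.jobs.run_now(job_id=JOB_ID, notebook_params=p).result()

# At most 10 runs in flight at any time.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_one, params))
```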
I am setting up dbx for the first time on Windows 10, strictly following https://dbx.readthedocs.io/en/latest/guides/python/python_quickstart/. OpenJDK is installed (conda install -c conda-forge openjdk=11.0.15), winutils.exe for Hadoop 3 is downloaded, pat...
Trying to optimize a Delta table with the following stats:
size: 212,848 blobs, 31,162,417,246,985 bytes
command: OPTIMIZE <table> ZORDER BY (X, Y, Z)
In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But ...
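For reference, the same command as run from a notebook, plus a way to inspect the resulting file layout (the table name is a placeholder; ZORDER columns should be high-cardinality columns that queries actually filter on):

```python
# Compact the table and co-locate rows by the Z-ordered columns.
spark.sql("OPTIMIZE my_table ZORDER BY (X, Y, Z)")

# Inspect file count and sizes after the optimize.
spark.sql("DESCRIBE DETAIL my_table").show(truncate=False)
```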
I want to try incorporating these options into my Databricks cluster:
spark.driver.extraJavaOptions -XX:+UseG1GC -XX:+G1SummarizeConcMark
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+G1SummarizeConcMark
If I put them under Compute -> Cluster -> Co...
Hey @Andrew Fogarty, I think this is only for the spark-submit command, not for the cluster UI. Please have a look at this doc - http://progexc.blogspot.com/2014/12/spark-configuration-mess-solved.html
spark.executor.extraJavaOptions: A string of extra JVM...
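For context, these JVM options are read at process startup, so outside the cluster UI they are usually supplied when the application is launched, e.g. when building the session (a sketch; they cannot be changed on an already-running cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)
```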
Hi All, I have created three clusters (dev, qa, prod) in the same workspace to isolate data for different environments. How do we differentiate environments while running a job, so that when using dev it updates data only for the dev environment?
Regards, Rajib
Hey @Rajib Rajib Mandal, this is very easy; I have done it multiple times. You can segregate data using the IAM role that is attached to the cluster, known as an instance profile. You can give the dev data access only to the dev role, and the s...
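On top of the instance-profile isolation, a common code-side pattern is to pass the environment as a job parameter and derive every data path from it, so the same code writes only to its own environment; a sketch (the bucket layout is hypothetical):

```python
# "env" is passed per job/environment: "dev", "qa", or "prod".
dbutils.widgets.text("env", "dev")
env = dbutils.widgets.get("env")

base_path = f"s3://my-company-{env}/data"  # hypothetical bucket layout
df = spark.read.format("delta").load(f"{base_path}/input")
```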
Sorting in Spark: how to sort null values first and last of the records in a Spark DataFrame? Please find the answer here: https://medium.com/@sharikrishna26/sorting-in-spark-a57db245ecd4
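In short, PySpark exposes null-ordering variants directly on Column; a minimal example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,), (3,)], ["g1"])

df.orderBy(col("g1").asc_nulls_first()).show()  # nulls sorted first
df.orderBy(col("g1").asc_nulls_last()).show()   # nulls sorted last
```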
Understanding Cluster Pools. Sometimes we want to run our Databricks code without any startup delay, for example when reports are urgent and the upstream team wants to save as much time as possible on cluster startup. In that case we can use a pool of cluste...
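For context, a job cluster draws from a pool by referencing the pool's id in its cluster spec; a sketch of the relevant fields as they would appear in a Jobs API payload (the pool id and versions are placeholders):

```python
# Workers are acquired from the pool's warm instances, skipping
# instance provisioning time on job start.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "num_workers": 2,
    "instance_pool_id": "0101-120000-pool1234",  # hypothetical pool id
}
```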
Databricks New Runtime Version is Available Now
PySpark memory profiling - memory profiling is now enabled for PySpark user-defined functions. This provides information on memory increment, memory usage, and number of occurrences for each line of code...
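A sketch of trying the profiler, assuming the cluster was started with spark.python.profile.memory set to true in its Spark config (the UDF below is just an example):

```python
from pyspark.sql.functions import udf

@udf("int")
def plus_one(x):
    return x + 1

df = spark.range(10).toDF("x")
df.select(plus_one("x")).collect()

sc.show_profiles()  # prints per-line memory stats for the UDF
```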
Hi, I am trying to pull data from Quickbase but it is giving me the error: "too large report". Below is the code I used:
%python
df = quickbasePull('b5zj8k_pbz5_0_cd5h4wbb77n4nvp95b4u','bq2nq8jm7',4)
2) I tried the below code but it's not displaying in correc...
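Since "too large report" usually means the result set exceeds a size limit, one workaround is to page through the records; a sketch against Quickbase's REST API (the realm, token, and page size are placeholders, and quickbasePull above is your own helper, not part of this):

```python
import requests

headers = {
    "QB-Realm-Hostname": "myrealm.quickbase.com",     # hypothetical realm
    "Authorization": "QB-USER-TOKEN your_token_here",  # hypothetical token
}
rows, skip, page = [], 0, 1000
while True:
    # Fetch one page of records at a time via skip/top options.
    body = {"from": "bq2nq8jm7", "options": {"skip": skip, "top": page}}
    data = requests.post("https://api.quickbase.com/v1/records/query",
                         json=body, headers=headers).json()
    rows.extend(data.get("data", []))
    if len(data.get("data", [])) < page:
        break
    skip += page
```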
I have tried the following ways to get job parameters, but none of them are working.
runId='{{run_id}}'
jobId='{{job_id}}'
filepath='{{filepath}}'
print(runId," ",jobId," ",filepath)
r1=dbutils.widgets.get('{{run_id}}')
f1=dbutils.widgets.get('{{file...
Thanks for your response. I found the solution. The below code gives me all the job parameters:
all_args = dbutils.notebook.entry_point.getCurrentBindings()
print(all_args)
Thanks for your support.
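One follow-up note: getCurrentBindings() returns a Java map, so it can be convenient to copy it into a plain Python dict before using it; a sketch assuming the call above:

```python
all_args = dbutils.notebook.entry_point.getCurrentBindings()

# Copy the Java map into a regular dict for normal Python access.
params = {key: all_args[key] for key in all_args}
print(params.get("run_id"), params.get("job_id"))
```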