Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

nameziane
by New Contributor III
  • 14960 Views
  • 4 replies
  • 2 kudos

Set version (VERSION AS OF) dynamically from the return of a subquery

Hello, we have a business request to compare the evolution of a certain Delta table. We would like to compare the latest version of the table with the previous one using Delta time travel. The main issue we are facing is to retrieve programmatically us...

Latest Reply
apingle
Contributor
  • 2 kudos

In the docs it says that "Neither timestamp_expression nor version can be subqueries." So it does sound challenging. I also tried playing with widgets to see if it could be populated using SQL but didn't succeed. With Python it's really easy to do.
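A minimal Python sketch of that workaround (the table name "my_table" is a placeholder, not from the thread): look the latest version up from the table history, then splice it into the time-travel query as a literal.

# `spark` is the notebook's SparkSession
latest = spark.sql("DESCRIBE HISTORY my_table").agg({"version": "max"}).first()[0]
current = spark.sql(f"SELECT * FROM my_table VERSION AS OF {latest}")
previous = spark.sql(f"SELECT * FROM my_table VERSION AS OF {latest - 1}")
current.exceptAll(previous).show()  # rows present now but not in the previous version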

  • 2 kudos
3 More Replies
cmilligan
by Contributor II
  • 2718 Views
  • 3 replies
  • 4 kudos

Resolved! Pass through if a job was run as scheduled or if manual

I have a notebook that sets up parameters for the run based on some job parameters set by the user as well as the current date of the run. I want to supersede some of this logic and just use the manual values if kicked off manually. Is there a way to...

Latest Reply
SS2
Valued Contributor
  • 4 kudos

You can create widgets by using this: dbutils.widgets.text("widgetName", ""). To get the value for that widget: dbutils.widgets.get("widgetName"). So by using this you can manually create widgets (variables) and run the process by giving the desired valu...
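A minimal sketch of that widget approach; the widget name "run_mode" and its values are illustrative assumptions, not from the thread.

# Create the widget with a default that the scheduled job will use;
# a manual run can change the widget value before executing the notebook.
dbutils.widgets.text("run_mode", "scheduled")
run_mode = dbutils.widgets.get("run_mode")

if run_mode == "manual":
    pass  # use the manually supplied parameter values
else:
    pass  # fall back to the date-driven logic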

  • 4 kudos
2 More Replies
rrussell25
by New Contributor
  • 1571 Views
  • 1 reply
  • 0 kudos

Read arguments in a Scala notebook invoked by a job

In a Scala notebook, how do I read input arguments (e.g. those provided by a job that runs a Scala notebook)? In Python, dbutils.notebook.entry_point.getCurrentBindings() works. How about for Scala?

Latest Reply
UmaMahesh1
Honored Contributor III
  • 0 kudos

Hi @Robert Russell You can use dbutils.notebook.getContext.currentRunId in Scala notebooks. Other methods are also available, like dbutils.notebook.getContext.jobGroup, dbutils.notebook.getContext.rootRunId, dbutils.notebook.getContext.tags, etc. You ...

  • 0 kudos
RajibRajib_Mand
by New Contributor III
  • 4345 Views
  • 2 replies
  • 0 kudos

Reading a password-protected Excel (.xlsx) file in Databricks

I want to read a password-protected Excel file and load the data into a Delta table. Can you please let me know how this can be achieved in Databricks?

Latest Reply
igorsalo22
New Contributor II
  • 0 kudos

df = spark.read.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Base'!A1") \
    .option("header", "true") \
    .option("workbookPassword", "test") \
    .load("test.xlsx")
display(df)
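Note that this snippet relies on the third-party spark-excel connector (Maven coordinates com.crealytics:spark-excel), which is not bundled with Databricks; it has to be installed on the cluster as a library before the code above will run.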

  • 0 kudos
1 More Replies
DK03
by Contributor
  • 2611 Views
  • 2 replies
  • 2 kudos
Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

As @Werner Stinckens said, it would be ok. But generally decimal column joins are not recommended, as other factors come into play like precision, length, etc. Also, when you are joining on decimal columns, be sure to check out the abs value of...

  • 2 kudos
1 More Replies
fury88
by New Contributor II
  • 2743 Views
  • 1 reply
  • 1 kudos

Does CACHE TABLE/VIEW have a create or replace like view?

I'm trying to cache data/queries that we normally have as temporary views that get replaced when the code is run based on dynamic python. What I'd like to know is will CACHE TABLE get overwritten each time you run it? Is it smart enough to recognize ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 1 kudos

Hi @Matt Fury Yes... I guess the cache overwrites each time you run it, because for me it took nearly the same amount of time for 1 million records to be cached. However, you can check whether the table is cached or not using the .storageLevel method. E.g. I have...
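A minimal sketch of that check (the view name "my_view" is a placeholder, not from the thread):

# `spark` is the notebook's SparkSession; "my_view" must already exist as a table or temp view
spark.sql("CACHE TABLE my_view")
print(spark.catalog.isCached("my_view"))    # True once the data is cached
print(spark.table("my_view").storageLevel)  # storage level backing the cache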

  • 1 kudos
Durbinar
by New Contributor III
  • 6348 Views
  • 4 replies
  • 4 kudos

Resolved! Azure Databricks Default DNS

My Azure Databricks workspace default DNS is 168.63.129.16. This DNS doesn't seem to resolve Azure storage accounts which were created a year ago; after tweaking the cluster to use 8.8.8.8 I was able to resolve the desired storage accounts. Is there a d...

Latest Reply
Durbinar
New Contributor III
  • 4 kudos

IP address 168.63.129.16 is a virtual public IP address that is used to facilitate a communication channel to Azure platform resources. Customers can define any address space for their private virtual network in Azure. Therefore, the Azure platform...

  • 4 kudos
3 More Replies
200723
by New Contributor II
  • 3185 Views
  • 3 replies
  • 3 kudos

"No SRV records" intermittent error when running Databricks Pyspark to connect Mongo Atlas

My Mongo Atlas connection URL is like mongodb+srv://<srv_hostname>. I don't want to use a direct URL like mongodb://<hostname1, hostname2, hostname3....> because our Mongo Atlas global clusters have many hosts; it would be hard to maintain. Our Java programs...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 3 kudos

Hi @Raymond Lai The issue looks to be on the MongoDB connector side. The connection is created and maintained by the mongo-spark connector. You can try using the direct mongodb hosts in the connection string instead of SRV to avoid doing DNS lookups or...
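A heavily hedged sketch of that suggestion, assuming the mongo-spark connector 10.x is installed on the cluster; the hosts, database, and collection names are placeholders, and option names differ between connector versions.

df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host1:27017,host2:27017")  # direct hosts instead of +srv
      .option("database", "mydb")
      .option("collection", "mycollection")
      .load())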

  • 3 kudos
2 More Replies
Dicer
by Valued Contributor
  • 7965 Views
  • 4 replies
  • 5 kudos

Is it reasonable for the process "Determining the location of DBIO file fragments." to take me 7 hours?

I only have 1000 columns. Each column has 252 rows, so there are only 252,000 data points. How come it can route tasks for the best-cached locality for 7 hours?

Latest Reply
Noopur_Nigam
Databricks Employee
  • 5 kudos

Hi @Cheuk Hin Christophe Poon Have you optimized your table at any time since its creation? If not, then OPTIMIZE may take some time depending on the number of underlying files. Please try to run OPTIMIZE manually as described in the document below: https://docs....
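A minimal sketch of running that manually from a notebook (the table name "my_table" is a placeholder):

spark.sql("OPTIMIZE my_table")                 # compacts small underlying files
spark.sql("DESCRIBE HISTORY my_table").show()  # confirms the OPTIMIZE operation was recorded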

  • 5 kudos
3 More Replies
shrutis23
by New Contributor III
  • 5369 Views
  • 4 replies
  • 4 kudos

How to use Delta Live Tables with Google Cloud Storage

Hi Team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how to access the GCS bucket? We have a connection established using a Databricks service account. In normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Contributor
  • 4 kudos

Kindly mount the GCS bucket to a DBFS location; see below: Mounting cloud object storage on Databricks | Databricks on Google Cloud
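A minimal sketch of that mount, assuming the cluster's Google service account already has read access to the bucket; the bucket and mount-point names are placeholders.

bucket_name = "my-gcs-bucket"
mount_name = "my-gcs-mount"
dbutils.fs.mount(f"gs://{bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))  # list the bucket contents through the mount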

  • 4 kudos
3 More Replies
SS2
by Valued Contributor
  • 8162 Views
  • 4 replies
  • 3 kudos

Spark out of memory error. You can resolve this error by increasing the size of the cluster in Databricks.


Latest Reply
DK03
Contributor
  • 3 kudos

Adding some more points to @karthik p's answer:
  • Use the Kryo serializer instead of the Java serializer.
  • Use an optimized garbage collector such as G1GC.
  • Use partitioning wisely on a field.
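A minimal sketch of those settings. On Databricks they are normally entered in the cluster's Spark config rather than in code, but the equivalent SparkSession builder form (for a plain Spark job) looks like this; the keys are standard Spark configuration options.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
         .getOrCreate())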

  • 3 kudos
3 More Replies
cchiulan
by New Contributor III
  • 3433 Views
  • 3 replies
  • 7 kudos

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I cannot figure it out. When the cluster is running, in `driver logs`, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, c...

Latest Reply
Wolf
New Contributor II
  • 7 kudos

We're having the same problem with 11.3 LTS. Are there any updates? We would like to deliver log4j messages from Databricks Notebooks to custom log files and then upload those to S3 or DBFS. Best

  • 7 kudos
2 More Replies
Mado
by Valued Contributor II
  • 43752 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset: # Prepare Data data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5), ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark will keep a random row. Therefore, you should define logic to remove the duplicated rows.
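A minimal PySpark sketch for the original question: keep every row whose key columns appear more than once. The column names follow the sample data but are assumptions.

from pyspark.sql import functions as F, Window

# `spark` is the notebook's SparkSession
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

w = Window.partitionBy("col1", "col2")
duplicates = (df.withColumn("cnt", F.count("*").over(w))
                .filter("cnt > 1")
                .drop("cnt"))
duplicates.show()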

  • 10 kudos
2 More Replies
Shalabh007
by Honored Contributor
  • 5286 Views
  • 5 replies
  • 19 kudos

Practice Exams for Databricks Certified Data Engineer Professional exam

Can anyone help with an official practice exam set for the Databricks Certified Data Engineer Professional exam, like the one below for the Databricks Certified Data Engineer Associate: Practice exam for the Databricks Certified Data Engineer Associate exam

Latest Reply
Nayan7276
Valued Contributor II
  • 19 kudos

Hi @Shalabh Agarwal I am not able to find any official practice paper. It is still not available.

  • 19 kudos
4 More Replies
