Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ajay-Pandey
by Esteemed Contributor III
  • 1513 Views
  • 2 replies
  • 9 kudos

Kafka integration with Databricks

Hi all, I want to integrate Kafka with Databricks. If anyone can share any docs or code, it would help me a lot. Thanks in advance.

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

This is code that I am using to read from Kafka:
inputDF = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", host)
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("ka...

1 More Replies
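
For reference, a fuller version of the snippet above might look like the following minimal sketch. The bootstrap server, topic name, and API key/secret are placeholders rather than values from the thread, and the JAAS class name assumes the Kafka client shaded into the Databricks runtime.

# Minimal sketch of a structured-streaming read from Kafka in a Databricks notebook.
# host, topic, api_key and api_secret are placeholders.
host = "broker-1.example.com:9093"
topic = "my_topic"
api_key = "API_KEY"
api_secret = "API_SECRET"

inputDF = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", host)
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_SSL")
    .option(
        "kafka.sasl.jaas.config",
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="{api_key}" password="{api_secret}";',
    )
    .option("subscribe", topic)
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to string before parsing.
messages = inputDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
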
rpshgupta
by New Contributor III
  • 7055 Views
  • 8 replies
  • 6 kudos

Databricks notebook failed with "Caused by: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, HEAD, https://adls.dfs.core.windows.net/raw/file.csv?upn=false&action=getStatus&timeout=90".

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 458.0 failed 4 times, most recent failure: Lost task 0.3 in stage 458.0 (TID 2247) (172.18.102.75 executor 1): com.databricks.sql.io.FileReadException: Error while rea...

Latest Reply
Vidula
Honored Contributor
  • 6 kudos

Hi @Rupesh gupta, hope you are well. Just wanted to see if you were able to find an answer to your question and, if so, whether you would like to mark an answer as best? It would be really helpful for the other members too. Cheers!

7 More Replies
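
One way to narrow down a 404 like this is to confirm the path actually exists before reading it. A small hedged sketch, with a placeholder ABFSS path:

# Placeholder path; adjust container, storage account and file name to your setup.
path = "abfss://raw@adls.dfs.core.windows.net/file.csv"

try:
    dbutils.fs.ls(path)  # raises an exception if the path does not exist (404)
    df = spark.read.option("header", "true").csv(path)
except Exception as e:
    print(f"Path is missing or not reachable: {e}")
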
Ryan_Chynoweth
by Esteemed Contributor
  • 3318 Views
  • 2 replies
  • 4 kudos

Connecting to Azure SQL from Azure Databricks with firewalls

We are trying to connect to an Azure SQL Server from Azure Databricks using JDBC, but have faced issues because our firewall blocks everything. We decided to whitelist IPs from the SQL Server side and add a public subnet to make the connection work. ...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 4 kudos

Using subnets for Databricks connectivity is the correct thing to do. This way you ensure the resources (clusters) can connect to the SQL Database. We also recommend using NPIP (No Public IPs) so that there won't be any public ip associated with the...

1 More Replies
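
Once the networking (subnets/NPIP and SQL-side firewall rules) is in place, the JDBC read itself is straightforward. A minimal sketch, with placeholder server, database, table, and a hypothetical secret scope:

# Placeholder Azure SQL connection details.
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydb;encrypt=true;trustServerCertificate=false;loginTimeout=30;"
)

df = (
    spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))          # hypothetical secret scope/keys
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load()
)
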
nameziane
by New Contributor III
  • 7667 Views
  • 4 replies
  • 2 kudos

Set version (VERSION AS OF) dynamically from return of a subquery

Hello, we have a business request to compare the evolution of a certain Delta table. We would like to compare the latest version of the table with the previous one using Delta time travel. The main issue we are facing is to retrieve programmatically us...

Latest Reply
apingle
Contributor
  • 2 kudos

In the docs it says that "Neither timestamp_expression nor version can be subqueries." So it does sound challenging. I also tried playing with widgets to see if the version could be populated using SQL, but didn't succeed. With Python it's really easy to do.

3 More Replies
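
A minimal Python sketch of the approach the reply hints at: read the version numbers from DESCRIBE HISTORY and pass them to versionAsOf. The table name is a placeholder.

from pyspark.sql import functions as F

table_name = "my_delta_table"  # placeholder

# Grab the two most recent version numbers from the Delta history.
history = spark.sql(f"DESCRIBE HISTORY {table_name}")
latest, previous = [
    row.version for row in history.orderBy(F.col("version").desc()).limit(2).collect()
]

current_df = spark.read.option("versionAsOf", latest).table(table_name)
previous_df = spark.read.option("versionAsOf", previous).table(table_name)

# Example comparison: rows present in the latest version but not in the previous one.
new_rows = current_df.exceptAll(previous_df)
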
cmilligan
by Contributor II
  • 1798 Views
  • 3 replies
  • 4 kudos

Resolved! Pass through if a job was run as scheduled or if manual

I have a notebook that sets up parameters for the run based on some job parameters set by the user as well as the current date of the run. I want to supersede some of this logic and just use the manual values if kicked off manually. Is there a way to...

Latest Reply
SS2
Valued Contributor
  • 4 kudos

You can create widgets by using this: dbutils.widgets.text("widgetName", ""). To get the value for that widget: dbutils.widgets.get("widgetName"). So by using this you can manually create widgets (variables) and run the process by giving the desired valu...

2 More Replies
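
A small sketch of that idea: give each widget a default so a scheduled run works unattended, and let a manual run override it. The widget names below are hypothetical.

from datetime import date

# Widgets with defaults: a scheduled job can ignore them, a manual run can override them.
dbutils.widgets.text("run_mode", "scheduled")
dbutils.widgets.text("run_date", "")

run_mode = dbutils.widgets.get("run_mode")
run_date = dbutils.widgets.get("run_date")

if run_mode == "manual" and run_date:
    effective_date = run_date                  # value typed in by the user
else:
    effective_date = date.today().isoformat()  # normal scheduled-run logic
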
rrussell25
by New Contributor
  • 1041 Views
  • 1 reply
  • 0 kudos

Read arguments in a Scala notebook invoked by a job.

In a Scala notebook, how do I read input arguments (e.g. those provided by a job that runs a Scala notebook)? In Python, dbutils.notebook.entry_point.getCurrentBindings() works. How about for Scala?

Latest Reply
UmaMahesh1
Honored Contributor III
  • 0 kudos

Hi @Robert Russell You can use dbutils.notebook.getContext.currentRunId in Scala notebooks. Other methods are also available, like dbutils.notebook.getContext.jobGroup, dbutils.notebook.getContext.rootRunId, dbutils.notebook.getContext.tags, etc. You ...

RajibRajib_Mand
by New Contributor III
  • 2718 Views
  • 2 replies
  • 0 kudos

Reading a password-protected Excel (.xlsx) file in Databricks

I want to read a password-protected Excel file and load the data into a Delta table. Can you please let me know how this can be achieved in Databricks?

Latest Reply
igorsalo22
New Contributor II
  • 0 kudos

df = spark.read.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Base'!A1") \
    .option("header", "true") \
    .option("workbookPassword", "test") \
    .load("test.xlsx")
display(df)

1 More Replies
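
To cover the second half of the question (loading into a Delta table), the DataFrame read in the reply above can then be written out; the target table name below is a placeholder.

# Write the Excel data read above into a (placeholder) Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("my_excel_data")
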
DK03
by Contributor
  • 1625 Views
  • 2 replies
  • 2 kudos
Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

As @Werner Stinckens said, it would be OK. But generally, joins on decimal columns are not recommended, as other factors come into play, like precision, length, etc. Also, when you are joining on decimal columns, be sure to check out the abs value of...

1 More Replies
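
To make that concrete, one common precaution is to cast both sides to the same precision and scale before joining. A hedged sketch with made-up data:

from decimal import Decimal
from pyspark.sql import functions as F

# Two tiny DataFrames whose decimal columns have different precision/scale.
df_a = spark.createDataFrame([(Decimal("10.50"),)], "amount decimal(10,2)")
df_b = spark.createDataFrame([(Decimal("10.500"),)], "amount decimal(12,3)")

# Cast both sides to a common decimal type so equality on the join key behaves predictably.
df_a2 = df_a.withColumn("amount", F.col("amount").cast("decimal(18,3)"))
df_b2 = df_b.withColumn("amount", F.col("amount").cast("decimal(18,3)"))

joined = df_a2.join(df_b2, on="amount", how="inner")
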
fury88
by New Contributor II
  • 1561 Views
  • 1 reply
  • 1 kudos

Does CACHE TABLE/VIEW have a create or replace like view?

I'm trying to cache data/queries that we normally have as temporary views that get replaced when the code is run based on dynamic python. What I'd like to know is will CACHE TABLE get overwritten each time you run it? Is it smart enough to recognize ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 1 kudos

Hi @Matt Fury Yes, I guess the cache is overwritten each time you run it, because for me it took nearly the same amount of time for 1 million records to be cached. However, you can check whether the table is cached or not using the .storageLevel method. E.g. I have...

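
A quick sketch of the check the reply mentions, with a placeholder table name:

# Cache (or re-cache) the table; per the reply above, re-running this simply refreshes the cache.
spark.sql("CACHE TABLE my_table")            # placeholder table name

print(spark.catalog.isCached("my_table"))    # True once the table is cached
print(spark.table("my_table").storageLevel)  # storage level of the underlying DataFrame
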
Durbinar
by New Contributor III
  • 4090 Views
  • 4 replies
  • 4 kudos

Resolved! Azure Databricks Default DNS

My Azure Databricks workspace default DNS is 168.63.129.16. This DNS doesn't seem to resolve Azure storage accounts which were created a year ago. After tweaking the cluster to use 8.8.8.8, we are able to resolve the desired storage accounts. Is there a d...

Latest Reply
Durbinar
New Contributor III
  • 4 kudos

IP address 168.63.129.16 is a virtual public IP address that is used to facilitate a communication channel to Azure platform resources. Customers can define any address space for their private virtual network in Azure. Therefore, the Azure platform...

3 More Replies
200723
by New Contributor II
  • 2169 Views
  • 4 replies
  • 4 kudos

"No SRV records" intermittent error when running Databricks Pyspark to connect Mongo Atlas

My Mongo Atlas connection URL is like mongodb+srv://<srv_hostname>. I don't want to use a direct URL like mongodb://<hostname1, hostname2, hostname3....> because our Mongo Atlas global clusters have many hosts; it would be hard to maintain. Our Java programs...

Latest Reply
Noopur_Nigam
Valued Contributor
  • 4 kudos

Hi @Raymond Lai The issue looks to be in the MongoDB connector. The connection is created and maintained by the mongo-spark connector. You can try using the direct mongodb hosts in the connection string instead of SRV to avoid doing DNS lookups or...

3 More Replies
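
A minimal sketch of the suggested workaround, listing the hosts directly instead of the SRV record. It assumes the MongoDB Spark connector 10.x is installed on the cluster; hosts, database, and collection are placeholders.

# Direct host list instead of mongodb+srv, to avoid SRV/DNS lookups.
direct_uri = (
    "mongodb://host1.example.net:27017,"
    "host2.example.net:27017,"
    "host3.example.net:27017"
)

df = (
    spark.read
    .format("mongodb")                     # connector v10.x format name
    .option("connection.uri", direct_uri)
    .option("database", "mydb")
    .option("collection", "mycollection")
    .load()
)

The trade-off is exactly the one raised in the question: the host list then has to be maintained by hand.
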
Dicer
by Valued Contributor
  • 6231 Views
  • 5 replies
  • 7 kudos

Is it reasonable for the process "Determining the location of DBIO file fragments." to take me 7 hours?

I only have 1,000 columns. Each column has 252 rows, so there are only 252,000 data points. How come it can route tasks for the best-cached locality for 7 hours?

Latest Reply
Noopur_Nigam
Valued Contributor
  • 7 kudos

Hi @Cheuk Hin Christophe Poon Have you optimized your table at any time since its creation? If not, then OPTIMIZE may take some time depending on the number of underlying files. Please try to run OPTIMIZE manually as described in the document below: https://docs....

4 More Replies
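
The manual OPTIMIZE the reply refers to can be run from a notebook; the table and column names below are placeholders.

# Compact the Delta table's underlying files.
spark.sql("OPTIMIZE my_delta_table")

# Optionally co-locate data on a frequently filtered column (Databricks ZORDER).
spark.sql("OPTIMIZE my_delta_table ZORDER BY (event_date)")
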
shrutis23
by New Contributor III
  • 3536 Views
  • 5 replies
  • 4 kudos

How to use delta live table with google cloud storage

Hi team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how to access the GCS bucket? We have a connection established using a Databricks service account. In normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Contributor
  • 4 kudos

Kindly mount the GCS bucket to a DBFS location; see below: Mounting cloud object storage on Databricks | Databricks on Google Cloud

4 More Replies
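
A minimal sketch of the mount the reply points to, assuming the cluster already runs with a Google service account that can read the bucket; the bucket and mount names are placeholders.

bucket_name = "my-gcs-bucket"
mount_point = "/mnt/my-gcs-bucket"

# Mount only if it is not already mounted.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(f"gs://{bucket_name}", mount_point)

display(dbutils.fs.ls(mount_point))
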
