Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Durbinar
by New Contributor III
  • 7780 Views
  • 4 replies
  • 4 kudos

Resolved! Azure Databricks Default DNS

My Azure Databricks workspace's default DNS is 168.63.129.16. This DNS doesn't seem to resolve Azure storage accounts that were created a year ago; after tweaking the cluster to use 8.8.8.8, the desired storage accounts can be resolved. Is there a d...

Latest Reply
Durbinar
New Contributor III
  • 4 kudos

IP address 168.63.129.16 is a virtual public IP address that is used to facilitate a communication channel to Azure platform resources. Customers can define any address space for their private virtual network in Azure. Therefore, the Azure platform...

3 More Replies
200723
by New Contributor II
  • 3853 Views
  • 3 replies
  • 3 kudos

"No SRV records" intermittent error when running Databricks Pyspark to connect Mongo Atlas

My MongoDB Atlas connection URL is like mongodb+srv://<srv_hostname>. I don't want to use a direct URL like mongodb://<hostname1, hostname2, hostname3....> because our MongoDB Atlas global clusters have many hosts and that would be hard to maintain. Our Java programs...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 3 kudos

Hi @Raymond Lai, the issue looks to be in the MongoDB connector. The connection is created and maintained by the mongo-spark connector. You can try using the direct mongodb hosts in the connection string instead of SRV to avoid doing DNS lookups or...

2 More Replies
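Expanding on the reply above, here is a minimal PySpark sketch of what a direct (non-SRV) connection string might look like with the MongoDB Spark Connector 10.x; the hosts, credentials, replica-set name, database, and collection are all placeholders.

```python
# Hypothetical hosts/credentials; replace with your Atlas replica-set members.
# Listing the hosts directly avoids the SRV (DNS) lookup that was failing intermittently.
direct_uri = (
    "mongodb://user:password@host1.mongodb.net:27017,"
    "host2.mongodb.net:27017,host3.mongodb.net:27017/"
    "?ssl=true&replicaSet=atlas-xxxx-shard-0&authSource=admin"
)

df = (
    spark.read.format("mongodb")           # MongoDB Spark Connector 10.x
    .option("connection.uri", direct_uri)
    .option("database", "mydb")            # assumed database name
    .option("collection", "mycollection")  # assumed collection name
    .load()
)
df.show(5)
```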
Dicer
by Valued Contributor
  • 8986 Views
  • 4 replies
  • 5 kudos

Is it reasonable for the process "Determining the location of DBIO file fragments" to take 7 hours?

I only have 1000 columns. Each column has 252 rows, so there are only 252,000 data points. How can routing tasks for the best cached locality take 7 hours?

Latest Reply
Noopur_Nigam
Databricks Employee
  • 5 kudos

Hi @Cheuk Hin Christophe Poon, have you optimized your table at any time since its creation? If not, OPTIMIZE may take some time depending on the number of underlying files. Please try to run OPTIMIZE manually as described in the document below: https://docs....

3 More Replies
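For reference, a minimal sketch of running OPTIMIZE manually from a notebook, as the reply suggests; the table name and ZORDER column are hypothetical.

```python
# Compact the table's many small files; a slow "Determining the location of
# DBIO file fragments" phase is often a symptom of an un-optimized table.
spark.sql("OPTIMIZE my_catalog.my_schema.my_table")

# Optionally co-locate data on a commonly filtered column (column is assumed).
spark.sql("OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (event_date)")
```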
shrutis23
by New Contributor III
  • 6733 Views
  • 4 replies
  • 4 kudos

How to use Delta Live Tables with Google Cloud Storage

Hi Team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how do we access the GCS bucket? We have a connection established using a Databricks service account. In normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Databricks Partner
  • 4 kudos

Kindly mount the GCS cloud storage to a DBFS location; see below: Mounting cloud object storage on Databricks | Databricks on Google Cloud

3 More Replies
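As a rough illustration of the mount-free alternative, here is a Delta Live Tables sketch that reads straight from a gs:// path, assuming the pipeline's service account already has access to the bucket; the bucket, path, and file format are placeholders.

```python
import dlt

# Hypothetical bucket/path; the DLT pipeline's GCS service account is assumed
# to have read access, so the gs:// URI is referenced directly (no mount).
SOURCE_PATH = "gs://my-bucket/raw/events/"

@dlt.table(comment="Raw events ingested from a GCS bucket")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader
        .option("cloudFiles.format", "json")
        .load(SOURCE_PATH)
    )
```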
SS2
by Valued Contributor
  • 11476 Views
  • 4 replies
  • 3 kudos

Spark out of memory error. You can resolve this error by increasing the size of the cluster in Databricks.

Spark out of memory error. You can resolve this error by increasing the size of the cluster in Databricks.

Latest Reply
DK03
Contributor
  • 3 kudos

Adding some more points to @karthik p's answer: use the Kryo serializer instead of the Java serializer, use an optimized garbage collector such as G1GC, and use partitioning wisely on a field.

3 More Replies
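To make the reply concrete, here is a sketch of the cluster-level Spark settings it refers to; these would typically go in the cluster's Spark config (they cannot be changed on an already-running session), and the repartition column is hypothetical.

```python
# Cluster-level Spark config suggested in the reply (Kryo serializer + G1GC).
spark_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    "spark.driver.extraJavaOptions": "-XX:+UseG1GC",
}

# Partitioning wisely on a well-distributed field also reduces per-task memory
# pressure before wide transformations (the column name is a placeholder).
# df = df.repartition(200, "customer_id")
```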
cchiulan
by Databricks Partner
  • 4398 Views
  • 3 replies
  • 7 kudos

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I cannot figure it out. When the cluster is running, in `driver logs`, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, c...

Latest Reply
Wolf
New Contributor II
  • 7 kudos

We're having the same problem with 11.3 LTS. Are there any updates? We would like to deliver log4j messages from Databricks Notebooks to custom log files and then upload those to S3 or DBFS. Best

2 More Replies
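While the thread is about a log4j appender, one workaround (a sketch, not a log4j configuration) is to write notebook logs with Python's logging module and copy the file to DBFS or a mounted S3 path at the end of the run; the paths and logger name below are placeholders.

```python
import logging

# Workaround sketch: log to a local file on the driver, then copy to DBFS.
LOCAL_LOG = "/tmp/my_notebook.log"          # hypothetical local path
DBFS_TARGET = "dbfs:/logs/my_notebook.log"  # hypothetical DBFS destination

logger = logging.getLogger("my_notebook")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(LOCAL_LOG)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("job started")
# ... notebook work ...
logger.info("job finished")

# Copy the completed log file to DBFS (or an S3 mount) at the end of the run.
dbutils.fs.cp(f"file:{LOCAL_LOG}", DBFS_TARGET)
```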
Mado
by Valued Contributor II
  • 50520 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset: # Prepare Data data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5), ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark will keep an arbitrary row. Therefore, you should define your own logic to identify and remove duplicated rows.

2 More Replies
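Building on the reply, here is a small sketch of one common way to return every occurrence of a duplicate (rather than dropping them) by counting rows per key with a window function; the column names mirror the post's sample data.

```python
from pyspark.sql import Window
from pyspark.sql.functions import count, col

# Sample data matching the post's pattern (values are illustrative).
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5), ("A", "C", 6)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count rows per key; every row whose key appears more than once is kept,
# so all occurrences of the duplicates are returned (unlike dropDuplicates).
w = Window.partitionBy("col1", "col2")
duplicates = (
    df.withColumn("cnt", count("*").over(w))
      .filter(col("cnt") > 1)
      .drop("cnt")
)
duplicates.show()
```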
AnubhavG
by Contributor
  • 4152 Views
  • 1 reply
  • 2 kudos

External APIs

Does Databricks provide a way to integrate with external software/APIs, whether in the form of a UDF or an external function? Can somebody point me to how this can be achieved? My use case is to talk to external APIs from Databricks to perform certain operation...

Latest Reply
daniel_sahal
Databricks MVP
  • 2 kudos

You can write your own code to fetch data from an external API. Example: https://insightsndata.com/how-to-call-rest-api-store-data-in-databricks-8383f2458d7d

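As a concrete illustration of the reply, here is a minimal sketch of calling an external REST API from a notebook and loading the response into a DataFrame; the URL and response shape are hypothetical.

```python
import requests
from pyspark.sql import Row

# Hypothetical endpoint; authentication headers are omitted for brevity.
resp = requests.get("https://api.example.com/v1/items", timeout=30)
resp.raise_for_status()

items = resp.json()  # assumes the API returns a JSON array of objects
df = spark.createDataFrame([Row(**item) for item in items])
df.show()
```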
Ruby8376
by Valued Contributor
  • 5847 Views
  • 5 replies
  • 0 kudos

Resolved! Is there a way to get CDC data from Salesforce to Databricks? Can a smart pipeline be built to get near real-time data from Salesforce into Delta Lake?

Currently, we have a daily batch running to extract data from Salesforce into a CSV file (ADLS), which is further copied to Delta tables for transformation. We are now looking to implement a solution which can extract real-time data changes on Salesforce ...

Latest Reply
daniel_sahal
Databricks MVP
  • 0 kudos

On Azure you can try using the SAP CDC connector for Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/sap-change-data-capture-introduction-architecture

4 More Replies
Himanshi
by New Contributor III
  • 2825 Views
  • 1 reply
  • 6 kudos

How to exclude existing files when moving a streaming job from one Databricks workspace to another that may not be compatible with the existing checkpoint state, so stream processing can resume?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Auto Loader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Himanshi Patle, in Spark Structured Streaming there is an option, maxFileAge, which you can use to control which files are processed based on their timestamp.

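A minimal sketch of the reply's suggestion, assuming maxFileAge behaves as described for the plain file source; the path, format, and schema are placeholders, and exact first-batch behaviour can vary by Spark version.

```python
# Hypothetical schema and ADLS path for a file-source stream (no Auto Loader).
input_schema = "id INT, event_ts TIMESTAMP, payload STRING"

df = (
    spark.readStream.format("json")
    .schema(input_schema)
    .option("maxFileAge", "1d")   # ignore files older than one day
    .load("abfss://container@account.dfs.core.windows.net/input/")
)
```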
AdamRink
by New Contributor III
  • 3330 Views
  • 2 replies
  • 6 kudos

How to limit batch size from Confluent Kafka

I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream I cannot control the batch sizes that are read. I've tried setting options on the readStream - maxBytesPerTrigger, maxOffsetsPerTrigger, fetc...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 6 kudos

Hi @Adam Rink, just checking for further info on your question. How are you deducing that the batch sizes are more than what you are providing as maxOffsetsPerTrigger?

1 More Reply
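For reference, a sketch of where maxOffsetsPerTrigger is set on a Kafka readStream; the broker, topic, and limit are placeholders, and Confluent authentication options are omitted.

```python
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "pkc-xxxxx.confluent.cloud:9092")  # placeholder broker
    .option("subscribe", "my_topic")                                      # placeholder topic
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 1000000)  # cap total offsets read per micro-batch
    # SASL/JAAS options for Confluent Cloud authentication are omitted here
    .load()
)
```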
Tahseen0354
by Valued Contributor
  • 16695 Views
  • 13 replies
  • 35 kudos

How do I compare cost between Databricks on GCP and Azure Databricks?

I have a Databricks job running in Azure Databricks. A similar job is also running in Databricks on GCP. I would like to compare the cost. If I assign a custom tag to the job cluster running in Azure Databricks, I can see the cost incurred by that job i...

Latest Reply
Own
Contributor
  • 35 kudos

In Azure, you can use Cost Management to track the expenses incurred by your Databricks instance.

12 More Replies
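To make the tagging approach concrete, here is a sketch of a job-cluster spec (as it might be passed to the Jobs/Clusters API) with custom_tags that can then be filtered on in Azure Cost Management or in GCP billing; the node type, runtime version, and tag values are placeholders.

```python
# Hypothetical job cluster definition; custom_tags propagate to the underlying
# cloud resources so spend can be attributed per job on either cloud.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",   # e.g. "n1-standard-4" on GCP
    "num_workers": 2,
    "custom_tags": {
        "cost-center": "data-eng",
        "job-name": "daily-ingest",
    },
}
```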
ossinova
by Contributor II
  • 2371 Views
  • 1 reply
  • 0 kudos

Schedule reload of system.information_schema for external tables in platform

Probably not feasible, but is there a way to update (via a STORED PROCEDURE, FUNCTION, or SQL query) the information schema of all external tables within Databricks? The last update that I can see was when I converted the tables to Unity. From my understa...

Latest Reply
Own
Contributor
  • 0 kudos

You can try running OPTIMIZE and caching on the internal tables, such as the schema tables, to fetch updated information.

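For context, a sketch of querying the Unity Catalog information schema directly (there is no stored-procedure style refresh); the column and table_type names assume the standard information_schema layout.

```python
# List external tables and when their metadata was last altered; the
# information_schema views are maintained by Unity Catalog itself.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name, table_type, last_altered
    FROM system.information_schema.tables
    WHERE table_type = 'EXTERNAL'
""")
tables.show(truncate=False)
```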
rammy
by Contributor III
  • 6358 Views
  • 3 replies
  • 11 kudos

How would I retrieve JSON data with namespaces using Spark SQL?

File.json from the code below contains huge JSON data, with each key carrying a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor
  • 11 kudos

In case of a struct, you can use the dot (.) notation for extracting the value.

2 More Replies
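Expanding on the reply, a sketch of selecting struct fields whose keys carry a namespace prefix: dot notation navigates the struct, and backticks keep Spark from misreading the "ns:" prefix; the file path and field names are illustrative.

```python
from pyspark.sql.functions import col

df = spark.read.json("/path/to/File.json")  # hypothetical path

# Backtick-quote each namespaced key, then use dot notation between levels.
records = df.select(
    col("`ns:Order`.`ns:OrderId`").alias("order_id"),
    col("`ns:Order`.`ns:Customer`.`ns:Name`").alias("customer_name"),
)
records.show()
```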
allan-silva
by New Contributor III
  • 7173 Views
  • 3 replies
  • 4 kudos

Resolved! Can't create database - UnsupportedFileSystemException No FileSystem for scheme "dbfs"

I'm following a class, "DE 3.1 - Databases and Tables on Databricks", but it is not possible to create databases due to "AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.Unsupp...

Latest Reply
allan-silva
New Contributor III
  • 4 kudos

A colleague from my work figured out the problem: the cluster being used wasn't configured to use DBFS when running notebooks.

2 More Replies