Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

SS2
by Valued Contributor
  • 11412 Views
  • 4 replies
  • 3 kudos

Spark out of memory error. You can resolve this error by increasing the size of the cluster in Databricks.

Latest Reply
DK03
Contributor
  • 3 kudos

Adding some more points to @karthik p's answer. Use the Kryo serializer instead of the Java serializer. Use an optimised garbage collector such as G1GC. Use partitioning wisely on a field.
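
A minimal sketch of the first two suggestions, written as Spark configuration entries. The property names are standard Spark settings; applying G1GC to both the driver and the executors, and the helper name `render_spark_conf`, are assumptions for illustration. JVM-level settings like these generally have to be in place before the JVM starts, so on Databricks they would normally go into the cluster's Spark config rather than being set from a running notebook.

```python
# Sketch only: Kryo serialization and G1GC expressed as Spark properties.
# Using G1GC on both driver and executors is an assumption, not from the thread.
TUNING_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    "spark.driver.extraJavaOptions": "-XX:+UseG1GC",
}


def render_spark_conf(conf):
    """Render properties as 'key value' lines, one property per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Print the lines so they can be pasted into a cluster-level Spark config.
print(render_spark_conf(TUNING_CONF))
```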

  • 3 kudos
3 More Replies
cchiulan
by Databricks Partner
  • 4390 Views
  • 3 replies
  • 7 kudos

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I cannot figure it out. When the cluster is running, in `driver logs`, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, c...

Latest Reply
Wolf
New Contributor II
  • 7 kudos

We're having the same problem with 11.3 LTS. Are there any updates? We would like to deliver log4j messages from Databricks Notebooks to custom log files and then upload those to S3 or DBFS. Best

  • 7 kudos
2 More Replies
Mado
by Valued Contributor II
  • 50419 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset: # Prepare Data data = [("A", "A", 1), \ ("A", "A", 2), \ ("A", "A", 3), \ ("A", "B", 4), \ ("A", "B", 5), \ ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark keeps an arbitrary row from each group of duplicates. Therefore, you should define your own logic for removing duplicated rows.
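
One possible way to list every occurrence of a duplicate, rather than dropping any, is to count rows per key with a window and keep the groups with more than one row. This is only a sketch: the function name `all_duplicates`, the helper column `dup_count`, and the column names in the usage note are placeholders, since the original sample data is truncated.

```python
def all_duplicates(df, key_cols):
    """Return every row whose key_cols combination occurs more than once."""
    # Imported lazily so the definition loads even where PySpark is absent.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Count the rows that share the same key combination.
    same_key = Window.partitionBy(*key_cols)
    counted = df.withColumn("dup_count", F.count(F.lit(1)).over(same_key))

    # Keep all members of groups with more than one row, then drop the helper.
    return counted.where(F.col("dup_count") > 1).drop("dup_count")
```

Usage would look like `all_duplicates(df, ["col1", "col2"]).show()`, where `col1` and `col2` stand in for the columns that define a duplicate.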

  • 10 kudos
2 More Replies
AnubhavG
by Contributor
  • 4146 Views
  • 1 reply
  • 2 kudos

External APIs

Does Databricks provide a way to integrate with external software/APIs, whether in the form of a UDF or an external function? Can somebody point me to how this can be achieved? My use case is to talk to external APIs from Databricks to perform certain operation...

Latest Reply
daniel_sahal
Databricks MVP
  • 2 kudos

You can write your own code to fetch data from an external API. Example: https://insightsndata.com/how-to-call-rest-api-store-data-in-databricks-8383f2458d7d
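
A rough sketch of what "your own code" could look like, using only the Python standard library. The helper names `fetch_json` and `to_rows` and the endpoint in the usage note are hypothetical; a real integration would also need authentication, retries, and error handling.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(payload):
    """Normalise a JSON payload (one object or a list of objects) to a list."""
    return payload if isinstance(payload, list) else [payload]
```

In a notebook this might be used as `spark.createDataFrame(to_rows(fetch_json("https://api.example.com/items")))`, where the URL is a placeholder and the payload is assumed to be flat JSON objects.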

  • 2 kudos
Ruby8376
by Valued Contributor
  • 5783 Views
  • 5 replies
  • 0 kudos

Resolved! Is there a way to get CDC data from Salesforce to Databricks? Can a smart pipeline be built to get near-real-time data from Salesforce into Delta Lake?

Currently, we have a daily batch job that extracts data from Salesforce into a CSV file (ADLS), which is then copied to Delta tables for transformation. We are now looking to implement a solution that can extract real-time data changes on Salesforce ...

Latest Reply
daniel_sahal
Databricks MVP
  • 0 kudos

On Azure you can try using the SAP CDC connector for Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/sap-change-data-capture-introduction-architecture

  • 0 kudos
4 More Replies
Himanshi
by New Contributor III
  • 2819 Views
  • 1 reply
  • 6 kudos

How to exclude existing files when moving a streaming job from one Databricks workspace to another that may not be compatible with the existing checkpoint state?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without autoloader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Himanshi Patle In Spark streaming there is an option, maxFileAge, that lets you control which files are processed based on their timestamp.

  • 6 kudos
AdamRink
by New Contributor III
  • 3318 Views
  • 2 replies
  • 6 kudos

How to limit batch size from Confluent Kafka

I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream I cannot control the batch sizes that are read. I've tried setting options on the readStream - maxBytesPerTrigger, maxOffsetsPerTrigger, fetc...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 6 kudos

Hi @Adam Rink, just checking for further info on your question. How are you deducing that the batch sizes are more than what you are providing as maxOffsetsPerTrigger?
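
One way to check this, sketched below, is to compare the configured limit with the row count that each micro-batch reports in the query's progress events. `maxOffsetsPerTrigger`, `kafka.bootstrap.servers`, and `subscribe` are real Kafka source options; the helper names and the broker and topic values in the comments are made up for illustration.

```python
def kafka_read_options(bootstrap_servers, topic, max_offsets_per_trigger):
    """Build Kafka source options with a cap on offsets read per trigger."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def batch_sizes(progress_events):
    """Extract (batchId, numInputRows) pairs from streaming progress events."""
    return [(event["batchId"], event["numInputRows"]) for event in progress_events]


# Intended use in a notebook (requires a running stream; values are placeholders):
#   options = kafka_read_options("broker:9092", "events", 100000)
#   stream = spark.readStream.format("kafka").options(**options).load()
#   query = stream.writeStream.format("noop").start()
#   print(batch_sizes(query.recentProgress))
```

If the reported `numInputRows` stays at or below the configured limit, the option is being honoured and the large batches are coming from somewhere else, for example the trigger type in use.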

  • 6 kudos
1 More Replies
Tahseen0354
by Valued Contributor
  • 16602 Views
  • 13 replies
  • 35 kudos

How do I compare cost between Databricks on GCP and Azure Databricks?

I have a Databricks job running in Azure Databricks. A similar job is also running in Databricks on GCP. I would like to compare the cost. If I assign a custom tag to the job cluster running in Azure Databricks, I can see the cost incurred by that job i...

Latest Reply
Own
Contributor
  • 35 kudos

In Azure, you can use Cost Management to track the expenses incurred by your Databricks instance.

  • 35 kudos
12 More Replies
ossinova
by Contributor II
  • 2364 Views
  • 1 reply
  • 0 kudos

Schedule reload of system.information_schema for external tables in platform

Probably not feasible, but is there a way to update (via STORED PROCEDURE, FUNCTION or SQL query) the information schema of all external tables within Databricks? The last update that I can see was when I converted the tables to Unity. From my understa...

Latest Reply
Own
Contributor
  • 0 kudos

You can try running optimize and cache on the internal tables, such as the schema tables, to fetch updated information.

  • 0 kudos
rammy
by Contributor III
  • 6345 Views
  • 3 replies
  • 11 kudos

How would I retrieve JSON data with namespaces using Spark SQL?

File.json from the code below contains a large amount of JSON data, with each key containing a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor
  • 11 kudos

In the case of a struct, you can use dot notation (.) to extract the value.
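
With namespace prefixes the field names contain a colon, so each part of the dotted path usually has to be wrapped in backticks. A small sketch of building such a path follows; the helper name `quoted_path` and the field names `ns:root` and `ns:item` are invented for illustration, since the original file is not shown.

```python
def quoted_path(*parts):
    """Join struct field names into a dotted path, backtick-quoting each part."""
    # A literal backtick inside a name is escaped by doubling it.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


print(quoted_path("ns:root", "ns:item"))
```

The resulting string can then be passed to `df.select(...)` or `df.selectExpr(...)`, or used directly in a Spark SQL query.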

  • 11 kudos
2 More Replies
allan-silva
by New Contributor III
  • 7117 Views
  • 3 replies
  • 4 kudos

Resolved! Can't create database - UnsupportedFileSystemException No FileSystem for scheme "dbfs"

I'm following a class "DE 3.1 - Databases and Tables on Databricks", but it is not possible to create databases due to "AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.Unsupp...

Latest Reply
allan-silva
New Contributor III
  • 4 kudos

A colleague from my work figured out the problem: the cluster being used wasn't configured to use DBFS when running notebooks.

  • 4 kudos
2 More Replies
Shiva_Dsouz
by New Contributor II
  • 2914 Views
  • 1 reply
  • 1 kudos

How to get Spark streaming metrics like input rows, processed rows and batch duration to Prometheus for monitoring

I have been reading this article https://www.databricks.com/session_na20/native-support-of-prometheus-monitoring-in-apache-spark-3-0 and it mentions that we can get the Spark streaming metrics like input rows, processing rate and batch dura...

Latest Reply
SS2
Valued Contributor
  • 1 kudos

I think you can use the Spark UI to see in-depth details.
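
For the Prometheus route the question asks about, open-source Spark 3.0+ has settings along the following lines. This is a sketch under assumptions: the property names are from Apache Spark, but whether a given Databricks runtime exposes the resulting endpoints to a Prometheus scraper is not confirmed in this thread, and the helper name `as_submit_args` is made up.

```python
# Spark properties related to Prometheus-format metrics (Apache Spark 3.0+).
PROMETHEUS_CONF = {
    # Expose executor metrics in Prometheus format through the driver UI.
    "spark.ui.prometheus.enabled": "true",
    # Report Structured Streaming metrics through the metrics system.
    "spark.sql.streaming.metricsEnabled": "true",
    # Enable the Prometheus servlet sink for all metric instances.
    "spark.metrics.conf.*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
}


def as_submit_args(conf):
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(as_submit_args(PROMETHEUS_CONF))
```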

  • 1 kudos
andalo
by New Contributor II
  • 4131 Views
  • 3 replies
  • 2 kudos

Databricks cluster failure

Can you help me with the following error? Message: Cluster terminated. Reason: Azure Vm Extension Failure. Help: Instance bootstrap failed. Failure message: Cloud Provider Failure. Azure VM Extension stuck on transitioning state. Please try again later. VM extensio...

Latest Reply
SS2
Valued Contributor
  • 2 kudos

You can restart the cluster and check again.

  • 2 kudos
2 More Replies
mickniz
by Contributor
  • 6250 Views
  • 6 replies
  • 10 kudos

What is the best way to handle dropping and renaming a column in schema evolution?

I would need some suggestions from Databricks folks. As per the documentation on schema evolution, for drop and rename the data is overwritten. Does that mean we lose data (because I read that data is not deleted but kind of staged)? Is it possible to query old da...

Latest Reply
SS2
Valued Contributor
  • 10 kudos

The overwrite option will overwrite your data. If you want to change a column name, you can first alter the Delta table as needed and then append the new data. That way you can resolve both problems.
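
A sketch of the "alter first, then append" approach as SQL statements generated from Python. Renaming a column in place on a Delta table requires column mapping to be enabled, which is what the first statement does; the table and column names in the example call are hypothetical, and the helper name `rename_column_statements` is made up.

```python
def rename_column_statements(table, old_name, new_name):
    """Return the SQL statements for renaming a Delta column without a rewrite."""
    return [
        # Column mapping by name needs reader version 2 and writer version 5.
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5', "
        "'delta.columnMapping.mode' = 'name')",
        f"ALTER TABLE {table} RENAME COLUMN {old_name} TO {new_name}",
    ]


# Hypothetical table and column names, for illustration only.
for statement in rename_column_statements("sales.orders", "cust_id", "customer_id"):
    print(statement)  # in a notebook: spark.sql(statement)
```

Once the rename has run, new data written with the new column name can be appended as usual.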

  • 10 kudos
5 More Replies
Shirley
by New Contributor III
  • 14450 Views
  • 12 replies
  • 8 kudos

Cluster terminated after 120 mins and cannot restart

Last night the cluster was working properly, but this morning the cluster was terminated automatically and cannot be restarted. Got an error message under Spark UI: Could not find data to load UI for driver 5526297689623955253 in cluster 1125-062259-i...

Latest Reply
SS2
Valued Contributor
  • 8 kudos

Then you can use it.

  • 8 kudos
11 More Replies