Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

Mado
by Valued Contributor II
  • 35797 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset: # Prepare Data data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5), ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II

Hi, in my experience, if you use dropDuplicates(), Spark keeps an arbitrary row from each group of duplicates. Therefore, you should define your own logic to identify and remove the duplicated rows.
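To actually return all occurrences of the duplicates, as the question asks, one common approach is a window count over the key columns. A minimal sketch, assuming the three columns from the sample data are named col1, col2, and col3:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data adapted from the question (the truncated final value is assumed).
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5), ("A", "C", 6)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count rows per (col1, col2) key, then keep every row whose key occurs
# more than once, i.e. all occurrences of the duplicates.
w = Window.partitionBy("col1", "col2")
duplicates = (df.withColumn("cnt", F.count("*").over(w))
                .filter(F.col("cnt") > 1)
                .drop("cnt"))
duplicates.show()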

2 More Replies
Shalabh007
by Honored Contributor
  • 4437 Views
  • 5 replies
  • 19 kudos

Practice Exams for Databricks Certified Data Engineer Professional exam

Can anyone help with an official practice exam set for the Databricks Certified Data Engineer Professional exam, like the one we have for the Associate certification: Practice exam for the Databricks Certified Data Engineer Associate exam

Latest Reply
Nayan7276
Valued Contributor II

Hi @Shalabh Agarwal, I am not able to find any official practice paper; it is still not available.

4 More Replies
AnubhavG
by Contributor
  • 2842 Views
  • 1 reply
  • 2 kudos

External APIs

Does Databricks provide a way to integrate with external software/APIs, whether in the form of a UDF or an external function? Can somebody point me to how this can be achieved? My use case is to talk to external APIs from Databricks to perform certain operation...

Latest Reply
daniel_sahal
Esteemed Contributor

You can write your own code to fetch data from an external API. Example: https://insightsndata.com/how-to-call-rest-api-store-data-in-databricks-8383f2458d7d
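As a rough illustration of that pattern (the endpoint, response shape, and target table below are hypothetical, not taken from the linked post):

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Call the external API from the driver; the endpoint is a placeholder.
response = requests.get("https://api.example.com/v1/items", timeout=30)
response.raise_for_status()
records = response.json()  # assumed to be a list of flat JSON objects

# Land the response in a DataFrame and persist it; the table name is illustrative.
df = spark.createDataFrame(records)
df.write.mode("append").saveAsTable("bronze_items")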

Ruby8376
by Valued Contributor
  • 3256 Views
  • 5 replies
  • 0 kudos

Resolved! Is there a way to get CDC data from Salesforce to Databricks? Can a smart pipeline be built to get near real-time data from Salesforce into Delta Lake?

Currently, we have a daily batch running to extract data from Salesforce into a CSV file (ADLS), which is further copied to Delta tables for transformation. We are now looking to implement a solution which can extract real-time data changes on Salesforce ...

Latest Reply
daniel_sahal
Esteemed Contributor

On Azure you can try using the SAP CDC connector for Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/sap-change-data-capture-introduction-architecture

4 More Replies
Himanshi
by New Contributor III
  • 1636 Views
  • 1 reply
  • 6 kudos

How to exclude existing files when moving a streaming job from one Databricks workspace to another that may not be compatible with the existing checkpoint state?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Auto Loader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor

@Himanshi Patle, in Spark streaming there is an option, maxFileAge, with which you can control which files to process based on their timestamp.
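A minimal sketch of that option on the plain file source (the format, schema, path, and age threshold are placeholders, not from the thread):

# Files whose modification time is older than maxFileAge are ignored,
# which helps skip historical files when starting from a fresh checkpoint.
df = (spark.readStream
      .format("json")
      .schema(input_schema)          # assumed to be defined elsewhere
      .option("maxFileAge", "1h")    # only consider files newer than 1 hour
      .load("/mnt/landing/events/")) # hypothetical input path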

Harun
by Honored Contributor
  • 10114 Views
  • 1 reply
  • 3 kudos

How to change the number of executors instances in databricks

I know that Databricks runs one executor per worker node. Can I change the number of executors by adding params (spark.executor.instances) in the cluster advanced options? And can I pass this parameter when I schedule a task, so that particular task wi...

Latest Reply
karthik_p
Esteemed Contributor

@Harun Raseed Basheer, usually there will be 1 executor per worker node. If we need to split that executor within the worker node itself, we can do that based on the memory and cores assigned; the below configs can be used: spark.executor.cores, spa...
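For reference, such settings go under the cluster's Advanced options > Spark config; the values below are purely illustrative, not a recommendation:

spark.executor.cores 2
spark.executor.memory 8g
spark.executor.instances 4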

AdamRink
by New Contributor III
  • 2166 Views
  • 2 replies
  • 6 kudos

How to limit batch size from Confluent Kafka

I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream I cannot control the batch sizes that are read. I've tried setting options on the readStream - maxBytesPerTrigger, maxOffsetsPerTrigger, fetc...

Latest Reply
UmaMahesh1
Honored Contributor III

Hi @Adam Rink, just checking for further info on your question. How are you deducing that the batch sizes are more than what you are providing as maxOffsetsPerTrigger?
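For reference, a hedged sketch of capping the per-trigger read from Kafka (the broker, topic, and cap are illustrative values, not from the thread):

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      # Upper bound on offsets consumed per micro-batch:
      .option("maxOffsetsPerTrigger", 1000000)
      .load())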

1 More Replies
Tahseen0354
by Valued Contributor
  • 10724 Views
  • 13 replies
  • 35 kudos

How do I compare cost between Databricks on GCP and Azure Databricks?

I have a Databricks job running in Azure Databricks. A similar job is also running in Databricks on GCP. I would like to compare the cost. If I assign a custom tag to the job cluster running in Azure Databricks, I can see the cost incurred by that job i...

Latest Reply
Own
Contributor

In Azure, you can use Cost Management to track the expenses incurred by your Databricks instance.

12 More Replies
ossinova
by Contributor II
  • 1390 Views
  • 1 reply
  • 0 kudos

Schedule reload of system.information_schema for external tables in platform

Probably not feasible, but is there a way to update (via STORED PROCEDURE, FUNCTION, or SQL query) the information schema of all external tables within Databricks? The last update I can see was when I converted the tables to Unity. From my understa...

Latest Reply
Own
Contributor

You can try running OPTIMIZE and caching on the internal tables, such as the schema tables, to fetch updated information.

rammy
by Contributor III
  • 3244 Views
  • 3 replies
  • 11 kudos

How would I retrieve JSON data with namespaces using Spark SQL?

File.json from the below code contains huge JSON data, with each key containing a namespace prefix (this JSON file was converted from an XML file). I am able to retrieve records if the JSON does not contain namespaces, but what could be the approach to retrieve record...

Latest Reply
SS2
Valued Contributor

In case of a struct you can use dot notation (.) for extracting the value.
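A minimal sketch of that, assuming hypothetical namespaced keys such as soap:Envelope (the backticks keep the colon from being misparsed):

from pyspark.sql import functions as F

df = spark.read.json("/path/to/File.json")  # path is a placeholder
# Dot notation walks into the struct; backticks escape the "ns:" prefix.
df.select(F.col("`soap:Envelope`.`soap:Body`").alias("body")).show()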

2 More Replies
allan-silva
by New Contributor III
  • 3906 Views
  • 3 replies
  • 4 kudos

Resolved! Can't create database - UnsupportedFileSystemException No FileSystem for scheme "dbfs"

I'm following the class "DE 3.1 - Databases and Tables on Databricks", but it is not possible to create databases due to "AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.Unsupp...

Latest Reply
allan-silva
New Contributor III

A colleague from my work figured out the problem: the cluster being used wasn't configured to use DBFS when running notebooks.

2 More Replies
Shiva_Dsouz
by New Contributor II
  • 1882 Views
  • 1 reply
  • 1 kudos

How to get spark streaming metrics like input rows, processed rows and batch duration to Prometheus for monitoring

I have been reading this article https://www.databricks.com/session_na20/native-support-of-prometheus-monitoring-in-apache-spark-3-0 and it mentions that we can get Spark streaming metrics like input rows, processing rate, and batch dura...

Latest Reply
SS2
Valued Contributor

I think you can use the Spark UI to see deeper-level details.
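For the Prometheus side of the question, the native support described in the linked Spark 3.0 talk is switched on through Spark config; a sketch (exact endpoints can vary by Spark/Databricks Runtime version):

spark.ui.prometheus.enabled true
spark.metrics.conf.*.sink.prometheusServlet.class org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path /metrics/prometheus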

andalo
by New Contributor II
  • 2602 Views
  • 3 replies
  • 2 kudos

Databricks cluster failure

Can you help me with the following error? Message: Cluster terminated. Reason: Azure VM Extension Failure. Help: Instance bootstrap failed. Failure message: Cloud Provider Failure. Azure VM extension stuck in transitioning state. Please try again later. VM extensio...

Latest Reply
SS2
Valued Contributor

You can restart the cluster and check once.

2 More Replies
mickniz
by Contributor
  • 4038 Views
  • 6 replies
  • 10 kudos

What is the best way to handle dropping and renaming a column with schema evolution?

I would need some suggestions from Databricks folks. As per the documentation on schema evolution, for drop and rename the data is overwritten. Does it mean we lose data (because I read the data is not deleted but kind of staged)? Is it possible to query old da...

Latest Reply
SS2
Valued Contributor

The overwrite option will overwrite your data. If you want to change a column name, you can first alter the Delta table as per your need, then append the new data. So you can resolve both problems.
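A hedged sketch of that rename-then-append route (table, column, and DataFrame names are illustrative; RENAME COLUMN requires column mapping to be enabled on the Delta table):

# Enable column mapping once per table (required for RENAME COLUMN).
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',
      'delta.minReaderVersion' = '2',
      'delta.minWriterVersion' = '5')
""")
# Rename in place without rewriting the data files.
spark.sql("ALTER TABLE my_table RENAME COLUMN old_name TO new_name")
# new_df is assumed to hold the incoming rows; append as usual.
new_df.write.format("delta").mode("append").saveAsTable("my_table")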

5 More Replies
Shirley
by New Contributor III
  • 8622 Views
  • 12 replies
  • 8 kudos

Cluster terminated after 120 mins and cannot restart

Last night the cluster was working properly, but this morning the cluster was terminated automatically and cannot be restarted. Got an error message under the Spark UI: Could not find data to load UI for driver 5526297689623955253 in cluster 1125-062259-i...

Latest Reply
SS2
Valued Contributor

Then you can use it.

11 More Replies
