Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

DK03
by Contributor
  • 1364 Views
  • 2 replies
  • 2 kudos
Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

As @Werner Stinckens said, it would be OK. Generally, though, joins on decimal columns are not recommended, because other factors such as precision and length come into play... Also, when you are joining on decimal columns, be sure to check the abs value of...

1 More Replies
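The original question in this thread is not shown in the listing, but the reply concerns joining on decimal columns. A minimal PySpark sketch of the idea, with hypothetical DataFrames and column names, aligning precision explicitly and optionally comparing absolute differences instead of exact equality:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames with decimal join keys (names are illustrative only)
orders = spark.createDataFrame([(1, "10.250")], ["id", "amount"]) \
    .withColumn("amount", F.col("amount").cast(DecimalType(10, 3)))
payments = spark.createDataFrame([(1, "10.25")], ["id", "amount"]) \
    .withColumn("amount", F.col("amount").cast(DecimalType(10, 2)))

# Align precision/scale explicitly before joining on the decimal column
joined = orders.join(
    payments.withColumn("amount", F.col("amount").cast(DecimalType(10, 3))),
    on=["id", "amount"],
    how="inner",
)

# Alternative: join on the id and compare the absolute difference
# instead of requiring exact decimal equality
tolerant = orders.alias("o").join(payments.alias("p"), "id") \
    .where(F.abs(F.col("o.amount") - F.col("p.amount")) < 0.001)
```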
fury88
by New Contributor II
  • 1318 Views
  • 1 replies
  • 1 kudos

Does CACHE TABLE/VIEW have a CREATE OR REPLACE option, like a view?

I'm trying to cache data/queries that we normally have as temporary views, which get replaced when the code is run based on dynamic Python. What I'd like to know is: will CACHE TABLE get overwritten each time you run it? Is it smart enough to recognize ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 1 kudos

Hi @Matt Fury Yes... I believe the cache is overwritten each time you run it, because for me it took nearly the same amount of time to cache 1 million records on each run. However, you can check whether the table is cached or not using the .storageLevel method. E.g. I have...

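A minimal PySpark sketch of the approach in the reply above: re-issue CACHE TABLE after the view is replaced, then verify the cache state through the catalog and .storageLevel (the view name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical temp view that gets rebuilt by dynamic Python code
spark.range(1_000_000).createOrReplaceTempView("my_view")

# CACHE TABLE can simply be re-issued after the view is replaced;
# the new definition is what ends up cached.
spark.sql("CACHE TABLE my_view")

# Check whether it is cached
print(spark.catalog.isCached("my_view"))    # True / False
print(spark.table("my_view").storageLevel)  # e.g. Disk Memory Deserialized 1x Replicated

# Drop the cached entry explicitly if needed
spark.sql("UNCACHE TABLE my_view")
```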
Durbinar
by New Contributor III
  • 3448 Views
  • 4 replies
  • 4 kudos

Resolved! Azure Databricks Default DNS

My Azure Databricks workspace's default DNS is 168.63.129.16. This DNS doesn't seem to resolve Azure storage accounts that were created a year ago; after tweaking the cluster to use 8.8.8.8, it is able to resolve the desired storage accounts. Is there a d...

Latest Reply
Durbinar
New Contributor III
  • 4 kudos

IP address 168.63.129.16 is a virtual public IP address that is used to facilitate a communication channel to Azure platform resources. Customers can define any address space for their private virtual network in Azure. Therefore, the Azure platform...

3 More Replies
200723
by New Contributor II
  • 1893 Views
  • 4 replies
  • 4 kudos

"No SRV records" intermittent error when running Databricks Pyspark to connect Mongo Atlas

My Mongo Atlas connection URL is like mongodb+srv://<srv_hostname>. I don't want to use a direct URL like mongodb://<hostname1, hostname2, hostname3....> because our Mongo Atlas global clusters have many hosts and it would be hard to maintain. Our Java programs...

Latest Reply
Noopur_Nigam
Valued Contributor II
  • 4 kudos

Hi @Raymond Lai The issue looks to be in the MongoDB connector. The connection is created and maintained by the mongo-spark connector. You can try using the direct mongodb hosts in the connection string instead of SRV to avoid doing DNS lookups, or...

3 More Replies
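A hedged sketch of what the reply suggests: reading from Mongo with direct hosts instead of an SRV record. The hostnames, credentials, and database/collection names are placeholders, and the option names differ between mongo-spark connector versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical direct connection string (hosts/credentials are placeholders);
# listing hosts explicitly avoids the SRV DNS lookup mentioned in the reply.
direct_uri = (
    "mongodb://user:password@host1.example.net:27017,"
    "host2.example.net:27017,host3.example.net:27017/"
    "mydb?replicaSet=atlas-shard-0&tls=true&authSource=admin"
)

# Option names depend on the connector version: mongo-spark v10.x uses
# "connection.uri"; the older v3.x connector uses "spark.mongodb.input.uri".
df = (
    spark.read.format("mongodb")          # "mongo" for the v3.x connector
    .option("connection.uri", direct_uri)
    .option("database", "mydb")
    .option("collection", "mycollection")
    .load()
)
df.printSchema()
```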
Dicer
by Valued Contributor
  • 5811 Views
  • 5 replies
  • 7 kudos

Is it reasonable for the process "Determining the location of DBIO file fragments." to take me 7 hours?

I only have 1,000 columns. Each column has 252 rows, so there are only 252,000 data points. How can routing tasks for the best cache locality take 7 hours?

Latest Reply
Noopur_Nigam
Valued Contributor II
  • 7 kudos

Hi @Cheuk Hin Christophe Poon Have you optimized your table at any point since its creation? If not, OPTIMIZE may take some time depending on the number of underlying files. Please try running OPTIMIZE manually as described in the document below: https://docs....

4 More Replies
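A short sketch of the suggestion in the reply: run OPTIMIZE manually on the Delta table. The table name and ZORDER column are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table name; OPTIMIZE compacts small files, which can
# reduce the work done while locating file fragments.
spark.sql("OPTIMIZE my_catalog.my_schema.my_table")

# Optionally co-locate data that is frequently filtered on
spark.sql("OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (event_date)")

# Inspect how many files back the table before/after
spark.sql("DESCRIBE DETAIL my_catalog.my_schema.my_table") \
    .select("numFiles", "sizeInBytes") \
    .show()
```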
shrutis23
by New Contributor III
  • 3104 Views
  • 5 replies
  • 4 kudos

How to use Delta Live Tables with Google Cloud Storage

Hi Team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how do we access the GCS bucket? We have a connection established using a Databricks service account. In normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Contributor
  • 4 kudos

Kindly mount the GCS cloud storage to a DBFS location; see below: Mounting cloud object storage on Databricks | Databricks on Google Cloud

4 More Replies
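A rough sketch of the mount approach in the reply, assuming the cluster already has a Google service account with access to the bucket. dbutils and display are Databricks notebook built-ins, and the bucket/mount names are placeholders:

```python
# Assumes the cluster was created with a Google service account that can
# read the bucket (bucket and mount names below are placeholders).
bucket_name = "my-dlt-source-bucket"
mount_name = "dlt_source"

# Mount only if it is not already mounted
if not any(m.mountPoint == f"/mnt/{mount_name}" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(f"gs://{bucket_name}", f"/mnt/{mount_name}")

display(dbutils.fs.ls(f"/mnt/{mount_name}"))

# A DLT table could then read from the mount (or directly from gs://...):
# @dlt.table
# def raw_events():
#     return spark.readStream.format("cloudFiles") \
#         .option("cloudFiles.format", "json") \
#         .load(f"/mnt/{mount_name}/events")
```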
SS2
by Valued Contributor
  • 3510 Views
  • 4 replies
  • 3 kudos

Spark out of memory error

You can resolve this error by increasing the size of the cluster in Databricks.

Latest Reply
DK03
Contributor
  • 3 kudos

Adding some more points to @karthik p's answer: use the Kryo serializer instead of the Java serializer, use an optimised garbage collector such as G1GC, and partition wisely on a field.

3 More Replies
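An illustrative sketch of the settings mentioned in the reply (Kryo serializer, G1GC, partitioning on a field). The values and table names are placeholders, and on Databricks the serializer and JVM options are normally set in the cluster's Spark config rather than in notebook code:

```python
from pyspark.sql import SparkSession

# Illustrative values only; on Databricks these are usually entered in the
# cluster's "Spark config" box, e.g.:
#   spark.serializer org.apache.spark.serializer.KryoSerializer
#   spark.executor.extraJavaOptions -XX:+UseG1GC
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)

# Partition on a field that is commonly filtered or joined on,
# so each task works on a bounded slice of the data.
df = spark.read.table("my_catalog.my_schema.events")   # hypothetical table
df.repartition("event_date").write.partitionBy("event_date") \
    .mode("overwrite").saveAsTable("my_catalog.my_schema.events_partitioned")
```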
cchiulan
by New Contributor III
  • 1948 Views
  • 3 replies
  • 7 kudos

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I cannot figure it out. When the cluster is running, in `driver logs`, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, c...

Latest Reply
Wolf
New Contributor II
  • 7 kudos

We're having the same problem with 11.3 LTS. Are there any updates? We would like to deliver log4j messages from Databricks Notebooks to custom log files and then upload those to S3 or DBFS. Best

2 More Replies
Mado
by Valued Contributor II
  • 24903 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. The following is the sample dataset: # Prepare Data data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5), ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark will keep an arbitrary row. Therefore, you should define your own logic to remove duplicated rows.

2 More Replies
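A minimal PySpark sketch of one way to return all occurrences of duplicates on specific columns, since dropDuplicates() keeps an arbitrary row as the reply notes. The column names are assumed, because the sample data in the listing is truncated:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the question (truncated in the listing, so partially assumed)
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5), ("A", "C", 6)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count rows per (col1, col2) and keep every row whose group appears more than once
w = Window.partitionBy("col1", "col2")
duplicates = df.withColumn("cnt", F.count("*").over(w)) \
               .filter("cnt > 1") \
               .drop("cnt")
duplicates.show()
```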
Shalabh007
by Honored Contributor
  • 3389 Views
  • 5 replies
  • 19 kudos

Practice Exams for Databricks Certified Data Engineer Professional exam

Can anyone help with an official practice exam set for the Databricks Certified Data Engineer Professional exam, like the one we have for the Associate certification: Practice exam for the Databricks Certified Data Engineer Associate exam

Latest Reply
Nayan7276
Valued Contributor II
  • 19 kudos

Hi @Shalabh Agarwal, I am not able to find any official practice paper. It is still not available.

4 More Replies
AnubhavG
by Contributor
  • 1843 Views
  • 1 replies
  • 2 kudos

External APIs

Does Databricks provide a way to integrate with external software/APIs, whether in the form of a UDF or an external function? Can somebody point me to how this can be achieved? My use case is to talk to external APIs from Databricks to perform certain operation...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 2 kudos

You can write your own code to fetch data from an external API. Example: https://insightsndata.com/how-to-call-rest-api-store-data-in-databricks-8383f2458d7d

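A hedged sketch of the do-it-yourself approach in the reply: call a REST API with requests and land the response in a Delta table. The endpoint, secret scope, and table name are placeholders, and the payload is assumed to be a list of records:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical endpoint; in practice keep tokens in a Databricks secret scope
API_URL = "https://api.example.com/v1/items"
TOKEN = dbutils.secrets.get(scope="my-scope", key="api-token")  # notebook built-in

resp = requests.get(API_URL, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
resp.raise_for_status()

# Turn the JSON payload (assumed to be a list of records) into a DataFrame
df = spark.createDataFrame(resp.json())
df.write.mode("append").saveAsTable("my_catalog.my_schema.api_items")
```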
Ruby8376
by Valued Contributor
  • 2264 Views
  • 5 replies
  • 0 kudos

Resolved! Is there a way to get CDC data from Salesforce to Databricks? Can a smart pipeline be built to get near-real-time data from Salesforce into Delta Lake?

Currently, we have a daily batch running to extract data from Salesforce into a CSV file (ADLS), which is further copied to Delta tables for transformation. We are now looking to implement a solution that can extract real-time data changes on Salesforce ...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

On Azure you can try using the SAP CDC connector for Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/sap-change-data-capture-introduction-architecture

4 More Replies
Himanshi
by New Contributor III
  • 1074 Views
  • 1 replies
  • 6 kudos

How to exclude existing files when moving a streaming job from one Databricks workspace to another workspace that may not be compatible with the existing checkpoint state for resuming stream processing?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Auto Loader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Himanshi Patle In Spark Streaming there is an option, maxFileAge, with which you can control which files are processed based on their timestamp.

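A rough sketch of the maxFileAge option mentioned in the reply, applied to a plain file streaming source. The paths, schema, and table name are placeholders, and the exact first-batch semantics of maxFileAge should be verified for your Spark version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path; per the reply, maxFileAge limits which files the
# file source considers based on their age, which can help avoid replaying
# very old files when starting with a fresh checkpoint.
stream = (
    spark.readStream.format("json")
    .schema("id STRING, ts TIMESTAMP")   # file sources need an explicit schema
    .option("maxFileAge", "1d")
    .load("dbfs:/mnt/raw/events/")
)

query = (
    stream.writeStream
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/events_new_workspace/")
    .toTable("my_catalog.my_schema.events")
)
```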
Harun
by Honored Contributor
  • 2387 Views
  • 1 replies
  • 1 kudos

How to change the number of executor instances in Databricks

I know that Databricks runs one executor per worker node. Can I change the number of executors by adding params (spark.executor.instances) in the cluster's advanced options? And can I also pass this parameter when I schedule a task, so that a particular task wi...

Latest Reply
karthik_p
Esteemed Contributor
  • 1 kudos

@Harun Raseed Basheer Usually there is 1 executor per worker node. If we need to split that executor within the worker node itself, we can do that based on the memory each core has been assigned; the configs below can be used: spark.executor.cores, spa...

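An illustrative sketch of the executor-related configs referenced in the reply. The values are placeholders, and on Databricks they are normally supplied through the cluster's Spark config or a job cluster's spark_conf block rather than in notebook code:

```python
# Illustrative values only; executors are sized at cluster start, so these
# settings belong in the cluster's "Spark config" box or a job cluster definition.
spark_conf = {
    "spark.executor.cores": "2",       # cores per executor
    "spark.executor.memory": "8g",     # heap per executor
    "spark.executor.instances": "4",   # total executors requested
}

# Example of passing the same settings in a hypothetical job cluster payload
# (field names follow the Databricks REST API; node type and runtime are placeholders):
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 2,
    "spark_conf": spark_conf,
}
```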
AdamRink
by New Contributor III
  • 1515 Views
  • 2 replies
  • 6 kudos

How to limit batch size from Confluent Kafka

I have a large stream of data read from Confluent Kafka, 500+ million rows. When I initialize the stream I cannot control the batch sizes that are read. I've tried setting options on the readStream - maxBytesPerTrigger, maxOffsetsPerTrigger, fetc...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 6 kudos

Hi @Adam Rink Just checking for further info on your question: how are you deducing that the batch sizes are larger than what you are providing as maxOffsetsPerTrigger?

1 More Replies
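A hedged sketch of capping micro-batch size with maxOffsetsPerTrigger and checking the actual per-batch row count with foreachBatch (which speaks to the follow-up question in the reply). The bootstrap servers, topic, and checkpoint path are placeholders, and SASL credentials are omitted:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Confluent endpoint and topic; maxOffsetsPerTrigger caps the
# number of offsets consumed per micro-batch across all partitions.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "pkc-xxxxx.confluent.cloud:9092")
    .option("kafka.security.protocol", "SASL_SSL")   # SASL credentials omitted
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 100000)          # ~100k offsets per batch
    .load()
)

# foreachBatch makes the actual per-batch row count easy to verify
def report_batch(batch_df, batch_id):
    print(f"batch {batch_id}: {batch_df.count()} rows")

query = (
    raw.writeStream
    .foreachBatch(report_batch)
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/orders/")
    .start()
)
```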