Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

RajibRajib_Mand
by New Contributor III
  • 2714 Views
  • 2 replies
  • 0 kudos

Reading Password protected excel(.xlsx) file in databricks

I want to read a password-protected Excel file and load the data into a Delta table. Can you please let me know how this can be achieved in Databricks?

Latest Reply
igorsalo22
New Contributor II
  • 0 kudos

df = spark.read.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Base'!A1") \
    .option("header", "true") \
    .option("workbookPassword", "test") \
    .load("test.xlsx")
display(df)
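
Since the original question also asks about loading the data into a Delta table, a minimal follow-up sketch (the target table name bronze.protected_excel is a placeholder):

# Persist the DataFrame read above into a Delta table; the table name is hypothetical.
df.write.format("delta").mode("overwrite").saveAsTable("bronze.protected_excel")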

1 More Reply
DK03
by Contributor
  • 1612 Views
  • 2 replies
  • 2 kudos
Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

As @Werner Stinckens said, it would be OK. But generally, joins on decimal columns are not recommended, as other factors come into play, like precision, length, etc. Also, when you are joining on decimal columns, be sure to check out the abs value of...
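
As a small illustration of that advice, a hedged sketch that casts both join keys to a common precision and scale before joining (table and column names are placeholders):

from pyspark.sql import functions as F

# Align precision/scale on both sides so representation differences don't silently drop matches.
left = spark.table("sales").withColumn("amount_key", F.col("amount").cast("decimal(18,2)"))
right = spark.table("refunds").withColumn("amount_key", F.col("amount").cast("decimal(18,2)"))
joined = left.join(right, on="amount_key", how="inner")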

1 More Reply
fury88
by New Contributor II
  • 1550 Views
  • 1 reply
  • 1 kudos

Does CACHE TABLE/VIEW have a create or replace like view?

I'm trying to cache data/queries that we normally have as temporary views, which get replaced when the code runs based on dynamic Python. What I'd like to know is: will CACHE TABLE get overwritten each time you run it? Is it smart enough to recognize ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 1 kudos

Hi @Matt Fury, yes, I guess the cache is overwritten each time you run it, because for me it took nearly the same amount of time for 1 million records to be cached. However, you can check whether the table is cached or not using the .storageLevel method. E.g. I have...
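
A minimal sketch of that check, assuming a table named my_table:

# Re-running CACHE TABLE simply re-caches the current contents of the table.
spark.sql("CACHE TABLE my_table")

# Inspect the cache status and storage level from a notebook.
print(spark.catalog.isCached("my_table"))
print(spark.table("my_table").storageLevel)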

Durbinar
by New Contributor III
  • 4059 Views
  • 4 replies
  • 4 kudos

Resolved! Azure Databricks Default DNS

My Azure Databricks workspace's default DNS is 168.63.129.16. This DNS doesn't seem to resolve Azure storage accounts which were created a year ago; after tweaking the cluster to use 8.8.8.8, I was able to resolve the desired storage accounts. Is there a d...

Latest Reply
Durbinar
New Contributor III
  • 4 kudos

IP address 168.63.129.16 is a virtual public IP address that is used to facilitate a communication channel to Azure platform resources. Customers can define any address space for their private virtual network in Azure. Therefore, the Azure platform...

3 More Replies
200723
by New Contributor II
  • 2163 Views
  • 4 replies
  • 4 kudos

"No SRV records" intermittent error when running Databricks Pyspark to connect Mongo Atlas

My Mongo Atlas connection URL is like mongodb+srv://<srv_hostname>. I don't want to use a direct URL like mongodb://<hostname1, hostname2, hostname3....> because our Mongo Atlas global clusters have many hosts and it would be hard to maintain. Our Java programs...

Latest Reply
Noopur_Nigam
Valued Contributor II
  • 4 kudos

Hi @Raymond Lai, the issue looks to be in the MongoDB connector. The connection is created and maintained by the mongo-spark connector. You can try using the direct MongoDB hosts in the connection string instead of SRV to avoid doing DNS lookups, or...
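
A hedged sketch of switching from the SRV URI to a direct host list with the mongo-spark connector (hostnames, replica set, database, and collection are placeholders; the option names below assume the 10.x connector and differ in 3.x, which uses format "mongo"):

direct_uri = "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=atlas-abc123&tls=true"

df = (spark.read.format("mongodb")
      .option("connection.uri", direct_uri)
      .option("database", "mydb")
      .option("collection", "mycollection")
      .load())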

3 More Replies
Dicer
by Valued Contributor
  • 6216 Views
  • 5 replies
  • 7 kudos

Is it reasonable for the process "Determining the location of DBIO file fragments." to take me 7 hours?

I only have 1,000 columns. Each column has 252 rows, so there are only 252,000 data points. How can it spend 7 hours routing tasks for the best cached locality?

Latest Reply
Noopur_Nigam
Valued Contributor II
  • 7 kudos

Hi @Cheuk Hin Christophe Poon, have you optimized your table at any point since its creation? If not, OPTIMIZE may take some time depending on the number of underlying files. Please try to run OPTIMIZE manually as described in the document below: https://docs....
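
For reference, a minimal sketch of running OPTIMIZE manually from a notebook (the table name and Z-ORDER column are placeholders):

# Compact small files in the Delta table.
spark.sql("OPTIMIZE my_db.my_table")

# Optionally co-locate data on a frequently filtered column.
spark.sql("OPTIMIZE my_db.my_table ZORDER BY (event_date)")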

4 More Replies
shrutis23
by New Contributor III
  • 3514 Views
  • 5 replies
  • 4 kudos

How to use Delta Live Tables with Google Cloud Storage

Hi team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how do we access the GCS bucket? We have a connection established using a Databricks service account. In normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Contributor
  • 4 kudos

Kindly mount a DBFS location to the GCS cloud storage; see Mounting cloud object storage on Databricks | Databricks on Google Cloud.
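
An alternative to mounting is to point the pipeline directly at the bucket; a hedged sketch, assuming the pipeline's service account can read the bucket (path and table name are placeholders):

import dlt

@dlt.table(name="raw_events")
def raw_events():
    # Auto Loader reading straight from the GCS location.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("gs://my-bucket/landing/events/"))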

4 More Replies
SS2
by Valued Contributor
  • 4318 Views
  • 4 replies
  • 3 kudos

Spark out of memory error. You can resolve this error by increasing the size of the cluster in Databricks.

Latest Reply
DK03
Contributor
  • 3 kudos

Adding some more points to @karthik p's answer: use the Kryo serializer instead of the Java serializer, use an optimized garbage collector such as G1GC, and use partitioning wisely on a field.
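
A hedged sketch of the first two suggestions; on Databricks these need to be set in the cluster's Spark config before the cluster starts, so the builder form below is just the plain-Spark equivalent:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .getOrCreate())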

3 More Replies
cchiulan
by New Contributor III
  • 2246 Views
  • 3 replies
  • 7 kudos

Databricks Log4J Custom Appender Not Working as expected

I'm trying to figure out how a custom appender should be configured in a Databricks environment, but I cannot figure it out. When the cluster is running, in `driver logs`, the time is displayed as 'unknown' for my custom log file, and when the cluster is stopped, c...

Latest Reply
Wolf
New Contributor II
  • 7 kudos

We're having the same problem with 11.3 LTS. Are there any updates? We would like to deliver log4j messages from Databricks Notebooks to custom log files and then upload those to S3 or DBFS. Best

2 More Replies
Mado
by Valued Contributor II
  • 28800 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset:

# Prepare Data
data = [("A", "A", 1), \
        ("A", "A", 2), \
        ("A", "A", 3), \
        ("A", "B", 4), \
        ("A", "B", 5), \
        ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark will keep a random row. Therefore, you should define a logic to remove duplicated rows.
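
A minimal sketch of keeping every occurrence of a duplicate (rather than dropping them), using a small sample dataset similar to the one in the question and assuming the duplicate check is on the first two columns:

from pyspark.sql import functions as F, Window

data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count rows per key and keep every row whose key appears more than once.
w = Window.partitionBy("col1", "col2")
duplicates = df.withColumn("cnt", F.count(F.lit(1)).over(w)).filter("cnt > 1").drop("cnt")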

2 More Replies
Shalabh007
by Honored Contributor
  • 3737 Views
  • 5 replies
  • 19 kudos

Practice Exams for Databricks Certified Data Engineer Professional exam

Can anyone help with an official practice exam set for the Databricks Certified Data Engineer Professional exam, like the one we have below for the Databricks Certified Data Engineer Associate: Practice exam for the Databricks Certified Data Engineer Associate exam

Latest Reply
Nayan7276
Valued Contributor II
  • 19 kudos

Hi @Shalabh Agarwal, I am not able to find any official practice paper. It is still not available.

4 More Replies
AnubhavG
by Contributor
  • 2184 Views
  • 1 reply
  • 2 kudos

External APIs

Does Databricks provide a way to integrate with external software/APIs, whether in the form of a UDF or an external function? Can somebody point me to how this can be achieved? My use case is to talk to external APIs from Databricks to perform certain operation...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 2 kudos

You can write your own code to fetch data from an external API. Example: https://insightsndata.com/how-to-call-rest-api-store-data-in-databricks-8383f2458d7d
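
A hedged sketch of calling an external REST API from Databricks with the requests library; the endpoint, table, and column names are placeholders, and a UDF is only one option (driver-side calls or batch ingestion may fit better for larger volumes):

import requests
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

API_URL = "https://api.example.com/lookup"   # hypothetical endpoint

@F.udf(StringType())
def call_api(key):
    # One HTTP call per row; keep timeouts tight and volumes small.
    resp = requests.get(API_URL, params={"key": key}, timeout=10)
    return resp.text if resp.ok else None

df = spark.table("my_table").withColumn("api_result", call_api(F.col("id")))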

Ruby8376
by Valued Contributor
  • 2609 Views
  • 5 replies
  • 0 kudos

Resolved! Is there a way to get CDC data from Salesforce to Databricks? Can a smart pipeline be built to get near-real-time data from Salesforce into Delta Lake?

Currently, we have a daily batch running to extract data from Salesforce into a CSV file (ADLS), which is further copied to Delta tables for transformation. We are now looking to implement a solution which can extract real-time data changes on Salesforce ...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 0 kudos

On Azure you can try using the SAP CDC connector for Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/sap-change-data-capture-introduction-architecture

4 More Replies
Himanshi
by New Contributor III
  • 1192 Views
  • 1 reply
  • 6 kudos

How to exclude existing files when moving a streaming job from one Databricks workspace to another whose checkpoint state may not be compatible, so that stream processing can resume?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Auto Loader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Himanshi Patle, in Spark Structured Streaming there is an option, maxFileAge, with which you can control which files to process based on their timestamp.
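
A hedged sketch of that option on a plain file stream source; the path and schema are placeholders, and the Spark docs note caveats around the first batch and the latestFirst setting, so it is worth testing against your layout:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

input_schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

# Ignore files older than one day relative to the newest file in the directory.
stream = (spark.readStream
          .format("json")
          .schema(input_schema)          # file streams require an explicit schema
          .option("maxFileAge", "1d")
          .load("dbfs:/mnt/landing/events/"))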

Harun
by Honored Contributor
  • 2968 Views
  • 1 reply
  • 3 kudos

How to change the number of executor instances in Databricks

I know that Databricks runs one executor per worker node. Can I change the number of executors by adding params (spark.executor.instances) in the cluster advanced options? And can I also pass this parameter when I schedule a task, so that particular task wi...

Latest Reply
karthik_p
Esteemed Contributor
  • 3 kudos

@Harun Raseed Basheer, usually there is one executor per worker node. If we need to split that executor within the worker node itself, we can do that based on the memory and cores assigned; the configs below can be used: spark.executor.cores, spa...
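
For reference, the keys being mentioned are spark.executor.cores, spark.executor.memory, and spark.executor.instances; on Databricks they are entered in the cluster's Spark config (Advanced options) rather than from a notebook. A small sketch for checking what a running cluster ended up with:

# Values come back as strings; "not set" means Databricks is using its defaults.
conf = spark.sparkContext.getConf()
for key in ("spark.executor.instances", "spark.executor.cores", "spark.executor.memory"):
    print(key, "=", conf.get(key, "not set"))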


Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group