Spark - Cluster Mode - Driver
When running a Spark job in cluster mode, how does Spark decide which worker node to place the driver resources on?
- 1733 Views
- 0 replies
- 0 kudos
I am baffled by the behaviour of Databricks: Below you can see the contents of the directory, listed using dbutils in Databricks. It clearly shows the `test.xlsx` file in the directory (and I can even open it using `dbutils.fs.head`). But when I go to use panda.re...
Hey, I encountered this recently. I can see you are using a shared cluster; try switching to a single user cluster and it will fix it. Can someone let me know why it wasn't working with a shared cluster? Thanks.
Hi everyone, is there any way to read streams from 2 different Kafka topics with 2 different in 1 job or on the same cluster? Or do we need to create 2 separate jobs for it? (The job will need to process continuously.)
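For the two-topic question above, here is a minimal sketch of one option: Spark's Kafka source takes a comma-separated topic list in its `subscribe` option, so a single stream in a single job can read both topics. The helper name `kafka_source_options`, the broker address, and the topic names are made up for illustration; whether one stream or two fits better depends on how different the two payloads are.

```python
def kafka_source_options(bootstrap_servers, topics):
    """Build options for Spark's Kafka source that subscribe to several topics."""
    if not topics:
        raise ValueError("at least one topic is required")
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        # The Kafka source accepts a comma-separated topic list in `subscribe`.
        "subscribe": ",".join(topics),
    }


# Hypothetical usage inside a Spark job (broker and topic names are placeholders):
# options = kafka_source_options("broker:9092", ["orders", "payments"])
# stream = spark.readStream.format("kafka").options(**options).load()
# Each record carries a `topic` column, so rows can be split per topic downstream.
```

If the two topics need completely separate processing, two `readStream` queries started from the same job is another option worth testing.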
I have a merge function for streaming foreachBatch, something like `mergedf(df, i): merge_func_1(df, i) merge_func_2(df, i)`. Then I want to add a new merge_func_3 to it. Are there any best practices for this case? When the stream is always running, how can I process...
This is more a Spark question than a Databricks question. I'm encountering an issue when writing data to an Oracle database using Apache Spark. My workflow involves removing duplicate rows from a DataFrame and then writing the deduplicated DataFrame to ...
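One way to keep a `mergedf(df, i)` callback like the one above easy to extend is to build it from a list of steps, so adding `merge_func_3` is a one-line change. This is only a sketch: `make_batch_handler` is a hypothetical helper, and the `merge_func_*` names are taken from the post.

```python
def make_batch_handler(steps):
    """Return a foreachBatch-style callback that runs each step in order."""
    steps = list(steps)  # copy so later changes to the caller's list have no effect

    def mergedf(df, batch_id):
        # Call every registered merge step with the same micro-batch.
        for step in steps:
            step(df, batch_id)

    return mergedf


# Hypothetical usage with a streaming DataFrame named `stream`:
# handler = make_batch_handler([merge_func_1, merge_func_2, merge_func_3])
# stream.writeStream.foreachBatch(handler).start()
```

A query that is already running keeps the code it was started with, so the new step only takes effect after the query is restarted. With a checkpoint location configured, the restarted query should resume from where the previous run left off.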
The difference in behaviour between using foreachPartition and data.write.jdbc(...) after dropDuplicates() could be due to how Spark handles data partitioning and operations on partitions. When you use foreachPartition, you are manually handling the ...
Overview: To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower. My question is this: Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR...
Hi @Graham, can you please try Low Shuffle Merge (LSM) and see if it helps? LSM is a new MERGE algorithm that aims to maintain the existing data organization (including z-order clustering) for unmodified data, while simultaneously improving performan...
I saved a file with results by just opening a file via fopen("filename.csv", "a"). Once the execution ended (and the cluster shut down) I couldn't retrieve the file. I found that the file was stored in "/databricks/driver", and that folder empties w...
Hi team, when we read a CSV file from Azure Blob using Databricks, we do not get any key error and are able to read the data from the blob. But if we try to read an XML file, it fails with a key issue (invalid configuration). Error: Failure to inti...
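For the lost-file situation above, a minimal Python sketch of one approach: keep writing to the local driver disk, then copy the finished file to a location that outlives the cluster before the run ends. `persist_file` is a hypothetical helper, and the `/dbfs/...` destination in the usage comment is an assumed example path, not something taken from the post.

```python
import os
import shutil


def persist_file(local_path, persistent_dir):
    """Copy a finished local file into a directory that survives the cluster."""
    os.makedirs(persistent_dir, exist_ok=True)
    # shutil.copy returns the path of the newly created copy.
    return shutil.copy(local_path, persistent_dir)


# Hypothetical usage at the end of a run (destination path is an assumption):
# persist_file("/databricks/driver/filename.csv", "/dbfs/tmp/results")
```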
We have a monorepo so our pyspark notebooks do not use namespace relative to the root of the repo. Thus the default sys.path of repo root and cwd does not work. We used to package a whl dependency but recently moved to having code update sys.path wit...
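As a rough illustration of the `sys.path` approach mentioned above, here is a small sketch that puts a source directory on the import path exactly once. The helper name `add_to_sys_path` and the directory in the usage comment are hypothetical; the post does not say how its own code does this.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path unless it is already listed."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical usage at the top of a notebook (the relative path is a placeholder):
# add_to_sys_path("../../libs/shared")
```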
Hi @Liliana , Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.
A Databricks Python SQL script gives the error below: Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Hi @hal-qna, Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.
I'll try to answer this in the simplest possible way. 1. Spark is an imperative programming framework. You tell it what to do, and it does it. DLT is declarative: you describe what you want the datasets to be (i.e. the transforms), and it takes care ...
I want to test a pipeline created using DLT and Python in VS Code.
Hey @rt-slowth check out this tutorial. You won't get debugging in VSCode yet, but this workflow is pretty nice.
Hi Team, we have a requirement to encrypt PII data in the Silver layer. What is the best way to implement this in DLT, so that only users who have security privileges are able to decrypt the PII info? I have done this in the past using Structured Streaming but...
Can you show me how to use the functions built into PySpark with DLT, please? Also, I am trying to implement column/row-level security in Silver tables that are generated by DLT, but it is giving me the following error [RequestId=35024c5d-ad05-4f68-a4cb-f3a723f66e1c...
Trying to use displayHTML from within a Python module raises a Python exception: NameError: name 'displayHTML' is not defined. I've found no way around this. It seems to be something at the UI layer or something, not a Python function that can be refere...
Holy Guacamole Batman! It works finally!!!! Wow, thanks @ptweir That's awesome! I can go back and update my doc (and code, to just use databricks the same, now, and Jupyter!) and it'll work by default. It's great they fixed it, shame they never told ...
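For environments where `displayHTML` is still only defined in the notebook itself, one generic workaround is to have the module accept the display function as a parameter instead of looking it up by name. This is a sketch, not the fix referred to in the reply above; `render_report` is a hypothetical function.

```python
from html import escape


def render_report(title, rows, display_fn=print):
    """Build a small HTML table and hand it to a caller-supplied display function."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    html = f"<h3>{escape(title)}</h3><table>{body}</table>"
    # The module never references displayHTML directly; the notebook passes it in.
    display_fn(html)
    return html


# In a notebook cell, where displayHTML is defined:
# render_report("Summary", [["rows", 42]], display_fn=displayHTML)
```

The same module then works unchanged outside Databricks, where the default `print` (or any other callable) stands in for the display function.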
Hello, we have encountered a weird issue in our (old) set-up that looks like a bug in Unity Catalog. The storage account to which we are trying to persist is configured via External Volumes. We have a pipeline that gets XML data and stores it in an RD...
I will post here what worked to resolve this error for us, in case someone else encounters it in the future. It turns out that this error appears when we use the command below while the directory 'staging2' already exists. To avo...