cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

Abela
by New Contributor III
  • 9277 Views
  • 3 replies
  • 7 kudos

Resolved! Databricks drop and remove s3 storage files safely

After dropping a delta table using DROP command in databricks, is there a way to drop the s3 files in databricks without using rm command? Looking for a solution where junior developers can safely drop a table wihout messing with the rm command where...

  • 9277 Views
  • 3 replies
  • 7 kudos
Latest Reply
jose_gonzalez
Databricks Employee
  • 7 kudos

Hi @Alina Bella​ ,Like @Hubert Dudek​ mentioned, we have a best practice guide for dropping managed tables. You can find the docs here

  • 7 kudos
2 More Replies
itay
by New Contributor II
  • 2415 Views
  • 2 replies
  • 1 kudos

Streaming with runOnce and groupBy window queries

I have a streaming job running a groupBy query with a Window of 3 days. The query is searching for different types of events.The stream is configured with runOnce and there is a job scheduled for every hour.Now, I'm not sure what data is processed ea...

  • 2415 Views
  • 2 replies
  • 1 kudos
Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @itay k​ ,You will need to take a look at the Progress Reporter. This will show the Micro-batch JSON metrics. For example, the metric called "numInputRows" which will display the number of input rows that it processed for the micro-batch. You will...

  • 1 kudos
1 More Replies
kmartin62
by New Contributor III
  • 7329 Views
  • 9 replies
  • 4 kudos

Resolved! Configure Databricks (spark) context from PyCharm

Hello. I'm trying to connect to Databricks from my IDE (PyCharm) and then run delta table queries from there. However, the cluster I'm trying to access has to give me permission. In this case, I'd go to my cluster, run the cell which gives me permiss...

  • 7329 Views
  • 9 replies
  • 4 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

"I'm trying to connect to Databricks from my IDE (PyCharm) and then run delta table queries from there."If you are going to deploy later your code to databricks the only solutions which I see is to use databricks-connect or just make development envi...

  • 4 kudos
8 More Replies
prasadvaze
by Valued Contributor II
  • 24945 Views
  • 7 replies
  • 3 kudos

Resolved! How to make delta table column values case-insensitive?

 we have many delta tables with string columns as unique key (PK in traditional relational db) and we don't want to insert new row because key value only differs in case. Its lot of code change to use upper/lower function on column value compare (in ...

  • 24945 Views
  • 7 replies
  • 3 kudos
Latest Reply
lizou
Contributor II
  • 3 kudos

Well, the unintended benefit is now I am using int\big int as surrogate keysfor all tables (preferred in DW). All joins are made on integer data types. Query efficiency is also improved.The string matching using upper() is done only on ETL when com...

  • 3 kudos
6 More Replies
Nuthan_1994
by New Contributor II
  • 4493 Views
  • 3 replies
  • 3 kudos

Resolved! Installing new libraries on Azure Databricks Clusters

Hi Everyone,I was trying to install the newest python version on the Databricks Clusters and it has the runtime version 7.3 LTS, but no matter how many times I try it keeps installing the 3.7.5 version of python.I know that Runtime version 7.3 LTS co...

  • 4493 Views
  • 3 replies
  • 3 kudos
Latest Reply
dazfuller
Contributor III
  • 3 kudos

I've done this before using a custom docker image, but even then the runtime itself continues to use the version of python 3 which is installed as part of the OS. The easiest way to get to a newer version is to use a newer runtime. If you're sticking...

  • 3 kudos
2 More Replies
Anonymous
by Not applicable
  • 1931 Views
  • 1 replies
  • 1 kudos

Resolved! Access to Cluster Logs for non-admins

Suppose I have a DevOps team that needs near real-time access to cluster logs to troubleshoot job failures. What is the best way for me to grant access to view logs without granting them admin access?

  • 1931 Views
  • 1 replies
  • 1 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Please use logging option and set destination for sending logs in cluster settings to other Azure Blob or S3 storage (need to be mounted first):

  • 1 kudos
User16857281869
by New Contributor II
  • 2743 Views
  • 1 replies
  • 1 kudos

Resolved! Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my structures streaming job?

Its usually one or more of the following reasons:1) If you are streaming into a table, you should be using .Trigger option to specify the frequency of checkpointing. Otherwise, the job will call the storage API every 10ms to log the transaction data...

  • 2743 Views
  • 1 replies
  • 1 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

please mount cheaper storage (LRS) to custom mount and set there checkpoints,please clear data regularly,if you are using forEac/forEatchBatchh in stream it will save every dataframe on dbfs,please remember not to use display() in production,if on th...

  • 1 kudos
User16857281869
by New Contributor II
  • 2402 Views
  • 1 replies
  • 1 kudos

Resolved! What is the best way to do time series analysis and forecasting with Spark?

We have developed a library on spark which makes typical operations on time series much simpler. You can check the repo in Github for more info. You could also check out one of our blogs which demos an implementation of a forecasting usecase with S...

  • 2402 Views
  • 1 replies
  • 1 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Currently on databricks there is MLFlow with forecasting option - please check it.

  • 1 kudos
brickster_2018
by Databricks Employee
  • 1738 Views
  • 1 replies
  • 0 kudos
  • 1738 Views
  • 1 replies
  • 0 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

This is a lit of configuration keys to enable or alter the blacklist mechanism:spark.blacklist.enabled – set to Truespark.blacklist.task.maxTaskAttemptsPerExecutor (1 by default)spark.blacklist.task.maxTaskAttemptsPerNode (2 by default)spark.blacklis...

  • 0 kudos
DievanB
by New Contributor
  • 2138 Views
  • 1 replies
  • 0 kudos

pyspark: How to run selenium in UDF

Hi all, I am building a webscraper to get prices of certain EAN's from the amazon website. Therefore I use selenium to get the product links. I wrote te following function to get the productlinks based on a EAN: def getProductLinkAmazonPY(EAN): st...

  • 2138 Views
  • 1 replies
  • 0 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

UDF functions are serialized and then executed on executors. I don't think it will be possible with Selenium.

  • 0 kudos
Emre
by New Contributor II
  • 1799 Views
  • 1 replies
  • 2 kudos

Resolved! The license of JDBC connector for BI vendors

Hey all,We would like to support Databricks in our BI tool, which is an open-source Java application. (See https://github.com/metriql/metriql)In order to connect Databricks, we need to use the JDBC connector similar to the other BI tools such as Look...

  • 1799 Views
  • 1 replies
  • 2 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

It doesn't look so bad after all (mean terms and conditions on https://databricks.com/jdbc-odbc-driver-license )but I think the best solution is to open ticket via https://databricks.com/company/contact

  • 2 kudos
missyT
by New Contributor III
  • 1926 Views
  • 1 replies
  • 3 kudos

Is there a reason lists don't have a .sum() method?

I do a lot of work with numpy arrays and pytorch tensors, but occasionally throw some native lists around. I naturally want to write <list>.sum(), which would work for these other third-party iterables, but doesn't work for native lists.It'd be very ...

  • 1926 Views
  • 1 replies
  • 3 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

I think reason is that list can contain different type of objects than just integers and floats (so nested lists, string and all possible other kind of objects) so it doesn't make sense to implement .sum method as it would fail in many cases.

  • 3 kudos
Anonymous
by Not applicable
  • 732 Views
  • 0 replies
  • 0 kudos

www.jamboreeindia.com

Jamboree is the leading institute offering specialized classroom and online test prep solutions for study abroad entrance exams like GMAT, GRE, SAT, TOEFL and IELTS.https://www.jamboreeindia.com/

  • 732 Views
  • 0 replies
  • 0 kudos

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now
Labels