Databricks Community

Taha_Hussain · 10-05-2022

Register for Databricks Office Hours

October 12: 8:00 - 9:00 AM PT | 3:00 - 4:00 PM GMT

October 26: 11:00 AM - 12:00 PM PT | 6:00 - 7:00 PM GMT

Databricks Office Hours connects you directly with experts to answer all your Databricks questions.

Join us to:

• Troubleshoot your technical questions

• Learn the best strategies to apply Databricks to your use case

• Master tips and tricks to maximize your usage of our platform

Register now!

Taha_Hussain · 10-12-2022

Here are some of the Questions and Answers from the 10/12 Office Hours (note: certain questions and answers have been condensed for reposting purposes):

Q: What is the best approach for moving data from on-prem S3 storage into cloud blob storage and then into Delta tables? Is there any Databricks sample code available for this use case?

A: This is not really Databricks-specific. You could of course read directly from on-prem using spark.read or JDBC, but that performs poorly as a mechanism for moving terabytes of data. For the bulk move you would be better off relying on cloud-native tools; on AWS, something like their Cloud Data Migration service. That way you can move the majority of the data in the most performant way. For ongoing loads, you can also set up CDC sources (if the source is an RDBMS, for example) that dump their logs to S3, and then use something like Auto Loader in Databricks to ingest them from S3.
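A minimal sketch of the ongoing-ingestion half of this answer, assuming the CDC files land as JSON in S3. The paths, table name, and both helper names are hypothetical; `spark` is the session a Databricks notebook provides, and the `cloudFiles` source is only available on Databricks.

```python
def auto_loader_options(file_format, schema_location):
    """Build the reader options for an Auto Loader (cloudFiles) stream."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_to_delta(spark, source_path, target_table, checkpoint_path,
                    file_format="json"):
    """Incrementally load new files under source_path into a Delta table."""
    options = auto_loader_options(file_format, f"{checkpoint_path}/schema")
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable(target_table)
    )
```

In a notebook this might be called as `ingest_to_delta(spark, "s3://my-bucket/cdc/", "bronze.cdc_events", "s3://my-bucket/_checkpoints/cdc")`, where every argument is a placeholder. The checkpoint is what lets each run pick up only files it has not seen before.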

Q: We have a big data source. Sometimes we just need to find distinct field values within this data source for one-off situations. Even after optimizing for partitions, the query takes a long time. Are there any best practices or guides that we can follow to ease this process?

A: This is the price of querying pretty much any source that's not Delta. With Delta you can ANALYZE the table, and the optimizer can use those statistics, along with the statistics captured in the Delta log, to skip files and sometimes answer from the metadata alone. For any other source, you would most likely have to read the full data every time you need this operation, which is expensive. If the source is an RDBMS, then unless Spark can push the query down to the RDBMS engine, you end up with the same problem. Is it possible for you to copy that data over to cloud storage (as a Delta table) and then do your analysis there?
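Once the data is in a Delta table, collecting statistics and running the one-off query could look roughly like this. It is a sketch only: the helper names are hypothetical, `spark` is the notebook session, and how much the optimizer can skip depends on the query and the table layout.

```python
def analyze_sql(table, columns):
    """SQL that collects column-level statistics for the optimizer."""
    cols = ", ".join(columns)
    return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {cols}"


def distinct_sql(table, column):
    """SQL that returns the distinct values of one column."""
    return f"SELECT DISTINCT {column} FROM {table}"


def distinct_values(spark, table, column):
    """Refresh statistics for the column, then fetch its distinct values."""
    spark.sql(analyze_sql(table, [column]))
    rows = spark.sql(distinct_sql(table, column)).collect()
    return [row[0] for row in rows]
```

Table and column names are interpolated directly into the SQL here, so this is only suitable for trusted, hard-coded identifiers.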

Q: My question is for Databricks data cleansing, how do you recommend checking US addresses? Does Databricks provide a service or have a recommendation for the cleansing of addresses?

A: You will need to read the data and define your own filter logic. There's no Databricks-specific or even Spark-specific way to do this. You would need to use an open source library that can do it, or pay for a true address cleansing service such as LexisNexis. I'm not sure which open source libraries (if any) exist for this.
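As an illustration of what "define your own filter logic" might mean at its simplest, here is a plain Python format check. It is an assumption-laden sketch: the function name is hypothetical, and it only tests that the state code and ZIP code are well formed, not that the address exists. Real cleansing needs postal reference data or a paid service.

```python
import re

# Two-letter codes for the 50 states plus DC.
US_STATES = frozenset("""
AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO
MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY DC
""".split())

# Five digits, optionally followed by a ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def looks_like_us_address(state, zip_code):
    """Return True if the state code and ZIP code are well formed."""
    if state is None or zip_code is None:
        return False
    state_ok = state.strip().upper() in US_STATES
    zip_ok = ZIP_PATTERN.fullmatch(zip_code.strip()) is not None
    return state_ok and zip_ok
```

A check like this can be applied row by row to flag records for review before they reach a downstream table.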

Q: What are the limitations of shell commands in Databricks? Not sure if the Databricks people present here are familiar with it, but I successfully installed texlive to generate PDFs. However, when using the `pdflatex` command on a specific file, I get a weird error that does not appear when I run the same code locally. The log ends with this: \openout4 = `report.ist'. {/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map} . No error is shown in the log. However, the Databricks console also adds this text after the log ends: "pdflatex: report: Operation not supported". report.tex is the file I am trying to process.

A: The error message suggests that Databricks is unable to proceed with the operation and that the PDF generation is likely failing. Checking the driver logs from the time you got this error will help you understand what occurred and why.

Q: What would you suggest as the preferred method to upsert values into a larger Delta table from a smaller one based on a single key column? (The larger table has tens of thousands of rows, the smaller one has hundreds, and both have up to 2 thousand columns.)

A: If you are using Delta, MERGE will do this for you as efficiently as possible; you provide that key as the join key for the merge. Note that this still means rewriting a whole file for a one-row change, even if the file is 1 GB. That is just how Parquet files have to be written, because you can't update a Parquet file in place. Databricks has a feature coming in preview called "Deletion Vectors" that is going to make this much more efficient by rewriting less and using logical flags to determine whether a row was updated (rewriting only the row that needs to be rewritten). Also make sure to add as much conditional logic as possible in the merge, so you don't rewrite a row when you don't need to.
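A hedged sketch of such a merge, written as a small SQL builder so the conditional logic is explicit. The function and table names are hypothetical, and it assumes the source and target share the same columns so that `UPDATE SET *` and `INSERT *` apply. The extra `WHEN MATCHED AND (...)` condition skips rows whose compared columns are unchanged.

```python
def build_merge_sql(target, source, key, compare_columns):
    """Build a Delta MERGE that only updates rows that actually changed."""
    if not compare_columns:
        raise ValueError("compare_columns must not be empty")
    # <=> is the null-safe equality operator, so NULL vs NULL counts as equal.
    changed = " OR ".join(
        f"NOT (t.{col} <=> s.{col})" for col in compare_columns
    )
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND ({changed}) THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )
```

In a notebook the statement could be run with `spark.sql(build_merge_sql("big_table", "small_table", "id", ["price", "status"]))`, with all names being placeholders. With thousands of columns the change condition becomes long; comparing a precomputed row hash instead is one alternative, though that is a design choice not covered in the answer above.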

Q: Someone just wrote in Slack: “Is it just me or is Databricks slow today?” The person is a “user” describing her experience, without really targeting what is slow or providing more details. How should I investigate this?

A: In either case you would need to create a support ticket or contact your account team (if you haven't paid for support). "Slow" could be a UI issue, in which case it is a web app issue on the Databricks control plane; if jobs are slower, then you or support would need to dig into the logs. Support has access to more backend logs than you do, which helps them diagnose and fix the problem. Also check status.databricks.com (I suggest subscribing to it) to see if there are any announced outages for your cloud/region.

Q: How do I connect to OneDrive data in a Databricks notebook?

A: Please check this link which details the API that allows users to download files from OneDrive.

Q: We’re using Terraform to create infrastructure in Azure (multiple Key Vaults for different “tenants”) and need to set up a secret scope for each. However, only a user AAD token can be used to create Key Vault-backed secret scopes. What is your recommended way to deal with this?

A: Currently, it's only possible to create Azure Key Vault scopes with Azure CLI authentication and not with a service principal. That means az login --service-principal --username $ARM_CLIENT_ID --password $ARM_CLIENT_SECRET --tenant $ARM_TENANT_ID won't work either. This is a limitation of the underlying cloud resources. You can have a look at the document here for additional details.

Q: Are there any certification practice exams I can take?

A: You can check our certifications website; we offer practice tests for all the exams.

Q: What would be the fastest way to ingest data from cloud storage when the path locations don't follow lexicographical ordering and thousands of files and directories are created daily, which makes listing all directories inefficient?

A: Auto Loader will be the best tool. Please check the docs here.
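For the directory-listing concern specifically, Auto Loader has a file notification mode, enabled with the `cloudFiles.useNotifications` option, which discovers new files from cloud storage events instead of listing directories. A rough sketch, assuming the workspace has the cloud permissions that mode requires; the helper names and paths are hypothetical, and `spark` is the notebook session.

```python
def notification_mode_options(file_format, schema_location):
    """Auto Loader options that use file notifications instead of listing."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Discover new files from storage events rather than directory scans.
        "cloudFiles.useNotifications": "true",
    }


def read_new_files(spark, source_path, options):
    """Open a streaming reader over source_path with the given options."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

Whether notification mode is faster than directory listing for a given layout is worth measuring; the docs describe the setup and trade-offs of each mode.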

Q: Are notebooks the preferred way of running Spark jobs, or are JARs preferred?

A: You can use notebooks or JAR files. Both are great tools and you can use either one.

Hubert-Dudek · 10-14-2022

Thank you for the great list of questions.
