Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.


Taha_Hussain
Valued Contributor II

Register for Databricks Office Hours

September 28: 11:00 AM - 12:00 PM PT | 6:00 - 7:00 PM GMT

Databricks Office Hours connects you directly with experts to answer your Databricks questions.

Join us to:

• Troubleshoot your technical questions

• Learn the best strategies to apply Databricks to your use case

• Master tips and tricks to maximize your usage of our platform

Register now!

2 REPLIES

Taha_Hussain
Valued Contributor II

Here are some of the questions and answers from the September 28 Office Hours. Join our October 12 session to get your questions answered!

Q: Is there any benefit to transforming data via DataFrames in PySpark vs. using Databricks SQL?

A: Performance-wise they will match, but usability-wise SQL is often the better option due to its wide adoption and readability.
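
For illustration, here is a minimal sketch of the same aggregation written both ways (the table and column names are hypothetical; in a Databricks notebook the spark session is already defined). Both forms compile to the same Spark plan, which is why performance is equivalent:

```python
from pyspark.sql import functions as F

# DataFrame API (hypothetical "sales" table and columns)
df_api = (spark.table("sales")
          .filter(F.col("region") == "EMEA")
          .groupBy("product")
          .agg(F.sum("amount").alias("total_amount")))

# Equivalent SQL
df_sql = spark.sql("""
    SELECT product, SUM(amount) AS total_amount
    FROM sales
    WHERE region = 'EMEA'
    GROUP BY product
""")
```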

Q: I have a PySpark streaming script written in Databricks, and the script keeps dying. I have a hunch that it's the cluster it's running on. What is the proper way to run streaming in Databricks? What is the proper way to monitor Spark Streaming?

A: Great question. 1) Check with your admin to see whether you have permission to attach to the cluster. 2) If you can attach to the cluster, check its configuration; it may have very little memory, and Spark Structured Streaming usually works better on memory-optimized machines. 3) If neither of those is the issue, check the auto-termination time. It could be that the query is not getting attached, so your notebook is not attached to the cluster and the cluster terminates. 4) I would also recommend using Auto Loader or Delta Live Tables instead, which are better ways to handle and monitor streaming.
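
As a minimal sketch of the Auto Loader approach (paths, format, and table name below are hypothetical), Auto Loader tracks which files it has processed and surfaces progress in the streaming UI, which makes stalled or dying streams easier to diagnose:

```python
# Read new files incrementally with Auto Loader (hypothetical paths and format).
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/_schema")
          .load("/mnt/raw/events/"))

# Write to a Delta table; availableNow requires a recent runtime
# (use .trigger(once=True) or a processingTime trigger on older runtimes).
(stream.writeStream
 .option("checkpointLocation", "/mnt/checkpoints/events")
 .trigger(availableNow=True)
 .toTable("bronze.events"))
```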

Q: I can access data on dbfs, but can't see it in the UI. Why?

A: This may be due to a lack of object privileges from your workspace admin. Please check out this doc for more information on object privileges.

Q: Is there a way to create a template notebook so that, when creating a new notebook, we can start from it? It could contain a standard structure with sections such as read, transform, and write, plus other required documentation. Even better would be if Databricks could suggest creating a notebook from an existing custom-made template.

A: We currently do not have a feature for creating template notebooks, but you can use the notebook's HTML/Markdown cells to annotate what each cell is for and differentiate sections. This doc should help. You can also check out our Notebook gallery. This is a great idea to submit to our Idea Portal.

Q: We've been having a regular issue with a production workflow. On some runs, we encounter the error Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 54 in stage 127.1 failed 4 times, most recent failure: Lost task 54.3 in stage 127.1 (TID 7380) (10.149.195.104 executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: worker lost. When retrying the workflow it always succeeds. Do you know of a permanent fix for this issue? Additionally, the job is running on i3.4xlarge. Would z1d work better in this case?

A: You need to check the cluster usage. Try the Ganglia metrics and monitor the logs from there. Are you seeing any GC failures? If moving to a memory-optimized cluster does not help, you may need to pull more logs. The failure may be caused by tasks trying to consume more memory than the cluster has, so consider switching to a memory-optimized instance type. As for the instance choice, z1d should work, though this depends on the job you are running. Monitoring Ganglia will give you more insight into how the cluster and executors are behaving.
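
As a quick illustrative check before resizing (the setting names are standard Spark settings; the example values in the comments are placeholders, not recommendations):

```python
# Inspect the executor memory and shuffle settings currently in effect.
print(spark.sparkContext.getConf().get("spark.executor.memory", "not set"))
print(spark.conf.get("spark.sql.shuffle.partitions"))

# If Ganglia shows executors dying under GC pressure, memory-related settings
# are usually adjusted in the cluster's Spark config, e.g.:
#   spark.memory.fraction 0.8
#   spark.sql.shuffle.partitions 400
```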

Q: Recommendation options for Auto Loader: Use case is the customer places many data files in the raw zone in one directory, the files are overwritten once a day... many data sets are in this directory.... I assume that I should create many Auto Loader jobs specific to each data set (file). How can Auto Loader look into a directory and pull selected files... One point of confusion is that Auto Loader gets notified when a new file lands BUT there are different data sets landing... so how do I get the right Auto Loader process to fire??

A: If I am understanding correctly, your use case is asking about file-name pattern matching. This is not currently supported; each data set is expected to be in a separate folder. You can find the options to help with what you are trying to achieve here. A sketch of the one-stream-per-folder pattern follows below.
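
A minimal sketch of that layout, assuming each data set lands in its own directory (folder layout, file format, and table names are hypothetical):

```python
# One Auto Loader stream per data set, each reading its own folder.
datasets = ["orders", "customers", "shipments"]

for name in datasets:
    (spark.readStream
     .format("cloudFiles")
     .option("cloudFiles.format", "csv")
     .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{name}/_schema")
     .load(f"/mnt/raw/{name}/")          # each data set in its own directory
     .writeStream
     .option("checkpointLocation", f"/mnt/checkpoints/{name}")
     .trigger(availableNow=True)
     .toTable(f"bronze.{name}"))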

Q: A question about Spark. Are Vectorized Spark Native functions (through pySpark), always faster than UDF using Scala?

A: Unfortunately there is no single answer that fits every case, because partitioning and data/key distribution also matter. In general, yes, vectorized native functions are expected to outperform UDFs, but this can be affected by other circumstances.
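
To illustrate the difference being compared (column names and sizes are arbitrary), the built-in expression is evaluated inside the engine, while the Python UDF serializes rows out to a Python worker, which is usually (not always) slower:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
import math

df = spark.range(1_000_000).withColumn("x", F.rand())

# Built-in (native) expression: evaluated inside the engine.
native = df.withColumn("y", F.sqrt("x"))

# Equivalent Python UDF: rows cross the JVM/Python boundary.
py_sqrt = F.udf(lambda v: math.sqrt(v), DoubleType())
with_udf = df.withColumn("y", py_sqrt(F.col("x")))
```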

Q: When using azure private link, why is it suggested to use a different vnet for front end private endpoint? If we use a different vnet for front end private endpoint, should that vnet be shared across all workspaces in the region? Or are you suggesting 2 vnets per workspaces that use private link?

A: It is applied per MG. This is due to a back-end need for secure cluster connectivity. You can find additional details on the architecture here.

Taha_Hussain
Valued Contributor II

Cont...

Q: Do generated columns in Delta Live Tables include IDENTITY columns?

A: My understanding is that generated columns in Delta Live Tables do not contain IDENTITY columns. Here is more on generated columns in DLT.
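
For contrast, here is a sketch of a generated column on a Delta table, which is computed from another column rather than auto-incremented like an IDENTITY column (table and column names are hypothetical):

```python
# Generated column derived from event_time (hypothetical table).
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_time TIMESTAMP,
        event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
    ) USING DELTA
""")
```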

Q: We store raw data for each customer in a separate folder on S3. Even though it has some pros, reading data folder by folder for processing data is taking a lot of time. Is it recommended to save data this way, or you suggest a better way? (maybe having one parquet file and append new data to it? or...?)

A: In this case I think you are spot on. When you are analyzing and querying the data, this methodology can speed up results. You can keep the current format, or you can store it in Delta format, since Delta is more optimized and gives you the ability to go back in time with time travel. It also has other capabilities such as Z-ordering and OPTIMIZE.
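
An illustrative way to consolidate the per-customer Parquet folders into one Delta table and then use time travel (the paths, the customer_id column, and the table name are all hypothetical assumptions about your layout):

```python
# Assumes each record carries a customer_id column to partition by.
(spark.read.parquet("s3://my-bucket/raw/customers/*/")
 .write.format("delta")
 .partitionBy("customer_id")
 .mode("overwrite")
 .saveAsTable("silver.customer_events"))

# Time travel: query an earlier version of the table.
earlier = spark.sql("SELECT * FROM silver.customer_events VERSION AS OF 0")
```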

Q: I'm a contractor who expects to use Databricks for customers. How should I think about user and data management so that I can smoothly hand off projects to the client once they are completed?

A: You should use Unity Catalog, which allows you to govern all your data and data assets such as notebooks and machine learning models. In addition, you can use Delta Sharing, which works hand in hand with Unity Catalog and lets you share model artifacts, notebooks, tables, or underlying files with customers, whether they are on Databricks or not.

Q: A customer wants to use Databricks and Synapse together. Any advice? Can Synapse read the Delta tables generated by Databricks?

A: You can use Databricks and Synapse together, but I don't see a reason why you would use Synapse alongside Databricks unless there is a very specific requirement, because everything Synapse does, Databricks can do and more, at a lower cost. If you are keen on doing so, however, feel free to read this doc for additional details.

Q: Do you have recommended tutorials for building an image classifier based on images on S3 (including perhaps transfer learning or using existing models)? What would the main benefits of Databricks for this application? I've been looking at the SageMaker options which look pretty straightforward, and I'm not sure how deep to dig in on Databricks as an alternative.

A: Great question! We have a host of different libraries you can use to build image classifiers. Whenever you create a cluster, select an ML runtime and choose the latest long-term support (LTS) version, which includes many of these libraries. You can also install custom scripts on the cluster using init scripts.

Q: Do you have any information on the usage of the Delta Live Table? For example, the success stories. This will be useful for us to make a decision on whether or not to use delta live tables.

A: We have many customers using Delta Live Tables. For instance, one of our biggest customers is using Delta Live Tables to simplify pipeline creation. It gives you data quality expectations, makes data engineering very simple for your data engineers, and makes pipelines accessible for your data and analytics people. It also makes streaming and batch processing very easy.

Q: What is the solution for processing a big JSON file, for example a 100 GB file with complex arrays?

A: Databricks is well equipped to handle large files. I suggest you use a larger cluster, such as i3.4xlarge. Also, once you get this data into your storage, you should use commands like OPTIMIZE to distribute the data and VACUUM to remove older, unused files.
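
A sketch of one common pattern (the path and field names are hypothetical): flatten the nested array once into a Delta table so downstream queries do not re-parse the raw JSON every time.

```python
from pyspark.sql import functions as F

raw = spark.read.json("s3://my-bucket/raw/big_file.json")

flat = (raw
        .withColumn("item", F.explode("items"))   # unpack the complex array
        .select("id", "item.*"))

flat.write.format("delta").mode("overwrite").saveAsTable("bronze.big_json")

# Compact small files and clean up unused ones afterwards.
spark.sql("OPTIMIZE bronze.big_json")
spark.sql("VACUUM bronze.big_json")
```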

Q: I was a little confused on the use of the "pathGlobFilter" for Auto Loader, could you explain what this filter is used for?

A: These filters help you filter the data. If all raw files are in one directory, you can pick up only certain files for each Auto Loader job. If the filter is not specified correctly, it will not match the intended pattern, so data you do not want might flow into your Delta Lake, leading to lower performance. Here is a community post with more details.
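
A hypothetical example: only pick up CSV files whose names start with "orders_" from a mixed landing directory, ignoring the other files in the same folder (paths and names are illustrative):

```python
orders = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("pathGlobFilter", "orders_*.csv")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schema")
          .load("/mnt/raw/landing/"))
```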

Q: How do you pull from another branch while you are on a different branch? For example, if I am working on a branch and the master branch contains changes ahead of mine, and I want to incorporate them into my branch, how do I rebase my branch with the master branch in Databricks?

A: Say, for example, your colleague is working on lines 10 to 15 and you are working on lines 20 to 25. Once they push their changes, a pull request against the master branch is created, you review and approve it, and the changes are merged.

If you want your colleague's changes on lines 10 to 15 to be reflected in your code as well, you will have to wait for those changes to be approved and merged into the master branch first, and then pull from master and continue working on top of it.

Q: How can I use Delta Table together with S3 Glacier (for old dates records)?

A: If you want to retain some of those old records for reference, you can push that data into S3 Glacier. Having said that, retrieval times will still follow the SLAs that AWS specifies, so if you need to query this data for any reason you may have to wait that long.
