Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi Team, I am trying to get the latest files from an ADLS mount point directory. I am not sure how to extract the latest files and their last modified date using PySpark from an ADLS Gen2 storage account. Please let me know asap. Thanks!
I am looking forward to your re...
Hi @pankaj92, I wrote some Python code to pick the latest file from the mnt location:
    import os
    path = "/dbfs/mnt/xxxx"
    filelist = []
    for file_item in os.listdir(path):
        filelist.append(file_item)
    file = len(filelist)
    print(filelist[file-1])
Thanks
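Note that `os.listdir` returns entries in arbitrary order, so the last element of the list is not necessarily the most recently modified file. A minimal sketch of selecting by modification time instead, assuming the mount is reachable through the local `/dbfs` path as in the reply above (the function name `latest_file` is hypothetical):

```python
import os


def latest_file(directory):
    """Return the path of the most recently modified regular file, or None if there is none."""
    with os.scandir(directory) as entries:
        # Collect (modification time, path) pairs for regular files only.
        files = [(e.stat().st_mtime, e.path) for e in entries if e.is_file()]
    if not files:
        return None
    # max() compares the modification time first, so the newest file wins.
    return max(files)[1]
```

Calling `latest_file("/dbfs/mnt/xxxx")` would then return the newest file in that directory; subdirectories are ignored in this sketch.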
I'm using Databricks on AWS. Our clusters are typically in PENDING state for 5-8 minutes after they are created. I would like to find out why (EC2 instance provisioning? Docker image download is slow? ...?). The cluster logs are not helpful enough be...
Hi @Sergey Ivanychev, while the cluster is starting, you can see the status on the compute page. Hover the mouse pointer over the green rotating circle to the left of the cluster name. It will show a notification of what is happening on the cluster. Wh...
Hello, we are having issues installing the pdpbox library on a fresh cluster. This includes trying to upload and install a whl file, or using pip in a workbook. I have attached an example of an error received. Can anybody assist with installing the...
PDPbox is updated rarely, and it requires an older version of matplotlib (3.1.1): https://github.com/SauceCat/PDPbox. It tries to install but fails because matplotlib requires pkgconfig. The solution to that is to use the Machine Learning runtime. There it will...
When updating an expired Azure DevOps personal access token (PAT) for git integration, I get the error message "Failed to save. Please try again.". The error persists with different tokens. Previously (months ago), updating the token did not result i...
I installed the CLI but am unable to configure it to connect to my instance, as I cannot find the "Generate Access tokens" option on the User Settings page. The documentation does not say whether this feature is disabled for the community edition.
Hi @Al Jo, we understand your interest in learning Databricks. However, the community edition is limited in features. Certain features are available only in the paid version. If you are interested in using the full features, then I would suggest you g...
I want to know if what I describe below is possible with AutoLoader on Google Cloud Platform. Problem Description: We have GCS buckets for every client/account. Inside these buckets is a path/blob for each client's instances of our platform. A clie...
Hi, I'm trying to load this JSON file, which contains the colon character in its name: file_name.2022-03-05_11:30:00.json, but I get the error in the screenshot below saying that there is a relative path in an absolute URL. Any idea how to read this file...
Hi @Laura Blancarte, I hope that @Pearl Ubaru's answer helped you resolve your issue. Please let us know if you need more help on this.
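For the colon-in-filename question above, one possible workaround (not necessarily the one given in the referenced answer) is to bypass the Hadoop path parser and read the file with plain Python file I/O, which treats the name as an ordinary string. A minimal sketch, assuming the file is a single JSON document small enough to load on the driver and is reachable through a local path such as the `/dbfs` mount (the function name `read_json_local` is hypothetical):

```python
import json


def read_json_local(path):
    """Load one JSON document via local file I/O; colons in the file name are not interpreted as a URI scheme."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

The returned object is a regular Python dict or list, which can then be converted to a DataFrame if needed.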
Hi, I am trying to take advantage of the treasure trove of information that the metastore contains and take some actions to improve performance. In my case, the metastore is managed by Databricks; we don't use an external metastore. How can I connect to ...
@AKSHAY PALLERLA you can get the JDBC/ODBC information from the cluster configuration. On the cluster configuration page, under Advanced options, there is a JDBC/ODBC tab. Click on that tab and it should give you the details you are looking ...
tl;dr: A cell that executes purely on the head node stops printing output during execution, but output still shows up in the cluster logs. After execution of the cell, Databricks does not notice the cell is finished and gets stuck. When trying to canc...
As that library works on pandas, the problem may be that it doesn't support pandas on Spark. On the local version, you probably use non-distributed pandas. You can check the behavior by switching between:
    import pandas as pd
    import pyspark.pandas as pd
Summary of the problem: When mounting an S3 bucket via Terraform, the creation process frequently times out (running beyond 10 minutes). When I check the Log4j logs in the GP cluster I see the following error message repeated:```22/07/26 05:54:43 ER...
I'm running a Java application that registers a CSV table with Hive and then checks the number of rows imported. It's done in several steps: Statement stmt = con.createStatement(); .... stmt.execute( "CREATE TABLE ( <definition> < > ); ..... ResultSet rs...
@Reto Matter Are you running a jar job or using dbconnect to run the Java code? Please describe how you are making the connection and provide the full exception stack trace.
Hey all, my aim is to validate a given SQL string without actually running it. I thought I could use the `EXPLAIN` statement to do so. So I tried using the `databricks-sql-connector` for Python to explain a query, and so determine whether it's valid or ...
Hey guys, we're considering Delta Lake as the storage for our project and have a couple of questions. The first one is what's the pricing for Delta Lake: I can't seem to find a page that says x amount costs y. The second question is more technical - if we...
Delta Lake itself is free. It is an open-source table format. But you will have to pay for storage and compute, of course. If you want to use Databricks with Delta Lake, it will not be free unless you use the community edition. Depending on what you are planning to...
Try looking into the Structured Streaming API. There you will learn how to join streams and static data, how to set triggers for the streams, micro-batching, and other things that are important to the reliability of your application. Structured Str...