Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import DataFrame, Column
from pyspark.sql.types import Row
import dlt
S3_PATH = 's3://datalake-lab/xxxx/'
S3_SCHEMA = 's3://datalake-lab/xxxx/schemas/'
@dl...
How to check if a file exists in DBFS? Let's write a Python function to check whether the file exists or not:

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except ...
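The truncated helper above can be fleshed out as follows. This is a sketch assuming the usual Databricks behavior, where `dbutils.fs.ls` raises an error whose message contains `java.io.FileNotFoundException` for a missing path; the `lister` parameter is an illustrative injection point added here so the sketch is self-contained (in a notebook you would just call `dbutils.fs.ls` directly).

```python
def file_exists(path, lister=None):
    """Return True if `path` exists in DBFS, False if it does not.

    `lister` defaults to dbutils.fs.ls inside a Databricks notebook;
    it is injectable here purely for illustration.
    """
    if lister is None:
        lister = dbutils.fs.ls  # noqa: F821 - notebook-provided global
    try:
        lister(path)
        return True
    except Exception as e:
        # dbutils surfaces a missing path as a Java FileNotFoundException
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise  # other failures (permissions, network) should propagate
```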
At the moment, Azure Databricks has the feature to use Azure AD login for the workspace and create single-user clusters with Azure Data Lake Storage credential passthrough. But this can only be used for Data Lake Storage. Is there already a way, or are...
I have exactly the same issue. I have the need to call a protected API within a notebook but have no access to the current user's access token. I've had to resort to nasty workarounds involving installing and running the Azure CLI from within the not...
I'm trying to extract text data from an image file in a Databricks notebook. I have installed the libraries below using pip: %pip install pytesseract tesseract pillow --upgrade. But it didn't work and threw the error below: pytesseract.pytesseract.Tessera...
Hi @neha_ayodhya, can you please try the following via an init script on the Databricks cluster:
sudo apt-get update -y
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y libtesseract-dev
/databricks/python/bin/pip install pytesseract
a...
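One way to register the commands above as a cluster init script is to write them to a file the cluster can read and then add that path under the cluster's Advanced options. A minimal sketch follows; the DBFS path is an assumption (pick any location your workspace allows; newer workspaces may require init scripts to live in workspace files rather than DBFS), and `put` is injectable here only so the sketch is self-contained (in a notebook you would call `dbutils.fs.put` directly).

```python
# Init script body, mirroring the apt-get/pip steps above.
INIT_SCRIPT = """#!/bin/bash
set -e
sudo apt-get update -y
sudo apt-get install -y tesseract-ocr libtesseract-dev
/databricks/python/bin/pip install pytesseract
"""

def write_init_script(put=None, path="dbfs:/databricks/init/install-tesseract.sh"):
    """Write the init script. `put` defaults to dbutils.fs.put in a
    Databricks notebook; the path is a hypothetical example."""
    if put is None:
        put = dbutils.fs.put  # noqa: F821 - notebook-provided global
    put(path, INIT_SCRIPT, True)  # True = overwrite if it already exists
    return path
```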
Hi, I'm trying to purge a table of stale data. My Databricks host is on cloud.databricks.com. I've set delta.deletedFileRetentionDuration = interval 7 days, deleted many (billions of) rows, and followed up with VACUUM tablename RETAIN 168 HOURS; however, my ...
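For reference, the two steps this question describes (shortening the deleted-file retention, then vacuuming past it) can be scripted. A minimal sketch, with a hypothetical table name; in a notebook each statement would be run with `spark.sql(stmt)`. Note that VACUUM only removes data files that are both no longer referenced by the table's current version and older than the retention window.

```python
def purge_statements(table, retention_days=7):
    """Build the SQL for the purge workflow described above: lower the
    Delta deleted-file retention on `table`, then VACUUM past it."""
    hours = retention_days * 24  # VACUUM RETAIN takes hours
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.deletedFileRetentionDuration' = 'interval {retention_days} days')",
        f"VACUUM {table} RETAIN {hours} HOURS",
    ]

# In a notebook:
# for stmt in purge_statements("my_table"):
#     spark.sql(stmt)
```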
Hello! I'm using Databricks with Azure. On a daily basis, I check the status of numerous jobs through the Spark UI. At the moment, the Spark UI does not refresh by itself; I have to refresh the webpage to get the latest status. I wonder if there is a ...
Hello, I have been reading the Databricks Auto Loader documentation about the cloudFiles.backfillInterval configuration, and still have a question about a specific detail of how it works. I was only able to find examples of it being set to 1 day or 1 week. ...
Hey @therealchainman, the last backfill time (lastBackfillFinishTimeMs) is recorded as part of the checkpoint's offset files; this lets Auto Loader know when the last backfill was triggered and when to trigger the next periodic backfill. Hope this an...
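For context, the setting under discussion is just a stream option. Below is a minimal sketch of how it is typically supplied; the helper and the source path in the usage comment are illustrative, not from the original thread. Setting `cloudFiles.backfillInterval` asks Auto Loader to periodically re-list the source directory to catch files that event notifications may have missed.

```python
def autoloader_options(fmt, backfill_interval="1 week"):
    """Build Auto Loader options, including the periodic backfill
    interval discussed above (e.g. '1 day' or '1 week')."""
    return {
        "cloudFiles.format": fmt,
        "cloudFiles.backfillInterval": backfill_interval,
    }

# In a notebook (source path hypothetical):
# df = (spark.readStream.format("cloudFiles")
#       .options(**autoloader_options("json", "1 day"))
#       .load("s3://bucket/incoming/"))
```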
I have a number of functions in a schema in a catalog in Unity Catalog. Is there a programmatic way to change the owner of these functions without having to do it manually via the GUI?
Check this notebook out; I assume you can change it a bit to do what you want: https://docs.databricks.com/en/_extras/notebooks/source/set-owners-notebook.html. I assume you can loop through the rows in the resulting df (that has the ALTER statements),...
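Building on that approach, one possible sketch (catalog, schema, and owner names are hypothetical): list the functions from `information_schema.routines`, build `ALTER FUNCTION ... OWNER TO ...` statements, and run each with `spark.sql`. The statement builder itself is plain Python:

```python
def alter_function_owner_statements(function_names, new_owner):
    """Build ALTER FUNCTION statements for fully qualified Unity Catalog
    function names like 'catalog.schema.fn'."""
    return [
        f"ALTER FUNCTION {fn} OWNER TO `{new_owner}`"
        for fn in function_names
    ]

# In a notebook (names hypothetical):
# rows = spark.sql(
#     "SELECT routine_catalog, routine_schema, routine_name "
#     "FROM my_catalog.information_schema.routines "
#     "WHERE routine_schema = 'my_schema'").collect()
# names = [f"{r.routine_catalog}.{r.routine_schema}.{r.routine_name}"
#          for r in rows]
# for stmt in alter_function_owner_statements(names, "new_owner@example.com"):
#     spark.sql(stmt)
```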
Hi all, We have a job that combines historical tables with live tables to give us up-to-date information. It works for almost all of the tables in our source Postgres database, but there's one table that keeps giving the following error. Any ideas why...
I tried several Spark deep learning inference notebooks on Windows. I run Spark in standalone mode with one worker with 12 cores (both driver-memory and executor-memory are set to 8G). I always get the same error when applying the deep learning model to ...
Hello, I'm currently seeing a rather cryptic error message whenever I try to import the deltalake library into Databricks (without actually doing anything else). import datalake "ImportError: /local_disk0/.ephemeral_nfs/envs/pythonEnv-cbe496f6-d064-40ae...
Hi! We have a job that runs every hour. It extracts data from an API and saves it to a Databricks table. Sometimes the job fails with the error "org.apache.spark.SparkException". Here is the full error: An error occurred while calling o7353.saveAsTable.
: org.ap...
Hello, We are receiving DB CDC binlogs through Kafka and synchronizing tables in an OLAP system using the apply_changes function in Delta Live Tables (DLT). A month ago, a column was added to our table, but due to a type mismatch it's being stored incorr...
I have tried all the answers from the internet and Stack Overflow many times. I had already created the config section before this step; it passed, but the step below is not executing.
We were getting this problem when using directory-scoped SAS tokens. While I know there are a number of potential issues that can cause this problem, one potential explanation is that it turns out there is an undocumented spark setting needed on the ...