Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Constantine
by Contributor III
  • 16844 Views
  • 3 replies
  • 7 kudos

Resolved! collect_list by preserving order based on another variable - Spark SQL

I am using a Databricks SQL notebook to run these queries. I have a Python UDF like   %python   from pyspark.sql.functions import udf from pyspark.sql.types import StringType, DoubleType, DateType   def get_sell_price(sale_prices): return sale_...

Latest Reply
villi77
New Contributor II
  • 7 kudos

I had a similar situation where I was trying to order the days of the week from Monday to Sunday. I saw solutions that use Python but wanted to do it all in SQL. My original attempt was to use: CONCAT_WS(',', COLLECT_LIST(DISTINCT t.LOAD_ORIG_...

  • 7 kudos
2 More Replies
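Following up on the thread above, here is a minimal sketch of keeping collect_list in date order entirely in SQL, issued from a Python cell; the table and column names (sales, item_id, sale_date, sale_price) are hypothetical stand-ins, not the poster's schema:

```python
# Sort an array of structs by its first field (sale_date), then project out the
# prices, so the collected list preserves date order regardless of shuffle order.
ordered = spark.sql("""
    SELECT
        item_id,
        transform(
            sort_array(collect_list(struct(sale_date, sale_price))),
            x -> x.sale_price
        ) AS prices_by_date
    FROM sales
    GROUP BY item_id
""")
ordered.show(truncate=False)
```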
Hubert-Dudek
by Esteemed Contributor III
  • 14090 Views
  • 4 replies
  • 4 kudos

Workflow timeout

Always set a timeout for your jobs! It not only safeguards against unforeseen hang-ups but also optimizes resource utilization. It is equally essential to consider a threshold warning, which can alert you before a potential failure, allowing proac...

Latest Reply
sparkplug
New Contributor III
  • 4 kudos

We already have a policy and users are using clusters created from it to run their jobs. Since the policies are not based on job compute but on Power user compute, I am not able to set the job timeout_seconds.

  • 4 kudos
3 More Replies
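On the timeout advice above, a hedged sketch of what setting both a hard timeout and a duration warning can look like through the Jobs 2.1 API; the workspace URL, token, and job_id are placeholders, not values from the thread:

```python
import requests

host = "https://<databricks-instance>"
token = "<your-databricks-token>"

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # placeholder job ID
        "new_settings": {
            "timeout_seconds": 7200,  # hard stop after 2 hours
            "health": {
                # threshold warning well before the hard stop
                "rules": [{"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600}]
            },
        },
    },
)
resp.raise_for_status()
```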
SS_RATH
by New Contributor
  • 4012 Views
  • 3 replies
  • 0 kudos

I have a notebook in my workspace; how do I find out which jobs reference this particular notebook?

I have a notebook in my workspace; how do I find out which jobs reference this particular notebook?

Latest Reply
Panda
Valued Contributor
  • 0 kudos

@SS_RATH @TamD There are a couple of ways. Call the Databricks REST API - use the /api/2.1/jobs/list API to list and search through all jobs. Example: import requests workspace_url = "https://<databricks-instance>" databricks_token = "<your-databricks-t...

  • 0 kudos
2 More Replies
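Expanding the truncated snippet in the reply above into a hedged sketch: page through /api/2.1/jobs/list with tasks expanded and collect every job whose notebook task points at the notebook; the host, token, and target path are placeholders:

```python
import requests

workspace_url = "https://<databricks-instance>"
databricks_token = "<your-databricks-token>"
target = "/Workspace/Users/you@domain.com/my_notebook"  # notebook being searched for

headers = {"Authorization": f"Bearer {databricks_token}"}
params = {"limit": 25, "expand_tasks": "true"}
matches = []

while True:
    resp = requests.get(f"{workspace_url}/api/2.1/jobs/list", headers=headers, params=params)
    resp.raise_for_status()
    payload = resp.json()
    for job in payload.get("jobs", []):
        for task in job.get("settings", {}).get("tasks", []):
            if task.get("notebook_task", {}).get("notebook_path") == target:
                matches.append((job["job_id"], job["settings"]["name"]))
    next_token = payload.get("next_page_token")
    if not next_token:
        break
    params["page_token"] = next_token

print(matches)  # (job_id, job_name) pairs that reference the notebook
```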
anonymous_567
by New Contributor II
  • 2140 Views
  • 1 replies
  • 0 kudos

Retrieve file size from Azure in Databricks

Hello, I am running a job that requires reading in files of different sizes, each one representing a different dataset, and loading them into a delta table. Some files are as big as 100 GiB and others as small as 500 MiB. I want to repartition each fi...

Latest Reply
LindasonUk
New Contributor III
  • 0 kudos

You could try to utilise the dbutils files service like this: from pyspark.sql.functions import col, desc, input_file_name, regexp_replace directory = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root' files_list = dbutils.fs.l...

  • 0 kudos
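Completing the truncated idea in the reply above, a minimal sketch that lists each file's size with dbutils.fs.ls and derives a per-file repartition count; the abfss path and the ~128 MiB-per-partition target are assumptions:

```python
directory = "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root"
target_partition_bytes = 128 * 1024 * 1024  # assumed ~128 MiB per partition

for f in dbutils.fs.ls(directory):
    if f.name.endswith("/"):  # skip subdirectories
        continue
    num_partitions = max(1, f.size // target_partition_bytes)
    print(f"{f.path}: {f.size} bytes -> repartition({num_partitions})")
```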
CE
by New Contributor II
  • 1327 Views
  • 2 replies
  • 0 kudos

Resolved! How to read requirements.txt in Databricks workspace

Dear Databricks team, I want to know if there is a method in Databricks equivalent to pip install -r requirements.txt. The packages I want to install are listed in this path: /Workspace/Users/xxx@domain.com/databrick_requirement.txt. I have referred to the foll...

Latest Reply
Panda
Valued Contributor
  • 0 kudos

@CE You can't directly access /Workspace paths like a traditional filesystem path. When you specify /Workspace/Users/xxx@domain.com/databrick_requirement.txt, %pip install cannot interpret it because the %pip magic command works with DBFS paths. Foll...

  • 0 kudos
1 More Replies
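As a hedged alternative to the reply above: export the workspace file through the Workspace Export API to a local copy and install from that. The host, token, and file path are placeholders; note that a plain pip subprocess installs into the driver environment, whereas %pip install -r /tmp/requirements.txt in a separate cell gives a notebook-scoped install:

```python
import base64
import subprocess
import sys

import requests

host = "https://<databricks-instance>"
token = "<your-databricks-token>"
path = "/Users/xxx@domain.com/databrick_requirement.txt"  # workspace path, placeholder

# Download the requirements file via the Workspace Export API.
resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": path, "format": "AUTO"},
)
resp.raise_for_status()
with open("/tmp/requirements.txt", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))

# Install from the local copy.
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "/tmp/requirements.txt"])
```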
plakshmi
by New Contributor II
  • 2196 Views
  • 4 replies
  • 0 kudos

Resolved! Unable to read data from on-prem SQL Server into Databricks

I am trying to read data into a DataFrame in Databricks from an on-prem SQL Server but am facing com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host HYDNB875, port 1433 has failed. Error: "HYDNB875. Verify the connection prop...

Latest Reply
Panda
Valued Contributor
  • 0 kudos

@plakshmi Along with what @Rishabh-Pandey mentioned, follow these additional steps: if HYDNB875 can't be resolved, try using the server's IP address. Check for network routing issues between Databricks and the SQL Server using traceroute or ping. Review t...

  • 0 kudos
3 More Replies
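For context on the connectivity checks above, a minimal sketch of the JDBC read itself; the IP, database, table, and credentials are placeholders, and it assumes the cluster can actually reach the server over port 1433:

```python
jdbc_url = "jdbc:sqlserver://10.0.0.5:1433;databaseName=<database>;encrypt=true;trustServerCertificate=true"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.<table>")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
display(df)
```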
Brad
by Contributor II
  • 866 Views
  • 1 replies
  • 0 kudos

Why are there many offsets in the checkpoint

Hi team, I'm using trigger=availableNow to read a Delta table daily. The Delta table itself is loaded by Structured Streaming from Kinesis. I noticed there are many offsets under the checkpoint, and when the job starts to run to get data from the Delta table...

Latest Reply
Rishabh-Pandey
Esteemed Contributor
  • 0 kudos

@Brad  When you see the batch IDs listed in the logs (e.g., 186, 187, 188,...), these correspond to the batches of data that have been processed. Each batch ID represents a specific point in time in the streaming process, where the data was ingested,...

  • 0 kudos
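For reference, a minimal sketch of the daily trigger=availableNow read the thread describes; the source table, checkpoint location, and target table are placeholders. Each run writes a new offset file under the checkpoint, which is the accumulation being asked about:

```python
query = (
    spark.readStream
    .table("catalog.schema.source_table")          # Delta source, placeholder name
    .writeStream
    .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/daily_job")
    .trigger(availableNow=True)                    # process everything available, then stop
    .toTable("catalog.schema.target_table")        # placeholder target
)
query.awaitTermination()
```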
TinasheChinyati
by New Contributor III
  • 22529 Views
  • 6 replies
  • 4 kudos

Is Databricks capable of housing OLTP and OLAP?

Hi data experts. I currently have an OLTP database (Azure SQL DB) that keeps data only for the past 14 days. We use partition switching to achieve that and have an ETL (Azure Data Factory) process that feeds the data warehouse (Azure Synapse Analytics). My requ...

Latest Reply
Ben_dHont
New Contributor II
  • 4 kudos

@ChrisCkx and @bsanoop Databricks is currently building OLTP database functionality, which is in private preview. It is a serverless PostgREST database. Documentation can be found here: [EXTERNAL] Online Tables REST - Private Preview Docume...

  • 4 kudos
5 More Replies
DevGeek
by New Contributor
  • 1079 Views
  • 1 replies
  • 0 kudos

Better Alternatives to ReadyAPI for API Testing?

I’m currently using ReadyAPI, mainly for API testing and some automation workflows, but I’m considering switching to something else. Has anyone here tried Apidog, Postman, or similar tools? I’m especially interested in how they compare in terms of pe...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi @DevGeek, how are you doing today? Consider trying Postman if you're looking for a robust tool with a wide range of features for API testing and automation. It’s known for its user-friendly interface and handles complex APIs and large datasets well,...

  • 0 kudos
dwalsh
by New Contributor III
  • 1619 Views
  • 2 replies
  • 0 kudos

Resolved! Cannot run ./Includes/Classroom-Setup-01.1 in Advanced Data Engineering with Databricks with 12

I started the Advanced Data Engineering with Databricks course and have tried to run the includes code at the start. We recently had issues with 12.2 and moved to a newer version as there appeared to be some issues around setuptools. If I run "%r...

Latest Reply
jainendrabrown
New Contributor II
  • 0 kudos

I am also having an issue running the first command itself. I am not sure how to download the Classroom-Setup data.

  • 0 kudos
1 More Replies
MadelynM
by Databricks Employee
  • 13013 Views
  • 2 replies
  • 0 kudos

Delta Live Tables + S3 | 5 tips for cloud storage with DLT

You’ve gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it’s time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here’s how it’ll look when you're starting: CREATE OR ...

Latest Reply
waynelxb
New Contributor II
  • 0 kudos

Hi MadelynM, how should we handle source file archival and data retention with DLT? Source file archival: once the data from a source file is loaded with the DLT Auto Loader, we want to move the source file from the source folder to an archival folder. How can we ...

  • 0 kudos
1 More Replies
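The CREATE OR ... statement in the post is truncated, so here is the same one-line-ingest idea as a hedged sketch in the Python DLT API; the S3 path, file format, and table name are placeholders:

```python
import dlt

@dlt.table(name="raw_events")
def raw_events():
    # Auto Loader incrementally picks up new files from cloud storage.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://<bucket>/path/to/raw/")
    )
```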
rkshanmugaraja
by New Contributor
  • 1419 Views
  • 1 replies
  • 0 kudos

Copy files and folders into the Users area, but files are not showing in the UI

Hi all, I'm trying to copy the whole training directory (which contains multiple subfolders and files) from my catalog volume area to each user's area. From: "dbfs:/Volumes/CatalogName/schema/Training" To: "dbfs:/Workspace/Users/username@domain.com/Tr...

Latest Reply
radothede
Valued Contributor II
  • 0 kudos

Hi, do you want to use the DBFS location on purpose, or do you want to upload the training notebooks to the Workspace/Users location? The reason I'm asking is that those are two different locations, although both are related to file management in Databricks. (see:...

  • 0 kudos
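Building on the distinction in the reply above, a heavily hedged sketch of copying from a Unity Catalog volume into a user's workspace folder, assuming a recent runtime where both /Volumes and /Workspace are available as local paths on the driver (note: without the dbfs:/ prefix used in the question):

```python
import shutil

src = "/Volumes/CatalogName/schema/Training"
dst = "/Workspace/Users/username@domain.com/Training"

# Recursively copy the training directory; dirs_exist_ok lets re-runs overwrite.
shutil.copytree(src, dst, dirs_exist_ok=True)
```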
slakshmanan
by New Contributor III
  • 717 Views
  • 1 replies
  • 0 kudos

How to view rows produced from the REST API in Databricks for long-running queries in the Running state

   print(f"Query ID: {query['query_id']} ,Duration: {query['duration']} ms,user :{query['user_display_name']},Query_execute :{query['query_text']},Query_status : {query['status']},rw:{query['rows_produced']}"") U am able to get rows_produced only for...

Latest Reply
slakshmanan
New Contributor III
  • 0 kudos

https://{databricks_instance}.cloud.databricks.com/api/2.0/sql/history/queries?include_metrics=true

  • 0 kudos
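Expanding the reply's URL into a hedged sketch; the host and token are placeholders, and the assumption is that row counts show up under each query's metrics once include_metrics=true is set:

```python
import requests

host = "https://<databricks-instance>.cloud.databricks.com"
token = "<your-databricks-token>"

resp = requests.get(
    f"{host}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {token}"},
    params={"include_metrics": "true", "max_results": 100},
)
resp.raise_for_status()

for query in resp.json().get("res", []):
    metrics = query.get("metrics", {})  # populated because include_metrics=true
    print(query["query_id"], query["status"], metrics.get("rows_produced_count"))
```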
surajitDE
by New Contributor III
  • 640 Views
  • 1 replies
  • 0 kudos

How to stop subsequent iterations in Databricks loop feature?

How to stop subsequent iterations in Databricks loop feature? sys.exit(0) or dbutils.notebook.exit() only marks the current task and series of tasks in sequence as failed, but continues with subsequent iterations.

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @surajitDE, currently there is no out-of-the-box feature to achieve that. What you can do is try to implement notebook logic that, in case of error, will cancel the for-each task run using the REST API or Python SDK: use the /api/2.1/jobs/runs/cancel endpoi...

  • 0 kudos
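Expanding the truncated suggestion above, a hedged sketch of cancelling the surrounding run from inside a task once a stop condition is hit; the host, token, the parent_run_id job parameter (fed from the {{job.run_id}} value reference), and the stop condition itself are all assumptions:

```python
import requests

host = "https://<databricks-instance>"
token = "<your-databricks-token>"

# Assumption: the job passes its own run ID to the task as a parameter named
# parent_run_id, using the {{job.run_id}} dynamic value reference.
run_id = int(dbutils.widgets.get("parent_run_id"))

stop_condition = True  # replace with the notebook's own check

if stop_condition:
    resp = requests.post(
        f"{host}/api/2.1/jobs/runs/cancel",
        headers={"Authorization": f"Bearer {token}"},
        json={"run_id": run_id},
    )
    resp.raise_for_status()
```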
Volker
by Contributor
  • 3037 Views
  • 4 replies
  • 1 kudos

Asset Bundles cannot run a job with a single-node job cluster

Hello community, we are deploying a job using asset bundles and the job should run on a single-node job cluster. Here is the DAB job definition: resources: jobs: example_job: name: example_job tasks: - task_key: main_task ...

Latest Reply
Volker
Contributor
  • 1 kudos

Sorry for the late reply, this helped, thank you! 

  • 1 kudos
3 More Replies
