Data Engineering

Forum Posts

by 829023 • Databricks Partner

01-18-2023 3:57:06 AM

1252 Views
1 reply
0 kudos

Fail to load excel data(timeout) in databricks sample notebook

I'm working with the sample notebook named '1_Customer Lifetimes.py' in https://github.com/databricks-industry-solutions/customer-lifetime-value. In the notebook, there is code like this: `%run "./config/Data Extract"`. This loads Excel data, however it occu...

Latest Reply

daniel_sahal
Databricks MVP

01-18-2023 11:08:39 PM

0 kudos

@Seungsu Lee It could be a destination host issue, a configuration issue, or a network issue. It's hard to guess. First, check whether your cluster has access to the public internet by running this command: `%sh ping -c 2 google.com`
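
If `ping` is unavailable or ICMP is blocked on the cluster, a similar reachability check can be sketched in Python from a notebook cell. This is a minimal sketch and not part of the sample notebook: the helper name `can_reach` and the default port 443 are assumptions made here for illustration.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it tests TCP reachability, not ICMP like ping does.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

Calling `can_reach("google.com")` and getting `False` would point toward a network or egress configuration problem rather than the notebook code itself.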

by Phani1 • Databricks MVP

01-18-2023 12:05:31 AM

4834 Views
1 reply
0 kudos

Parent Hierarchy Queries/ Path Function /Recursive CTE's

Problem statement: We have a scenario where we get the data from the source in the following format (in reality there are 20 levels and more than 4 fields, but for ease of understanding let's consider the example below). The actual code involved 20 levels of 4-5 fields ...

Latest Reply

daniel_sahal
Databricks MVP

01-18-2023 10:54:39 PM

0 kudos

I don't think that we have anything similar as a built-in function. You'll need to write some custom code to achieve that.
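
One shape such custom code could take is an iterative walk up the parent links, building a path string for each node. The sketch below is plain Python under assumed inputs: the `parent_of` mapping, the `build_paths` name, and the `/` separator are hypothetical, since the actual schema from the original post is not shown here.

```python
from typing import Dict, Optional


def build_paths(parent_of: Dict[str, Optional[str]], sep: str = "/") -> Dict[str, str]:
    """Map every node to its root-to-node path, e.g. 'A/B/C'.

    parent_of maps a node to its parent, or to None for a root.
    """
    paths = {}
    for node in parent_of:
        chain = []
        seen = set()
        current = node
        # Walk upward until a root (or an unknown parent) is reached
        while current is not None:
            if current in seen:
                raise ValueError(f"cycle detected at {current!r}")
            seen.add(current)
            chain.append(current)
            current = parent_of.get(current)
        # The chain was collected leaf-first, so reverse it for the path
        paths[node] = sep.join(reversed(chain))
    return paths


example = {"A": None, "B": "A", "C": "B", "D": "A"}
print(build_paths(example))
```

The same idea carries over to a DataFrame by repeatedly self-joining on the parent column until no row changes, which is roughly what a recursive CTE does in engines that support one.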

by 477061 • Contributor

11-24-2022 12:49:26 AM

8802 Views
11 replies
13 kudos

Resolved! Is it possible to use other databases within Delta Live Tables (DLT)?

I have set up a DLT pipeline with "testing" set as the target database. I need to join data that exists in a "keys" table in my "beta" database, but I get an AccessDeniedException, despite having full access to both databases via a normal notebook. A snippet d...

Latest Reply

477061
Contributor

01-18-2023 7:54:07 AM

13 kudos

As an update to this issue: I was running the DLT pipeline on a personal cluster that had an instance profile defined (as per Databricks best practices). As a result, the pipeline did not have permission to access other S3 resources (e.g. other databa...

10 More Replies

by explorer • New Contributor III

01-11-2023 4:10:40 AM

9325 Views
4 replies
3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi fellas - I'm trying to load Parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk-uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

Latest Reply

explorer
New Contributor III

01-18-2023 7:44:11 AM

3 kudos

Hi @Kaniz Fatma, @Daniel Sahal - a few updates from my side. After a lot of trial and error, psycopg2 worked in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory and 4 cores, with Run...

3 More Replies

by 138999 • New Contributor

01-18-2023 3:59:03 AM

1410 Views
1 reply
0 kudos

How are parallel and subsequent jobs handled by cluster?

Hello, apologies for the dumb question, but I'm new to Databricks and need clarification on the following. Are parallel and subsequent jobs able to reuse the same compute resources to keep time and cost overhead as low as possible, or do they spin up a new cl...

Latest Reply

daniel_sahal
Databricks MVP

01-18-2023 4:55:02 AM

0 kudos

@tanja.savic You can use a shared job cluster (https://docs.databricks.com/workflows/jobs/jobs.html#use-shared-job-clusters). But remember that a shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the...

by Ancil • Contributor II

01-17-2023 3:08:23 AM

4245 Views
3 replies
1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for 1 row, but when I try it with more than one row I get the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply

Hubert-Dudek
Databricks MVP

01-17-2023 4:18:21 AM

1 kudos

I tested it, and your function is correct. So the error must be in the inputData type (is it all strings?) or in result_json. Please also check the runtime version; I was using 11 LTS.
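
The length rule in the error message can be illustrated without Spark. A scalar-iterator function receives batches and has to yield, for each input batch, an output batch with the same number of elements. The sketch below is a plain-Python stand-in that uses lists in place of pandas Series; all function names are hypothetical and it is not the poster's UDF.

```python
from typing import Callable, Iterable, Iterator, List


def per_row(batches: Iterable[List[str]]) -> Iterator[List[int]]:
    """Correct shape: one output element per input element."""
    for batch in batches:
        yield [len(value) for value in batch]


def per_batch_bug(batches: Iterable[List[str]]) -> Iterator[List[int]]:
    """Buggy shape: collapses each batch into a single value."""
    for batch in batches:
        yield [sum(len(value) for value in batch)]


def check_lengths(func: Callable, batches: List[List[str]]) -> None:
    """Mimic the length check: output batch size must equal input batch size."""
    for inp, out in zip(batches, func(iter(batches))):
        if len(out) != len(inp):
            raise RuntimeError(
                f"length of output was {len(out)} and length of input was {len(inp)}"
            )


check_lengths(per_row, [["a", "bb"]])    # passes
check_lengths(per_batch_bug, [["a"]])    # also passes: a one-row batch hides the bug
```

A function shaped like `per_batch_bug` passes with a single row and fails as soon as a batch holds two rows, which matches the symptom described in the question.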

2 More Replies

by Brave • New Contributor II

01-13-2023 12:14:46 PM

6480 Views
5 replies
3 kudos

Resolved! Exporting R data frame variable

Hi all. I am trying to export an R data frame variable as a CSV file. I am using this code: `df <- data.frame(VALIDADOR_FIM)` `df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")` But it isn't working. ...

Latest Reply

sher
Valued Contributor II

01-14-2023 1:18:16 AM

3 kudos

Please try to execute write.csv with the following path instead: `write.csv(TotalData, file = '/dbfs/tmp/df.csv', row.names = FALSE)` and then `%fs ls /tmp`

4 More Replies

by Prem1 • New Contributor III

08-10-2022 3:00:57 PM

24228 Views
21 replies
11 kudos

java.lang.IllegalArgumentException: java.net.URISyntaxException

I am using Databricks Auto Loader to load JSON files from ADLS Gen2 incrementally in directory listing mode. All source filenames have a timestamp in them. Auto Loader works perfectly for a couple of days with the configuration below and breaks the next day ...

Latest Reply

jshields
New Contributor II

01-04-2023 6:56:35 AM

11 kudos

Hi everyone, I'm seeing this issue as well - same configuration as the previous posts, using Auto Loader with incremental file listing turned on. The strange part is that it mostly works despite almost all of the files we're loading having colons incl...

20 More Replies

by Sandesh87 • New Contributor III

12-08-2022 12:07:53 PM

6938 Views
4 replies
2 kudos

spark-streaming read from specific event hub partition

The Azure event hub "my_event_hub" has a total of 5 partitions ("0", "1", "2", "3", "4"). The readStream should only read events from partitions "0" and "4". Event hub configuration as streaming source: val name = "my_event_hub" val connectionString = "m...

Latest Reply

keshav
New Contributor II

01-17-2023 9:47:16 AM

2 kudos

I tried using the snippet below to receive messages only from partition id=0:
ehName = "<<EVENT-HUB-NAME>>"
# Create event position for partition 0
positionKey1 = { "ehName": ehName, "partitionId": 0 }
eventPosition1 = { "offset": "@latest", ...

3 More Replies

by databicky • Contributor II

01-17-2023 3:32:24 AM

5581 Views
3 replies
0 kudos

Resolved! how to add the background color to excel sheet by python

I just want to add color to specific cells of an Excel sheet with Python, and I have done that, but I need to exclude the header column. Also, when I tried the same method on another sheet it didn't work. The background color is reflected in one sheet but...

Latest Reply

Hubert-Dudek
Databricks MVP

01-17-2023 4:00:23 AM

0 kudos

Convert your dataframe to pandas on Spark.
Color cells using the style property: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.style.html
Export to Excel using pandas to_excel: https://spark.apache.org/d...

2 More Replies

by alvaro_databric • New Contributor III

01-17-2023 8:06:05 AM

1697 Views
1 reply
0 kudos

Relation between Driver and Executor size

Hi, I would like to ask for recommendations regarding the size of the driver and the number of executors managed by that driver. I am aware of the best practices regarding executor size/number but I have doubts about the number of executors a single dr...

Latest Reply

Hubert-Dudek
Databricks MVP

01-17-2023 8:29:21 AM

0 kudos

It depends on your use case. The best approach is to connect Datadog and look at driver and worker utilization (https://docs.datadoghq.com/integrations/databricks/?tab=driveronly). Just from my experience, usually, for big datasets, when you need to autoscale workers between ...

by alvaro_databric • New Contributor III

01-17-2023 7:09:56 AM

5983 Views
1 reply
1 kudos

Resolved! Task time Spark UI

Hello all, I would like to know why task times (among other times in the Spark UI) display values like 1h or 2h when the task really only takes some seconds or minutes. What is the meaning of these high time values I see all around the Spark UI? Thanks in adv...

Latest Reply

-werners-
Esteemed Contributor III

01-17-2023 7:28:07 AM

1 kudos

That is accumulated time. See https://stackoverflow.com/questions/73302982/task-time-and-gc-time-calculation-in-spark-ui-in-executor-section
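
As a rough illustration of how accumulated time can exceed elapsed time: if the figure sums the durations of all tasks, then many short tasks running in parallel add up to hours. The sketch below assumes that reading of "accumulated"; the function name, the task durations, and the core count are made up for the example.

```python
import heapq
from typing import List, Tuple


def accumulated_and_wall_clock(task_seconds: List[float], cores: int) -> Tuple[float, float]:
    """Return (sum of all task durations, simulated elapsed time).

    Elapsed time is simulated by handing each task to whichever core
    becomes free first, which is a simplification of real scheduling.
    """
    accumulated = sum(task_seconds)
    free_at = [0.0] * cores  # time at which each core becomes idle
    heapq.heapify(free_at)
    for duration in task_seconds:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return accumulated, max(free_at)


# 120 one-minute tasks on 8 cores
total, elapsed = accumulated_and_wall_clock([60] * 120, cores=8)
print(total / 3600, "hours accumulated;", elapsed / 60, "minutes elapsed")
```

Under these assumptions the tasks accumulate 2 hours of task time while the stage itself finishes in 15 minutes.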

by bonyfus • New Contributor II

01-17-2023 5:22:03 AM

3920 Views
3 replies
0 kudos

Error when accessing the file from azure blob storage

I am getting the following error when accessing a file in Azure Blob Storage: java.io.FileNotFoundException: File /10433893690638/mnt/22200/22200Ver1.sps does not exist.
Code:
ves_blob = dbutils.widgets.get("ves_blob")
try:
    dbutils.fs.ls(ves_blob)
e...

Latest Reply

-werners-
Esteemed Contributor III

01-17-2023 5:52:54 AM

0 kudos

That is certainly an invalid path, as the error shows. With `%fs ls /mnt` you can show the directory structure of the /mnt directory, assuming the blob storage is mounted. If not, you need to define the access (URL etc.).

2 More Replies

by lenonlmsv • New Contributor II

01-17-2023 4:58:56 AM

3027 Views
3 replies
0 kudos

Query API Result

Hi, I'm new here. Currently I have to read information from a query in Databricks. I've used the query API to get the query definition, but so far I'm not able to run the query and get the results. Is it possible? Thanks

Latest Reply

daniel_sahal
Databricks MVP

01-17-2023 5:12:36 AM

0 kudos

When using the Jobs API, you need to call dbutils.notebook.exit("returnValue") to pass the results once the notebook has finished its job (https://docs.databricks.com/notebooks/notebook-workflows.html#notebook-workflows-exit). Then you can get notebook_...

2 More Replies

by databicky • Contributor II

01-16-2023 5:29:49 PM

8046 Views
6 replies
1 kudos

Resolved! how to check dataframe column value

My dataframe has a column named count. If that column's value is greater than zero, the job needs to fail. How can I do that?

Latest Reply

Hubert-Dudek
Databricks MVP

01-17-2023 1:44:03 AM

1 kudos

Code that avoids collect(), which should not be used in production:
if df.filter("count > 0").count() > 0:
    dbutils.notebook.exit('Notebook Failed')
You can also use a more aggressive version:
if df.filter("count > 0").count() > 0:
    raise Exception("count bigge...
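
The raising variant can be wrapped in a small helper so the check is testable outside a notebook. This is a sketch that assumes the number of offending rows has already been computed with `df.filter("count > 0").count()`; the helper name `fail_if_positive` and the exception class are hypothetical.

```python
class PositiveCountError(Exception):
    """Raised when rows with count > 0 are present (hypothetical name)."""


def fail_if_positive(offending_rows: int) -> None:
    """Raise if any offending rows were found; otherwise do nothing."""
    if offending_rows > 0:
        raise PositiveCountError(
            f"{offending_rows} row(s) have count > 0; failing the job"
        )


# In a notebook this would be called as (requires a Spark DataFrame `df`):
#   fail_if_positive(df.filter("count > 0").count())
```

An uncaught exception like this should surface as a failed run, and the message records how many rows triggered it.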

5 More Replies
