Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

manugarri
by New Contributor II
  • 18607 Views
  • 11 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements), but the internal dataset is on HDFS and we use Spark ...

Latest Reply
Edthehead
Contributor III
  • 2 kudos

You can refer to this article: Optimizing Large-Scale Fuzzy Matching with Apache Spark and Databricks | by Gavaragirijarani | Medium. As far as open-source libraries go, rapidfuzz is known to be faster than fuzzywuzzy; a sketch of that approach follows below.

10 More Replies
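For reference, a minimal sketch of the broadcast-plus-rapidfuzz approach discussed in this thread, assuming a ~10k-element client list; the names, columns, and scorer choice are illustrative, not from the thread:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from rapidfuzz import process, fuzz

spark = SparkSession.builder.getOrCreate()

# The small client list fits in memory, so broadcast it to every executor
client_names = ["Acme Corp", "Globex LLC"]  # placeholder for the ~10k names
bc_names = spark.sparkContext.broadcast(client_names)

@F.udf(returnType=StructType([
    StructField("match", StringType()),
    StructField("score", FloatType()),
]))
def best_match(name):
    # extractOne returns (choice, score, index), or None if nothing matches
    hit = process.extractOne(name, bc_names.value, scorer=fuzz.token_sort_ratio)
    return (hit[0], float(hit[1])) if hit else None

internal_df = spark.createDataFrame([("ACME Corporation",)], ["company_name"])
internal_df.withColumn("best", best_match("company_name")) \
    .select("company_name", "best.match", "best.score") \
    .show(truncate=False)
```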
Nandini
by New Contributor II
  • 16028 Views
  • 12 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor
  • 7 kudos

If you have a Spark session, you can use Spark's underlying Hadoop FileSystem (a completed sketch follows below):
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

11 More Replies
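A completed, driver-side sketch of the JVM FileSystem approach quoted above; note that spark._jvm and spark._jsc are private APIs that may change between runtimes, and the paths are placeholders:

```
# Get FileSystem from the SparkSession's Hadoop configuration
conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# Path converts string paths to Hadoop FS paths
Path = spark._jvm.org.apache.hadoop.fs.Path
FileUtil = spark._jvm.org.apache.hadoop.fs.FileUtil

src = Path("dbfs:/tmp/source")
dst = Path("dbfs:/tmp/target")

# Copy every file under src to dst (deleteSource=False keeps the originals)
for status in fs.listStatus(src):
    src_file = status.getPath()
    FileUtil.copy(fs, src_file, fs, Path(dst, src_file.getName()), False, conf)
```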
boskicl
by New Contributor III
  • 37255 Views
  • 8 replies
  • 11 kudos

Resolved! Table write command stuck "Filtering files for query."

Hello all. Background: I am having an issue today with Databricks, using pyspark-sql and writing a Delta table. The dataframe is made by doing an inner join between two tables, and that is the table which I am trying to write to a Delta table. The table ...

Latest Reply
nvashisth
New Contributor III
  • 11 kudos

@timo199, @boskicl I had a similar issue where the job was getting stuck at "Filtering files for query" indefinitely. I checked the Spark logs and, based on those, figured out that we had enabled Photon acceleration on our job cluster and the datatype of our columns...

7 More Replies
Kamal2
by New Contributor II
  • 25661 Views
  • 5 replies
  • 7 kudos

Resolved! PDF Parsing in Notebook

I have PDF files stored in Azure ADLS. I want to parse the PDF files into PySpark dataframes. How can I do that?

Latest Reply
Mykola_Melnyk
New Contributor III
  • 7 kudos

The PDF Data Source now works on Databricks. Instructions with an example: https://stabrise.com/blog/spark-pdf-on-databricks/ (a minimal sketch follows below).

4 More Replies
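A sketch based on the linked post, assuming the open-source spark-pdf data source is installed on the cluster; the option names and output columns follow that project's documentation and should be treated as assumptions:

```
# Read a folder of PDFs from ADLS into a page-per-row dataframe
df = (spark.read.format("pdf")
      .option("imageType", "BINARY")   # how page images are materialised
      .option("resolution", "300")     # DPI used when rendering pages
      .load("abfss://container@account.dfs.core.windows.net/pdfs/"))

# The source exposes page-level columns such as the file path and text
df.select("path", "text").show(truncate=False)
```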
jsaddam28
by New Contributor III
  • 56650 Views
  • 25 replies
  • 16 kudos

How to import local python file in notebook?

For example, I have one.py and two.py in Databricks and I want to use a module from one.py in two.py. On my local machine I usually do this with an import statement, like below in two.py: from one import module1 . . . How can I do this in Databricks? ...

Latest Reply
PabloCSD
Valued Contributor II
  • 16 kudos

This alternative worked for us: https://community.databricks.com/t5/data-engineering/is-it-possible-to-import-functions-from-a-module-in-workspace/td-p/5199. A minimal sys.path sketch follows below.

24 More Replies
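A minimal sketch of the common workaround: put the folder containing one.py on sys.path (in Repos and workspace files, the notebook's own folder is often already importable); the path is a placeholder:

```
import sys

# Make the folder that holds one.py importable
sys.path.append("/Workspace/Users/someone@example.com/my_project")

from one import module1  # now works just like on a local machine
```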
vanshikagupta
by New Contributor II
  • 8902 Views
  • 3 replies
  • 0 kudos

Conversion of code from Scala to Python

Does Databricks Community Edition provide Databricks ML visualization for PySpark, the same as provided at this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me convert this lin...

Latest Reply
thelogicplus
Contributor
  • 0 kudos

You may explore the tools and services from Travinto Technologies; they have very good tools. We explored their tool for our code conversion from Informatica, DataStage, and Ab Initio to Databricks and PySpark. We also used it for SQL queries, stored ...

2 More Replies
RantoB
by Valued Contributor
  • 28395 Views
  • 8 replies
  • 7 kudos

Resolved! Read CSV directly from URL with PySpark

I would like to load a CSV file directly into a Spark dataframe in Databricks. I tried the following code:
url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_fo...

Latest Reply
anwangari
New Contributor II
  • 7 kudos

Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the sc method no longer works. Also, working with volumes within "/databricks/driver/" is not supported in Apache Spark. ALTERNATIVE SOLUTION (sketched below): use requests to download the file fr...

7 More Replies
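A sketch of that download-then-read workaround; the volume path is a placeholder for any location both Python and Spark can reach, and the semicolon delimiter is an assumption about this dataset:

```
import requests

url = ("https://opendata.reseaux-energies.fr/explore/dataset/"
       "eco2mix-national-tr/download/?format=csv")
local_path = "/Volumes/main/default/tmp/eco2mix.csv"

# Download with requests, then let Spark read the local copy
with open(local_path, "wb") as f:
    f.write(requests.get(url, timeout=60).content)

df = (spark.read
      .option("header", True)
      .option("sep", ";")  # assumption: this dataset is semicolon-delimited
      .csv(local_path))
```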
RateVan
by New Contributor II
  • 4138 Views
  • 4 replies
  • 0 kudos

Spark last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...

Latest Reply
Dtank
New Contributor II
  • 0 kudos

Do you have any solution for this?

3 More Replies
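A minimal sketch reproducing the behaviour described in the post, with the rate source standing in for the real stream; in append mode the final window is emitted only once a later event advances the watermark past the window's end:

```
from pyspark.sql import functions as F

events = (spark.readStream
          .format("rate")              # stand-in for the real source
          .option("rowsPerSecond", 10)
          .load())

agg = (events
       .withWatermark("timestamp", "1 minute")
       .groupBy(F.window("timestamp", "5 minutes"))
       .count())

# If the input stops, the last window never flushes in append mode;
# outputMode("update") emits partial results instead.
query = (agg.writeStream
         .outputMode("append")
         .format("console")
         .start())
```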
avnish26
by New Contributor III
  • 13862 Views
  • 5 replies
  • 9 kudos

Spark 3.3.0 Kafka connection problem

I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below:
```
from pyspark.sql.functions import *
from pyspark.sql.types import *
# import libr...
```

Latest Reply
jose_gonzalez
Databricks Employee
  • 9 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.

4 More Replies
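For reference, the usual fix outside Databricks is to pull in the Kafka connector that matches the Spark and Scala versions; on a Databricks cluster you would attach the same Maven coordinate through the Libraries UI instead. The broker and topic below are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-test")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
```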
William_Scardua
by Valued Contributor
  • 3254 Views
  • 1 reply
  • 3 kudos

How to use Pylint to check your PySpark code quality?

Hi guys, I would like to use Pylint to check my PySpark scripts. Do you do that? Thank you!

Latest Reply
developer_lumo
New Contributor II
  • 3 kudos

Currently I am working in Databricks notebooks and have the same issue: I am unable to find a linter that is well integrated with Python, PySpark, and Databricks notebooks.

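A sketch of driving Pylint programmatically, which works from a CI step or a notebook cell once the sources are plain .py files; the file name and the disabled check are illustrative:

```
from pylint.lint import Run

# exit=False returns control instead of calling sys.exit()
run = Run(
    ["my_pipeline.py", "--disable=missing-module-docstring"],
    exit=False,
)
# Overall score out of 10 (LinterStats attribute in recent Pylint versions)
print(run.linter.stats.global_note)
```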
Arpi
by New Contributor II
  • 4892 Views
  • 4 replies
  • 4 kudos

Resolved! Database creation error

I am trying to create a database with an external abfss location but am facing the error below. AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs....

Latest Reply
source2sea
Contributor
  • 4 kudos

Changing the OAuth authentication configuration to the CLUSTER level helped me solve the problem (the config keys involved are sketched below). I wish the notebook AI bot could tell me the solution. Before the change, my configuration was at the notebook level, and it gave the error below: AnalysisException: org.apac...

3 More Replies
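For reference, a sketch of the service-principal OAuth keys involved; per the reply above they belong in the cluster's Spark config rather than in the notebook, and the spark.conf.set form is shown only to spell the keys out. The account name, secret scope, and tenant ID are placeholders:

```
storage = "mystorageaccount"
spark.conf.set(f"fs.azure.account.auth.type.{storage}.dfs.core.windows.net",
               "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```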
DJey
by New Contributor III
  • 21536 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi all, I have a scenario where my existing Delta table looks like below. Now I have incremental data with an additional column, owner (dataframe name: scdDF). Below is the code snippet to merge the incremental dataframe into targetTable, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or the Delta table APIs (a runnable sketch follows below):
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN I...

5 More Replies
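A runnable sketch completing the truncated statement above (Databricks Runtime 15.2+); registering the incremental dataframe as a view lets the SQL reference it, and the join key is a placeholder:

```
scdDF.createOrReplaceTempView("scd_source")

spark.sql("""
    MERGE WITH SCHEMA EVOLUTION INTO targetTable AS target
    USING scd_source AS source
    ON source.key = target.key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```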
weldermartins
by Honored Contributor
  • 10228 Views
  • 6 replies
  • 10 kudos

Resolved! Spark - API Jira

Hello guys. I use PySpark in my daily work. A demand has arisen to collect information from Jira. I was able to do this via Talend ESB, but I wouldn't want to use different tools to get the job done. Do you have any example of how to extract data from ...

Latest Reply
Marty73
New Contributor II
  • 10 kudos

Hi, there is also a new Databricks for Jira add-on on the Atlassian Marketplace. It is easy to set up, and exports are created directly within Jira. They can be one-time, scheduled, or real-time. It can also export additional Jira data such as Assets, C...

5 More Replies
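Answering the original question, a sketch of pulling issues straight from Jira's REST search endpoint with requests and landing them in a Spark dataframe; the site URL, JQL, and credentials are placeholders:

```
import requests

resp = requests.get(
    "https://your-domain.atlassian.net/rest/api/2/search",
    params={"jql": "project = DEMO", "maxResults": 100},
    auth=("user@example.com", "<api-token>"),
    timeout=30,
)
resp.raise_for_status()
issues = resp.json()["issues"]

# Flatten the fields of interest into rows
rows = [(i["key"], i["fields"]["summary"], i["fields"]["status"]["name"])
        for i in issues]
df = spark.createDataFrame(rows, ["key", "summary", "status"])
df.show(truncate=False)
```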
greyfine
by New Contributor II
  • 13894 Views
  • 5 replies
  • 5 kudos

Hi everyone, I was wondering if it is possible to set up query-level alerts for PySpark notebooks that run on a schedule in Databricks, so that if we get some expected result from them we can receive a mail alert?

Above you can see we have three workspaces. We have the alert option available in the SQL workspace but not in our Data Science and Engineering workspace. Is there any way we can incorporate this into our DS and Engineering workspace?

Latest Reply
JKR
Contributor
  • 5 kudos

How can I receive a call on Teams/phone/Slack if any job fails?

4 More Replies
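On the follow-up about failure notifications: a sketch of attaching them to an existing job via the Jobs 2.1 API; the workspace URL, token, job ID, and destination ID are placeholders, and Teams/Slack destinations are created beforehand under the workspace's notification destination settings:

```
import requests

host = "https://<your-workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

requests.post(
    f"{host}/api/2.1/jobs/update",
    headers=headers,
    json={
        "job_id": 123,
        "new_settings": {
            "email_notifications": {"on_failure": ["me@example.com"]},
            "webhook_notifications": {
                "on_failure": [{"id": "<notification-destination-id>"}]
            },
        },
    },
    timeout=30,
)
```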
Mado
by Valued Contributor II
  • 18540 Views
  • 4 replies
  • 3 kudos

Resolved! Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results. Sample dataset:
# Prepare Data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304...

Latest Reply
lukeoz
New Contributor III
  • 3 kudos

You can also use backticks around the column names that would otherwise be recognised as numbers (a runnable version follows below):
from pyspark.sql import functions as F

unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("C...

3 More Replies
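A self-contained version of the fix above, with the year columns assumed from the thread's sample data:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]
df = spark.createDataFrame(data, ["Country", "2018", "2019", "2020"])

# Backticks stop Spark from parsing the column names as numeric literals
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
df.select("Country", F.expr(unpivotExpr)).show()
```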