Data Engineering

Forum Posts

Sorted by:

by PunithRaj • New Contributor

12-15-2022 6:24:42 AM

6652 Views
2 replies
2 kudos

How to read a PDF file from Azure Datalake blob storage to Databricks

I have a scenario where I need to read a pdf file from "Azure Datalake blob storage to Databricks", where connection is done through AD access.Generating the SAS token has been restricted in our environment due to security issues. The below script ca...

Data Engineering

6652 Views
2 replies
2 kudos

12-15-2022 6:24:42 AM

View Replies

Latest Reply

Mykola_Melnyk
New Contributor III

04-15-2025 7:30:50 AM

2 kudos

@PunithRaj You can try to use PDF DataSource for Apache Spark for read pdf files directly to the DataFrame. So you will have extracted text and rendered page as image in output. More details here: https://stabrise.com/spark-pdf/df = spark.read.forma...

2 kudos

04-15-2025 7:30:50 AM

1 More Replies

by magnus778 • New Contributor III

11-22-2022 7:50:42 AM

3226 Views
2 replies
4 kudos

Resolved! Error writing parquet to specific container in Azure Data Lake

I'm retrieving two files from container1, transforming them and merging before writing to a container2 within the same Storage Account in Azure. I'm mounting container1, unmouting and mounting countainer2 before writing. My code for writing the parqu...

Data Engineering

3226 Views
2 replies
4 kudos

11-22-2022 7:50:42 AM

View Replies

Latest Reply

Pat
Esteemed Contributor

11-22-2022 10:06:35 AM

4 kudos

Hi @Magnus Asperud ,1 mounting container12 you should persist the data somewhere, creating df doesnt mean that you are reading data from container and have it accessible after unmounting. Make sure to store this merged data somewhere. Not sure if th...

4 kudos

11-22-2022 10:06:35 AM

1 More Replies

by nancy_g • New Contributor III

05-25-2022 8:03:28 AM

6522 Views
4 replies
5 kudos

Resolved! Are Jobs not supported on cluster with Azure Data Lake Storage credential passthrough enabled cluster?

Data Engineering

6522 Views
4 replies
5 kudos

05-25-2022 8:03:28 AM

View Replies

Latest Reply

Rostislaw
New Contributor III

09-16-2022 1:40:33 AM

5 kudos

Right now the feature seems to be public available. It is possible to schedule jobs with ADLS passthough enabled and do not have to provide service principal credentials.However I ask myself how that works behind the scenses. When working interactive...

5 kudos

09-16-2022 1:40:33 AM

3 More Replies

by KamKam • New Contributor

04-26-2022 1:52:28 AM

1975 Views
2 replies
0 kudos

How to write to a folder in a Azure Data Lake container using Delta?

Hi All,How to write to a folder in a Azure Data Lake container using Delta?When I run:write_mode = 'overwrite' write_format = 'delta' save_path = '/mnt/container-name/folder-name' df.write \ .mode(write_mode) \ .format(write_format) \ ....

Data Engineering

1975 Views
2 replies
0 kudos

04-26-2022 1:52:28 AM

View Replies

Latest Reply

jose_gonzalez
Databricks Employee

06-01-2022 5:07:28 PM

0 kudos

Hi @Kamalen Reddy ,Could you share the error message please?

0 kudos

06-01-2022 5:07:28 PM

1 More Replies

by MarcJustice • New Contributor

04-05-2022 6:15:04 PM

2179 Views
2 replies
3 kudos

Is the promise of a data lake simply about data science, data analytics and data quality or can it also be an integral part of core transaction processing also?

Upfront, I want to let you know that I'm not a veteran data jockey, so I apologize if this topic has been covered already or is simply too basic or narrow for this community. That said, I do need help so please feel free to point me in another direc...

Data Engineering

2179 Views
2 replies
3 kudos

04-05-2022 6:15:04 PM

View Replies

Latest Reply

Aashita
Databricks Employee

04-05-2022 8:37:49 PM

3 kudos

@Marc Barnett , Databricks’ Lakehouse architecture is the ideal data architecture for data-driven organizations. It combines the best qualities of data warehouses and data lakes to provide a single solution for all major data workloads and supports ...

3 kudos

04-05-2022 8:37:49 PM

1 More Replies

by Development • New Contributor III

04-12-2022 11:25:00 PM

6894 Views
5 replies
5 kudos

Delta Table with 130 columns taking time

Hi All,We are facing one un-usual issue while loading data into Delta table using Spark SQL. We have one delta table which have around 135 columns and also having PARTITIONED BY. For this trying to load 15 millions of data volume but its not loading ...

Data Engineering

6894 Views
5 replies
5 kudos

04-12-2022 11:25:00 PM

View Replies

Latest Reply

Development
New Contributor III

04-27-2022 8:27:46 AM

5 kudos

@Kaniz Fatma @Parker Temple I found an root cause its because of serialization. we are using UDF to drive an column on dataframe, when we are trying to load data into delta table or write data into parquet file we are facing serialization issue ....

5 kudos

04-27-2022 8:27:46 AM

4 More Replies

by Bhanu1 • New Contributor III

03-14-2022 2:51:46 PM

5645 Views
3 replies
6 kudos

Resolved! Is it possible to mount different Azure Storage Accounts for different clusters in the same workspace?

We have a development and a production data lake. Is it possible to have a production or development cluster access only respective mounts using init scripts?

Data Engineering

5645 Views
3 replies
6 kudos

03-14-2022 2:51:46 PM

View Replies

Latest Reply

Hubert-Dudek
Esteemed Contributor III

03-14-2022 3:25:54 PM

6 kudos

Yes it is possible. Additionally mount is permanent and done in dbfs so it is enough to run it one time. you can have for example following configuration:In Azure you can have 2 databricks workspace,cluster in every workspace can have env variable is...

6 kudos

03-14-2022 3:25:54 PM

2 More Replies

by hetadesai • New Contributor II

01-18-2022 6:10:19 AM

7358 Views
1 replies
3 kudos

How to download zip file from SFTP location and put that file into Azure Data Lake and unzip there ?

I have zip file on SFTP location. I want to copy that file from SFTP location and put it into Azure Data lake and want to unzip there using spark notebook. Please help me to solve this.

Data Engineering

7358 Views
1 replies
3 kudos

01-18-2022 6:10:19 AM

View Replies

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-18-2022 12:42:03 PM

3 kudos

I would go with @Kaniz Fatma approach and download data in Data Factory and after is downloaded on success trigger databricks spark notebook. With spark you can read also compressed data so maybe you will not need to do even separate unzip.

3 kudos

01-18-2022 12:42:03 PM

by Hubert-Dudek • Esteemed Contributor III

11-15-2021 3:48:22 AM

3250 Views
2 replies
13 kudos

Resolved! something like AWS Macie to perform scans on Azure Data Lake

Does anyone know alternative for AWS Macie in Azure?AWS Macie scan S3 buckets for files with sensitive data (personal address, credit card etc...).I would like to use the same style ready scanner for Azure Data Lake.

Data Engineering

3250 Views
2 replies
13 kudos

11-15-2021 3:48:22 AM

View Replies

Latest Reply

Hubert-Dudek
Esteemed Contributor III

11-15-2021 4:58:17 AM

13 kudos

thank you, I checked and yes it is definitely the way to go

13 kudos

11-15-2021 4:58:17 AM

1 More Replies

by FMendez • New Contributor III

09-10-2021 4:29:06 AM

17951 Views
3 replies
6 kudos

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

I wanted to mount a ADLG2 on databricks and take advantage on the abfss driver which should be better for large analytical workloads (is that even true in the context of DB?).Setting an OAuth is a bit of a pain so I wanted to take the simpler approac...

Data Engineering

17951 Views
3 replies
6 kudos

09-10-2021 4:29:06 AM

View Replies

Latest Reply

User16753724663
Databricks Employee

10-04-2021 2:57:28 AM

6 kudos

Hi @Fernando Mendez ,The below document will help you to mount the ADLS gen2 using abfss:https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.htmlCould you please check if this helps?

6 kudos

10-04-2021 2:57:28 AM

2 More Replies

by SarahDorich • New Contributor II

08-16-2021 12:23:31 PM

4414 Views
3 replies
0 kudos

How to register datasets for Detectron2

I'm trying to run a Detectron2 model in Databricks and cannot figure out how to register my train, val and test datasets. My datasets live in an Azure data lake. I have tried the following with no luck. Any help is appreciated. 1) Specifying full p...

Data Engineering

4414 Views
3 replies
0 kudos

08-16-2021 12:23:31 PM

View Replies

Latest Reply

Thurman
New Contributor II

08-17-2021 9:58:35 PM

0 kudos

0 kudos

08-17-2021 9:58:35 PM

2 More Replies

by User16765131552 • Databricks Employee

06-18-2021 12:39:40 PM

8370 Views
1 replies
0 kudos

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing excel files in Azure data lake (gen 1). They follow filenames follow the same pattern "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc. depending on the date and time. I want to read all the files in th...

Data Engineering

8370 Views
1 replies
0 kudos

06-18-2021 12:39:40 PM

View Replies

Latest Reply

Ryan_Chynoweth
Databricks Employee

06-21-2021 11:33:00 AM

0 kudos

If you are attempting to read all the files in a directory you should be able to use a wild card and filter using the extension. For example: df = (spark .read .format("com.crealytics.spark.excel") .option("header", "True") .option("inferSchema", "tr...

0 kudos

06-21-2021 11:33:00 AM

by microamp • New Contributor II

01-26-2018 2:52:59 AM

14876 Views
12 replies
0 kudos

Azure Data Lake Config Issue: No value for dfs.adls.oauth2.access.token.provider found in conf file.

Hi,I have files hosted on an Azure Data Lake Store which I can connect from Azure Databricks configured as per instructions here.I can read JSON files fine, however, I'm getting the following error when I try to read an Avro file.spark.read.format("c...

Data Engineering

14876 Views
12 replies
0 kudos

01-26-2018 2:52:59 AM

View Replies

Latest Reply

User16301467523
Databricks Employee

06-11-2018 3:46:47 PM

0 kudos

Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options. Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...

0 kudos

06-11-2018 3:46:47 PM

11 More Replies

by juan_perez • New Contributor

08-03-2018 7:00:20 AM

15996 Views
2 replies
0 kudos

Write data Frame into Azure Data Lake Storage

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake. To mount the dat...

Data Engineering

15996 Views
2 replies
0 kudos

08-03-2018 7:00:20 AM

View Replies

Latest Reply

PawanShukla
New Contributor III

09-29-2018 3:36:27 AM

0 kudos

I am new in Azure Data Bricks..and I am trying to write the Data frame in mounted ADLS file. But in below command dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv("/mnt/<mount-name>")

0 kudos

09-29-2018 3:36:27 AM

1 More Replies