Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

PunithRaj
by New Contributor
  • 5407 Views
  • 1 reply
  • 1 kudos

How to read a PDF file from Azure Datalake blob storage to Databricks

I have a scenario where I need to read a pdf file from "Azure Datalake blob storage to Databricks", where connection is done through AD access.Generating the SAS token has been restricted in our environment due to security issues. The below script ca...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III

Hey @Punith raj​, I'm not sure about Azure, but AWS has a service known as Amazon Textract. Please try exploring that one.

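For the Azure side of the original question, here is a minimal sketch of reading a PDF from ADLS Gen2 without SAS tokens, assuming a service principal (AD) with access to the storage account and the third-party pypdf library installed on the cluster; the account, container, secret-scope names, and tenant placeholder below are all hypothetical:

# Authenticate to ADLS Gen2 with a service principal instead of a SAS token.
spark.conf.set("fs.azure.account.auth.type.myaccount.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.myaccount.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-id"))
spark.conf.set("fs.azure.account.oauth2.client.secret.myaccount.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-secret"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.myaccount.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Load the raw PDF bytes with Spark, then parse them on the driver with pypdf.
import io
from pypdf import PdfReader  # third-party; install via %pip install pypdf

raw = (spark.read.format("binaryFile")
       .load("abfss://mycontainer@myaccount.dfs.core.windows.net/docs/report.pdf")
       .select("content").collect()[0][0])
text = "\n".join(page.extract_text() or "" for page in PdfReader(io.BytesIO(raw)).pages)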
magnus778
by New Contributor III
  • 2232 Views
  • 2 replies
  • 4 kudos

Resolved! Error writing parquet to specific container in Azure Data Lake

I'm retrieving two files from container1, transforming them and merging them before writing to container2 within the same Storage Account in Azure. I'm mounting container1, then unmounting it and mounting container2 before writing. My code for writing the parqu...

Latest Reply
Pat
Honored Contributor III

Hi @Magnus Asperud​, (1) mount container1; (2) you should persist the data somewhere: creating a df doesn't mean that you are reading the data from the container and have it accessible after unmounting. Make sure to store this merged data somewhere. Not sure if th...

1 More Replies
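A sketch of the pattern that reply describes: materialize the merged result before unmounting container1, since a DataFrame is lazy and still points at the mounted files. Mount points and paths here are hypothetical:

# Read and merge while container1 is mounted.
df1 = spark.read.parquet("/mnt/container1/file1")
df2 = spark.read.parquet("/mnt/container1/file2")
merged = df1.unionByName(df2)

# Persist to a neutral location (e.g., DBFS) so the data survives unmounting.
merged.write.mode("overwrite").parquet("/tmp/staging/merged")
dbutils.fs.unmount("/mnt/container1")

# Mount container2 (not shown), then write the staged data out.
spark.read.parquet("/tmp/staging/merged") \
    .write.mode("overwrite").parquet("/mnt/container2/output")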
nancy_g
by New Contributor III
  • 4544 Views
  • 4 replies
  • 5 kudos
Latest Reply
Rostislaw
New Contributor III

Right now the feature seems to be publicly available. It is possible to schedule jobs with ADLS passthrough enabled without having to provide service principal credentials. However, I ask myself how that works behind the scenes. When working interactive...

3 More Replies
KamKam
by New Contributor
  • 1320 Views
  • 2 replies
  • 0 kudos

How to write to a folder in an Azure Data Lake container using Delta?

Hi All, how do I write to a folder in an Azure Data Lake container using Delta? When I run:
write_mode = 'overwrite'
write_format = 'delta'
save_path = '/mnt/container-name/folder-name'
df.write \
    .mode(write_mode) \
    .format(write_format) \
    ....

Latest Reply
jose_gonzalez
Databricks Employee

Hi @Kamalen Reddy​, could you share the error message please?

1 More Replies
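For reference, a completed version of the truncated snippet in the question would look something like this; the .save(save_path) call is the piece the snippet cuts off before, and the mount path is the poster's own placeholder:

write_mode = "overwrite"
write_format = "delta"
save_path = "/mnt/container-name/folder-name"

# Write the DataFrame as a Delta table into the mounted folder.
(df.write
   .mode(write_mode)
   .format(write_format)
   .save(save_path))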
MarcJustice
by New Contributor
  • 1621 Views
  • 2 replies
  • 3 kudos

Is the promise of a data lake simply about data science, data analytics, and data quality, or can it also be an integral part of core transaction processing?

Upfront, I want to let you know that I'm not a veteran data jockey, so I apologize if this topic has been covered already or is simply too basic or narrow for this community. That said, I do need help so please feel free to point me in another direc...

Latest Reply
Aashita
Databricks Employee

@Marc Barnett​, Databricks' Lakehouse architecture is the ideal data architecture for data-driven organizations. It combines the best qualities of data warehouses and data lakes to provide a single solution for all major data workloads and supports ...

1 More Replies
Development
by New Contributor III
  • 5213 Views
  • 5 replies
  • 5 kudos

Delta Table with 130 columns taking time

Hi All, we are facing an unusual issue while loading data into a Delta table using Spark SQL. We have one Delta table with around 135 columns that is also PARTITIONED BY. We are trying to load around 15 million records into it, but it's not loading ...

Latest Reply
Development
New Contributor III

@Kaniz Fatma​ @Parker Temple​ I found the root cause: it's serialization. We are using a UDF to derive a column on the dataframe; when we try to load data into the Delta table or write data into a parquet file, we face a serialization issue ....

4 More Replies
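A sketch of the usual fix for that root cause: replace the row-at-a-time Python UDF with built-in Spark SQL functions (or a vectorized pandas UDF) so rows are not serialized through the Python worker one by one. Column and table names below are hypothetical:

from pyspark.sql import functions as F

# Instead of a row-at-a-time Python UDF...
# derive_flag = udf(lambda v: "HIGH" if v > 100 else "LOW")  # serialization-heavy

# ...derive the column with native expressions, which stay in the JVM.
df = df.withColumn(
    "volume_flag",
    F.when(F.col("volume") > 100, F.lit("HIGH")).otherwise(F.lit("LOW")),
)
df.write.format("delta").mode("append").saveAsTable("my_delta_table")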
Bhanu1
by New Contributor III
  • 4486 Views
  • 3 replies
  • 6 kudos

Resolved! Is it possible to mount different Azure Storage Accounts for different clusters in the same workspace?

We have a development and a production data lake. Is it possible to have a production or development cluster access only respective mounts using init scripts?

Latest Reply
Hubert-Dudek
Esteemed Contributor III

Yes, it is possible. Additionally, a mount is permanent and done in DBFS, so it is enough to run it one time. You can have, for example, the following configuration: in Azure you can have 2 Databricks workspaces, and the cluster in every workspace can have an env variable is...

2 More Replies
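A sketch of the configuration that reply outlines: each cluster sets an environment variable (e.g., ENV=dev or ENV=prod in its Spark environment settings) and shared mount logic picks the matching storage account. Account names, secret scopes, and the ENV variable are hypothetical; note that DBFS mounts are shared by all clusters in a workspace, which is why the reply suggests separate workspaces for strict isolation:

import os

env = os.environ.get("ENV", "dev")          # set per cluster in its configuration
account = "devdatalake" if env == "dev" else "proddatalake"

configs = {
    "fs.azure.account.key." + account + ".dfs.core.windows.net":
        dbutils.secrets.get("my-scope", account + "-key"),
}

# Mount only this environment's lake; the mount persists until unmounted.
if not any(m.mountPoint == "/mnt/lake" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"abfss://data@{account}.dfs.core.windows.net/",
        mount_point="/mnt/lake",
        extra_configs=configs,
    )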
hetadesai
by New Contributor II
  • 5943 Views
  • 1 reply
  • 3 kudos

How to download a zip file from an SFTP location, put it into Azure Data Lake, and unzip it there?

I have a zip file in an SFTP location. I want to copy that file from SFTP, put it into Azure Data Lake, and unzip it there using a Spark notebook. Please help me solve this.

Latest Reply
Hubert-Dudek
Esteemed Contributor III

I would go with @Kaniz Fatma​'s approach: download the data in Data Factory and, on success, trigger a Databricks Spark notebook. Spark can also read compressed data, so you may not even need a separate unzip step.

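If you do want to do it all from a notebook rather than Data Factory, here is a rough sketch using the third-party paramiko library (hostnames, credentials, and paths are hypothetical; the /dbfs fuse path gives driver-local access to a mounted lake):

import paramiko
import zipfile

# Fetch the zip from SFTP onto the driver's local disk.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user",
                  password=dbutils.secrets.get("my-scope", "sftp-password"))
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.get("/outbound/data.zip", "/tmp/data.zip")
sftp.close()
transport.close()

# Unzip straight into the mounted Azure Data Lake path via /dbfs.
with zipfile.ZipFile("/tmp/data.zip") as zf:
    zf.extractall("/dbfs/mnt/lake/landing/")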
Hubert-Dudek
by Esteemed Contributor III
  • 2341 Views
  • 2 replies
  • 13 kudos

Resolved! something like AWS Macie to perform scans on Azure Data Lake

Does anyone know an alternative to AWS Macie in Azure? AWS Macie scans S3 buckets for files with sensitive data (personal addresses, credit cards, etc.). I would like a similar ready-made scanner for Azure Data Lake.

Latest Reply
Hubert-Dudek
Esteemed Contributor III

Thank you, I checked and yes, it is definitely the way to go.

1 More Replies
FMendez
by New Contributor III
  • 14515 Views
  • 3 replies
  • 6 kudos

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

I wanted to mount an ADLS Gen2 on Databricks and take advantage of the abfss driver, which should be better for large analytical workloads (is that even true in the context of DB?). Setting up OAuth is a bit of a pain, so I wanted to take the simpler approac...

Latest Reply
User16753724663
Valued Contributor

Hi @Fernando Mendez​, the document below will help you mount ADLS Gen2 using abfss: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.html Could you please check if this helps?

2 More Replies
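The linked page centers on OAuth; for the Shared Key route the question actually asks about, a minimal sketch looks like the following. The account name, container, and secret scope are hypothetical, and keeping the key in a secret scope rather than plain text is strongly advised:

account = "myaccount"  # hypothetical storage account name
configs = {
    f"fs.azure.account.key.{account}.dfs.core.windows.net":
        dbutils.secrets.get("my-scope", "storage-account-key"),
}

dbutils.fs.mount(
    source=f"abfss://mycontainer@{account}.dfs.core.windows.net/",
    mount_point="/mnt/mycontainer",
    extra_configs=configs,
)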
SarahDorich
by New Contributor II
  • 3316 Views
  • 3 replies
  • 0 kudos

How to register datasets for Detectron2

I'm trying to run a Detectron2 model in Databricks and cannot figure out how to register my train, val, and test datasets. My datasets live in an Azure data lake. I have tried the following with no luck; any help is appreciated. 1) Specifying full p...

Latest Reply
Thurman
New Contributor II

Register your dataset. Optionally, register metadata for your dataset.

2 More Replies
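For COCO-style annotations, Detectron2 ships a helper for exactly this registration step; the key on Databricks is to hand it driver-local /dbfs/... paths rather than abfss:// URIs, since Detectron2 reads with ordinary file IO. The paths below are hypothetical:

from detectron2.data.datasets import register_coco_instances

# Register train/val/test against the mounted lake via the /dbfs fuse path.
for split in ["train", "val", "test"]:
    register_coco_instances(
        f"my_dataset_{split}",
        {},  # extra metadata; can stay empty
        f"/dbfs/mnt/lake/detectron/{split}/annotations.json",
        f"/dbfs/mnt/lake/detectron/{split}/images",
    )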
User16765131552
by Contributor III
  • 7002 Views
  • 1 reply
  • 0 kudos

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing Excel files in Azure Data Lake (Gen1). The filenames follow the same pattern, "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in th...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor

If you are attempting to read all the files in a directory, you should be able to use a wildcard and filter using the extension. For example:
df = (spark
    .read
    .format("com.crealytics.spark.excel")
    .option("header", "True")
    .option("inferSchema", "tr...

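A completed version of that snippet might look like the one below, assuming the com.crealytics.spark.excel library is attached to the cluster and the installed version supports glob patterns in the load path (newer releases do); the mount path and pattern are hypothetical:

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/mnt/lake/usage/*_Usage_Dataset*"))  # glob matches all dated files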
microamp
by New Contributor II
  • 12821 Views
  • 12 replies
  • 0 kudos

Azure Data Lake Config Issue: No value for dfs.adls.oauth2.access.token.provider found in conf file.

Hi, I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here. I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file: spark.read.format("c...

Latest Reply
User16301467523
New Contributor II

Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options. Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...

11 More Replies
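A sketch of what that looks like in practice: mirror the ADLS Gen1 OAuth settings at the Hadoop configuration level so RDD-based readers like spark-avro can see them. The secret-scope names and tenant placeholder are hypothetical, and _jsc.hadoopConfiguration() is the usual (semi-private) PySpark handle to that configuration:

# Set the ADLS Gen1 OAuth properties on the Hadoop configuration,
# not just on spark.conf, so the RDD-based Avro reader picks them up.
hc = spark.sparkContext._jsc.hadoopConfiguration()
hc.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hc.set("dfs.adls.oauth2.client.id", dbutils.secrets.get("my-scope", "sp-client-id"))
hc.set("dfs.adls.oauth2.credential", dbutils.secrets.get("my-scope", "sp-credential"))
hc.set("dfs.adls.oauth2.refresh.url",
       "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = (spark.read.format("com.databricks.spark.avro")
      .load("adl://myadls.azuredatalakestore.net/path/file.avro"))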
juan_perez
by New Contributor
  • 14209 Views
  • 2 replies
  • 0 kudos

Write data Frame into Azure Data Lake Storage

I am manipulating some data using Azure Databricks. The data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data, I would like to write it back into my data lake. To mount the dat...

Latest Reply
PawanShukla
New Contributor III

I am new to Azure Databricks, and I am trying to write the data frame to a mounted ADLS location with the command below:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header","true").csv("/mnt/<mount-name>")

1 More Replies
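One likely issue with that command: it mixes the legacy com.databricks.spark.csv format with the built-in .csv() writer. On recent runtimes the built-in writer alone is enough; a minimal sketch (the mount placeholder is the poster's, and the output subfolder is hypothetical):

# Write the DataFrame as CSV to the mounted lake; output is a folder of part files.
(dfGPS.write
    .mode("overwrite")
    .option("header", "true")
    .csv("/mnt/<mount-name>/gps_output"))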