Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16790091296
by Contributor II
  • 913 Views
  • 1 replies
  • 0 kudos

How to efficiently read the data lake files' metadata?

I want to read the last modified datetime of the files in the data lake in a Databricks script. If I could read it efficiently as a column when reading data from the data lake, it would be perfect. Thank you :)

Latest Reply
KrunalMedapara
New Contributor II
  • 0 kudos

Efficiently reading data lake files involves:
  • Choosing the right tools: select tools optimized for data lake file formats (e.g., Parquet, ORC) and distributed computing frameworks (e.g., Apache Spark, Apache Flink).
  • Partitioning and indexing: Partition...
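
For the original question of surfacing the last-modified datetime as a column, a minimal sketch follows, assuming a Spark/Databricks runtime recent enough to expose the hidden _metadata column on file-based reads; the path and format below are placeholders, not details from the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read files from the lake and surface Spark's per-file metadata,
    # including the last-modified timestamp, as ordinary columns.
    df = (
        spark.read.format("parquet")
        .load("/mnt/datalake/raw/events")            # placeholder path
        .select(
            "*",
            "_metadata.file_path",
            "_metadata.file_modification_time",      # last modified datetime per source file
        )
    )

    df.show(truncate=False)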

gtaspark
by New Contributor II
  • 50028 Views
  • 9 replies
  • 5 kudos

Resolved! How to get the total directory size using dbutils

Is there a way to get the directory size in ADLS (Gen2) using dbutils in Databricks? If I run dbutils.fs.ls("/mnt/abc/xyz") I get the file sizes inside the xyz folder (there are about 5000 files). I want to get the size of the xyz folder how ca...

Latest Reply
User16788316720
New Contributor III
  • 5 kudos

File size is only specified for files. So, if you specify a directory as your source, you have to iterate through the directory. The below snippet should work (and should be faster than the other solutions): import glob def get_directory_size_in_byt...
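
The snippet in the reply is cut off above; a self-contained sketch of the same recursive idea using only dbutils.fs.ls (so it runs only inside a Databricks notebook or job) might look like this. The helper name is illustrative; the path comes from the question.

    def get_directory_size_in_bytes(path: str) -> int:
        """Recursively sum file sizes under a path listed with dbutils.fs.ls."""
        total = 0
        for entry in dbutils.fs.ls(path):
            if entry.name.endswith("/"):       # directories are listed with a trailing slash
                total += get_directory_size_in_bytes(entry.path)
            else:
                total += entry.size
        return total

    size_gb = get_directory_size_in_bytes("/mnt/abc/xyz") / (1024 ** 3)
    print(f"Directory size: {size_gb:.2f} GB")

For very deep folder trees, an iterative version with an explicit queue avoids Python's recursion limit.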

8 More Replies
Anotech
by New Contributor II
  • 6894 Views
  • 2 replies
  • 1 kudos

How can I fix this error? ExecutionError: An error occurred while calling o392.mount: java.lang.NullPointerException

Hello, I'm trying to mount my Databricks to my Azure gen 2 data lake to read in data from the container, but I get an error when executing this line of code: dbutils.fs.mount( source = "abfss://resumes@choisysresume.dfs.core.windows.net/", mount_poin...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

I checked it with my mount script and it is exactly the same, except that I do not put a '/' after dfs.core.windows.net. You might want to try that. Also, is Unity Catalog enabled? Because Unity Catalog does not allow mounts.
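
For reference, a typical ABFSS mount with a service principal looks roughly like the following. The config keys follow the standard ADLS Gen2 OAuth pattern; the secret scope, IDs, and mount point are placeholders, the container and account names come from the question, and note the missing trailing '/' after dfs.core.windows.net as suggested above.

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # No trailing '/' after dfs.core.windows.net.
    dbutils.fs.mount(
        source="abfss://resumes@choisysresume.dfs.core.windows.net",
        mount_point="/mnt/resumes",
        extra_configs=configs,
    )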

1 More Replies
g96g
by New Contributor III
  • 1915 Views
  • 3 replies
  • 0 kudos

data is not written back to data lake

I have this strange case where data is not written back to the data lake. I have 3 containers: Bronze, Silver, and Gold. I have done the mounting and have no problem reading the source data and writing it to the Bronze layer (using the Hive metastore catalog). T...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Givi Salu, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we ...

2 More Replies
Rishabh-Pandey
by Esteemed Contributor
  • 1076 Views
  • 0 replies
  • 3 kudos

Hey there! I've noticed that many people seem to be confused about the differences between databases, data warehouses, and data lakes. It's un...

Hey there! I've noticed that many people seem to be confused about the differences between databases, data warehouses, and data lakes. It's understandable, as these terms can be easily misunderstood or used interchangeably. Here is the summary for all ...

JesseS
by New Contributor II
  • 5243 Views
  • 2 replies
  • 1 kudos

Resolved! How to extract source data from on-premise databases into a data lake and load with AutoLoader?

Here is the situation I am working with. I am trying to extract source data with the Databricks JDBC connector, using SQL Server databases as my data source. I want to write those into a directory in my data lake as JSON files, then have AutoLoader ing...

Latest Reply
Aashita
Databricks Employee
  • 1 kudos

To add to @werners' point, I would use ADF to load SQL Server data into ADLS Gen2 as JSON. Then load these raw JSON files from your ADLS base location into a Delta table using Auto Loader. Delta Live Tables can be used in this scenario. You can also reg...
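
A minimal Auto Loader sketch for the second half of that flow, assuming the JSON files have already landed in a raw folder; the paths, storage account, and table name below are placeholders, not values from the thread.

    raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/sqlserver/orders"
    checkpoint_path = "/mnt/checkpoints/orders_bronze"

    stream = (
        spark.readStream.format("cloudFiles")                    # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)    # schema inference and evolution
        .load(raw_path)
    )

    (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                              # incremental, batch-style run
        .toTable("bronze.orders")
    )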

1 More Replies
DB_developer
by New Contributor III
  • 9173 Views
  • 2 replies
  • 3 kudos

How to optimize storage for sparse data in data lake?

I have a lot of tables with 80% of the columns filled with nulls. I understand SQL Server provides a way to handle this kind of data in the table definition (with the SPARSE keyword). Do data lakes provide something similar?

Latest Reply
-werners-
Esteemed Contributor III
  • 3 kudos

The data lake itself does not, but the file format you use to store the data does. E.g. Parquet uses column compression, so sparse data will compress pretty well. CSV, on the other hand: a total disaster.
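
A quick, illustrative way to see this on synthetic data (paths are placeholders): write the same mostly-null column once as Parquet and once as CSV, then compare the folder sizes, e.g. with the directory-size helper earlier on this page.

    from pyspark.sql import functions as F

    # Roughly 80% of the values in "mostly_null" end up null.
    df = spark.range(1_000_000).withColumn(
        "mostly_null",
        F.when(F.rand() < 0.2, F.col("id").cast("string")),
    )

    df.write.mode("overwrite").parquet("/tmp/sparse_parquet")    # column compression handles nulls well
    df.write.mode("overwrite").option("header", True).csv("/tmp/sparse_csv")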

1 More Replies
DB_developer
by New Contributor III
  • 1416 Views
  • 3 replies
  • 0 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

There is no single answer to this. If you look at Parquet, which is a very common format on data lakes: https://parquet.apache.org/docs/file-format/nulls/ and on SO

2 More Replies
Paully
by New Contributor
  • 1024 Views
  • 0 replies
  • 0 kudos

Overwrite still saves numerous parquet files in storage container

I inherited this environment, and my question is this: we have a job that mines the data lake and creates a table grouped by unit number and their data points. The job runs every 10 minutes. We then connect to that table with a DirectQuery Power BI ...

rt2
by New Contributor III
  • 1365 Views
  • 2 replies
  • 3 kudos

Resolved! Fundamentals of Databricks Lakehouse Badge not received.

I passed the Databricks fundamentals exam and, like many others, I did not receive my badge. I am very much interested in putting this badge on my LinkedIn profile, please help. My email id is: rahul.psit.ec@gmail.com, which Databricks is resolving as: ...

Latest Reply
rt2
New Contributor III
  • 3 kudos

I got the badge now. Thanks.

1 More Replies
Direo
by Contributor
  • 1795 Views
  • 1 replies
  • 5 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

@Direo Direo, yes, you use MERGE syntax for that: https://docs.delta.io/latest/delta-update.html. And it is more efficient than overwriting if you want to update only part of the data, but you need to think about the logic of what to update, so overwriti...
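
A minimal sketch of that MERGE pattern with the Python Delta Lake API; the table path, join key, and the updates_df DataFrame are illustrative, not from the thread.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/silver/customers")

    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()       # update rows that already exist in the target
        .whenNotMatchedInsertAll()    # insert rows that are new
        .execute()
    )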

stramzik
by New Contributor II
  • 1405 Views
  • 1 replies
  • 1 kudos

Unable to mount datalake gen1 to databricks

I was mounting the Data Lake Gen1 to Databricks for accessing and processing files. The below code was working great for the past year, and all of a sudden I'm getting an error: configs = {"df.adl.oauth2.access.token.provider.type": "ClientCredential"...
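
For comparison, the documented ADLS Gen1 mount pattern looks roughly like this (the config key prefix is fs.adl on recent runtimes; the IDs, secret scope, store name, and mount point below are placeholders, not values from the thread):

    configs = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": "<application-id>",
        "fs.adl.oauth2.credential":
            dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
        "fs.adl.oauth2.refresh.url":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="adl://<datalake-store-name>.azuredatalakestore.net/",
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )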

Latest Reply
stramzik
New Contributor II
  • 1 kudos

bumping up the thread
