Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16790091296
by Contributor II
  • 913 Views
  • 1 replies
  • 0 kudos

How to efficiently read the data lake files' metadata?

I want to read the last modified datetime of the files in the data lake in a Databricks script. If I could read it efficiently as a column when reading data from the data lake, it would be perfect. Thank you :)

Latest Reply
KrunalMedapara
New Contributor II
  • 0 kudos

Efficiently reading data lake files involves:
  • Choosing the right tools: select tools optimized for data lake file formats (e.g., Parquet, ORC) and distributed computing frameworks (e.g., Apache Spark, Apache Flink).
  • Partitioning and indexing: Partition...
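
For the original question of surfacing the last-modified datetime as a column, a minimal sketch follows, assuming a Spark/Databricks runtime recent enough to expose the hidden _metadata column on file-based reads; the path and format below are placeholders, not details from the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read files from the lake and surface Spark's per-file metadata,
    # including the last-modified timestamp, as ordinary columns.
    df = (
        spark.read.format("parquet")
        .load("/mnt/datalake/raw/events")            # placeholder path
        .select(
            "*",
            "_metadata.file_path",
            "_metadata.file_modification_time",      # last modified datetime per source file
        )
    )

    df.show(truncate=False)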

gtaspark
by New Contributor II
  • 50028 Views
  • 9 replies
  • 5 kudos

Resolved! How to get the total directory size using dbutils

Is there a way to get the directory size in ADLS (Gen2) using dbutils in Databricks? If I run dbutils.fs.ls("/mnt/abc/xyz") I get the file sizes inside the xyz folder (there are about 5000 files). I want to get the size of the xyz folder how ca...

Latest Reply
User16788316720
New Contributor III
  • 5 kudos

File size is only specified for files. So, if you specify a directory as your source, you have to iterate through the directory. The below snippet should work (and should be faster than the other solutions): import glob def get_directory_size_in_byt...
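
The snippet in the reply is cut off above; a self-contained sketch of the same recursive idea using only dbutils.fs.ls (so it runs only inside a Databricks notebook or job) might look like this. The helper name is illustrative; the path comes from the question.

    def get_directory_size_in_bytes(path: str) -> int:
        """Recursively sum file sizes under a path listed with dbutils.fs.ls."""
        total = 0
        for entry in dbutils.fs.ls(path):
            if entry.name.endswith("/"):       # directories are listed with a trailing slash
                total += get_directory_size_in_bytes(entry.path)
            else:
                total += entry.size
        return total

    size_gb = get_directory_size_in_bytes("/mnt/abc/xyz") / (1024 ** 3)
    print(f"Directory size: {size_gb:.2f} GB")

For very deep folder trees, an iterative version with an explicit queue avoids Python's recursion limit.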

8 More Replies
Anotech
by New Contributor II
  • 6894 Views
  • 2 replies
  • 1 kudos

How can I fix this error? ExecutionError: An error occurred while calling o392.mount: java.lang.NullPointerException

Hello, I'm trying to mount my Databricks to my Azure gen 2 data lake to read in data from the container, but I get an error when executing this line of code: dbutils.fs.mount( source = "abfss://resumes@choisysresume.dfs.core.windows.net/", mount_poin...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

I checked it with my mount script and it is exactly the same, except that I do not put a '/' after dfs.core.windows.net. You might want to try that. Also, is Unity Catalog enabled? Because Unity Catalog does not allow mounts.
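
For reference, a typical ABFSS mount with a service principal looks roughly like the following. The config keys follow the standard ADLS Gen2 OAuth pattern; the secret scope, IDs, and mount point are placeholders, the container and account names come from the question, and note the missing trailing '/' after dfs.core.windows.net as suggested above.

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # No trailing '/' after dfs.core.windows.net.
    dbutils.fs.mount(
        source="abfss://resumes@choisysresume.dfs.core.windows.net",
        mount_point="/mnt/resumes",
        extra_configs=configs,
    )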

1 More Replies
g96g
by New Contributor III
  • 1915 Views
  • 3 replies
  • 0 kudos

data is not written back to data lake

I have this strange case where data is not written back to the data lake. I have 3 containers: Bronze, Silver, and Gold. I have done the mounting and have no problem reading the source data and writing it to the Bronze layer (using the Hive metastore catalog). T...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Givi Salu, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we ...

2 More Replies
Rishabh-Pandey
by Esteemed Contributor
  • 1076 Views
  • 0 replies
  • 3 kudos

Hey there! I've noticed that many people seem to be confused about the differences between databases, data warehouses, and data lakes. It's un...

Hey there! I've noticed that many people seem to be confused about the differences between databases, data warehouses, and data lakes. It's understandable, as these terms can be easily misunderstood or used interchangeably. Here is the summary for all ...

JesseS
by New Contributor II
  • 5243 Views
  • 2 replies
  • 1 kudos

Resolved! How to extract source data from on-premise databases into a data lake and load with AutoLoader?

Here is the situation I am working with. I am trying to extract source data with the Databricks JDBC connector, using SQL Server databases as my data source. I want to write those into a directory in my data lake as JSON files, then have AutoLoader ing...

Latest Reply
Aashita
Databricks Employee
  • 1 kudos

To add to @werners' point, I would use ADF to load SQL Server data into ADLS Gen2 as JSON. Then load these raw JSON files from your ADLS base location into a Delta table using Auto Loader. Delta Live Tables can be used in this scenario. You can also reg...
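
A minimal Auto Loader sketch for the second half of that flow, assuming the JSON files have already landed in a raw folder; the paths, storage account, and table name below are placeholders, not values from the thread.

    raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/sqlserver/orders"
    checkpoint_path = "/mnt/checkpoints/orders_bronze"

    stream = (
        spark.readStream.format("cloudFiles")                    # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)    # schema inference and evolution
        .load(raw_path)
    )

    (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                              # incremental, batch-style run
        .toTable("bronze.orders")
    )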

1 More Replies
DB_developer
by New Contributor III
  • 9173 Views
  • 2 replies
  • 3 kudos

How to optimize storage for sparse data in data lake?

I have a lot of tables with 80% of the columns filled with nulls. I understand SQL Server provides a way to handle this kind of data in the table definition (with the SPARSE keyword). Do data lakes provide something similar?

Latest Reply
-werners-
Esteemed Contributor III
  • 3 kudos

The data lake itself does not, but the file format you use to store the data does. E.g. Parquet uses column compression, so sparse data will compress pretty well. CSV, on the other hand: a total disaster.
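
A quick, illustrative way to see this on synthetic data (paths are placeholders): write the same mostly-null column once as Parquet and once as CSV, then compare the folder sizes, e.g. with the directory-size helper earlier on this page.

    from pyspark.sql import functions as F

    # Roughly 80% of the values in "mostly_null" end up null.
    df = spark.range(1_000_000).withColumn(
        "mostly_null",
        F.when(F.rand() < 0.2, F.col("id").cast("string")),
    )

    df.write.mode("overwrite").parquet("/tmp/sparse_parquet")    # column compression handles nulls well
    df.write.mode("overwrite").option("header", True).csv("/tmp/sparse_csv")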

1 More Replies
DB_developer
by New Contributor III
  • 1416 Views
  • 3 replies
  • 0 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

There is no single answer to this. If you look at Parquet, which is a very common format on data lakes: https://parquet.apache.org/docs/file-format/nulls/ and on SO

2 More Replies
Paully
by New Contributor
  • 1024 Views
  • 0 replies
  • 0 kudos

Overwrite still saves numerous parquet files in storage container

I inherited this environment, and my question is this: we have a job that mines the data lake and creates a table grouped by unit number and their data points. The job runs every 10 minutes. We then connect to that table with a DirectQuery Power BI ...

rt2
by New Contributor III
  • 1365 Views
  • 2 replies
  • 3 kudos

Resolved! Fundamentals of Databricks Lakehouse Badge not received.

I passed the Databricks fundamentals exam and, like many others, I did not receive my badge. I am very much interested in putting this badge on my LinkedIn profile, please help. My email id is: rahul.psit.ec@gmail.com, which Databricks is resolving as: ...

Latest Reply
rt2
New Contributor III
  • 3 kudos

I got the badge now. Thanks.

1 More Replies
Direo
by Contributor
  • 1795 Views
  • 1 replies
  • 5 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

@Direo Direo, yes, you use MERGE syntax for that: https://docs.delta.io/latest/delta-update.html. And it is more efficient than overwriting if you want to update only part of the data, but you need to think about the logic of what to update, so overwriti...
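
A minimal sketch of that MERGE pattern with the Python Delta Lake API; the table path, join key, and the updates_df DataFrame are illustrative, not from the thread.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/silver/customers")

    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()       # update rows that already exist in the target
        .whenNotMatchedInsertAll()    # insert rows that are new
        .execute()
    )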

stramzik
by New Contributor II
  • 1405 Views
  • 1 replies
  • 1 kudos

Unable to mount datalake gen1 to databricks

I was mounting the Data Lake Gen1 to Databricks for accessing and processing files. The below code was working great for the past year, and all of a sudden I'm getting an error: configs = {"df.adl.oauth2.access.token.provider.type": "ClientCredential"...
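
For comparison, the documented ADLS Gen1 mount pattern looks roughly like this (the config key prefix is fs.adl on recent runtimes; the IDs, secret scope, store name, and mount point below are placeholders, not values from the thread):

    configs = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": "<application-id>",
        "fs.adl.oauth2.credential":
            dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
        "fs.adl.oauth2.refresh.url":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="adl://<datalake-store-name>.azuredatalakestore.net/",
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )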

Latest Reply
stramzik
New Contributor II
  • 1 kudos

bumping up the thread
