Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Greg
by New Contributor III
  • 1715 Views
  • 1 reply
  • 4 kudos

How to reduce storage space consumed by delta with many updates

I have one Delta table that I continuously append events into, and a second Delta table that I continuously merge into (streamed from the first table) that has unique IDs whose properties are updated from the events (an ID represents a unique thing that ge...
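A hedged sketch of the Delta maintenance commands typically combined to reclaim space in this situation; the table name and retention window below are illustrative, and note that VACUUM permanently removes files needed for time travel:

# Compact the many small files produced by frequent MERGEs (hypothetical table name).
spark.sql("OPTIMIZE events_by_id")

# Then delete data files no longer referenced by the current table version.
# 168 hours = 7 days; a shorter window reclaims more space but limits time travel.
spark.sql("VACUUM events_by_id RETAIN 168 HOURS")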

Latest Reply
Jb11
New Contributor II

Did you already solve this problem?

vanessafvg
by New Contributor III
  • 2211 Views
  • 1 reply
  • 3 kudos

Extracting data from excel in datalake storage using openpyxl

I am trying to extract some data into Databricks but tripping all over openpyxl; newish user of Databricks..
from openpyxl import load_workbook
directory_id="hidden"
scope="hidden"
client_id="hidden"
service_credential_key="hidden"
container_name="hidden"
s...
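A minimal sketch of one common pattern: openpyxl needs a local-style file path, and on Databricks DBFS-mounted storage is reachable from Python through the /dbfs fuse path. The mount point, file name, and sheet name below are hypothetical:

# Hypothetical: read an .xlsx file from a mounted ADLS container with openpyxl.
from openpyxl import load_workbook

wb = load_workbook("/dbfs/mnt/landing/report.xlsx", read_only=True)
ws = wb["Sheet1"]

# Collect cell values, skipping the header row.
rows = [[cell.value for cell in row] for row in ws.iter_rows(min_row=2)]
print(rows[:5])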

Latest Reply
Anonymous
Not applicable

Hi @Vanessa Van Gelder, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer. Thanks.

sintsan
by New Contributor II
  • 1545 Views
  • 1 reply
  • 1 kudos

Resolved! spark.sparkContext.setCheckpointDir - External Azure Storage

Is it possible to point spark.sparkContext.setCheckpointDir to an external Azure Storage container location (instead of DBFS), and if so, how? There's very little documentation on that.

Latest Reply
-werners-
Esteemed Contributor III

Yes, the directory must be an HDFS-compatible path if running on a cluster. All you need to do is provide the correct path.
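A minimal sketch, assuming access to the storage account is already configured on the cluster (for example via a service principal); the account and container names are hypothetical:

# Point RDD checkpointing at an abfss:// location instead of DBFS.
spark.sparkContext.setCheckpointDir(
    "abfss://checkpoints@mystorageacct.dfs.core.windows.net/spark-checkpoints"
)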

tototox
by New Contributor III
  • 3034 Views
  • 3 replies
  • 2 kudos

dbutils.fs.ls overlaps with managed storage error

I created a schema with that path as a managed location (abfss://~~@~~.dfs.core.windows.net/dejeong/). However, I dropped the schema with the cascade option, also entered the Azure portal and deleted the path directly, and made it again (abfss://~~@~~....

Latest Reply
Anonymous
Not applicable

Hi @jin park, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your...

2 More Replies
William_Scardua
by Valued Contributor
  • 5535 Views
  • 3 replies
  • 4 kudos

How do you structure and store your medallion architecture?

Hi guys, what do you suggest for creating a medallion architecture? How many and which data lake zones, how to store the data, which databases to use for storage, anything. I'm thinking of these zones: 1. landing zone, file storage in /landing_zone - databricks database.bro...

Latest Reply
jose_gonzalez
Databricks Employee

Hi @William Scardua, I would highly recommend using Delta Live Tables (DLT) for your use case. Please check the docs with sample notebooks here: https://docs.databricks.com/workflows/delta-live-tables/index.html
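A minimal DLT sketch of a bronze-to-silver flow along those lines; the landing path, column names, and table names are hypothetical:

# Hypothetical Delta Live Tables pipeline: raw files into bronze, cleaned rows into silver.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from the landing zone")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/landing_zone/events")       # hypothetical landing path
    )

@dlt.table(comment="Cleaned events")
def silver_events():
    return (
        dlt.read_stream("bronze_events")
        .where(F.col("event_id").isNotNull())   # hypothetical column
        .withColumn("ingested_at", F.current_timestamp())
    )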

2 More Replies
bchaubey
by Contributor II
  • 1683 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16764241763
Honored Contributor

@Bhagwan Chaubey Maybe you can give this a try, if this is a Blob Storage account: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?tabs=environment-variable-windows
For Data Lake storage, please try the below: https://do...
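The core pattern from the linked Blob Storage quickstart looks roughly like this; the connection string, container name, and file name are placeholders:

# Hypothetical sketch based on the azure-storage-blob quickstart (pip install azure-storage-blob).
from azure.storage.blob import BlobServiceClient

conn_str = "<your-connection-string>"  # placeholder
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("mycontainer")  # hypothetical container

# Upload a local file as a blob.
with open("data.csv", "rb") as f:
    container.upload_blob(name="data.csv", data=f, overwrite=True)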

al_joe
by Contributor
  • 4050 Views
  • 2 replies
  • 0 kudos

Where / how does DBFS store files?

I tried to use %fs head to print the contents of a CSV file used in a training:
%fs head "/mnt/path/file.csv"
but got an error saying cannot head a directory!? Then I did %fs ls on the same CSV file and got a list of 4 files under a directory named as a ...
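What the listing suggests: a path Spark writes as CSV is actually a directory of part files, so heading the directory fails. A hedged sketch of how to inspect and read it, reusing the path from the question:

# The "file" is likely a directory of part files written by Spark.
display(dbutils.fs.ls("/mnt/path/file.csv"))  # shows part-0000*.csv members

# Read the whole directory as one DataFrame instead of head-ing it.
df = spark.read.option("header", "true").csv("/mnt/path/file.csv")
df.show(5)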

Latest Reply
User16753725182
Databricks Employee

Hi @Al Jo, are you still seeing the error while printing the contents of the CSV file?

1 More Reply
Hubert-Dudek
by Esteemed Contributor III
  • 1531 Views
  • 1 replies
  • 15 kudos

Resolved! Write to Azure Delta Lake - optimization request

The Databricks/Delta team could optimize some commands that write to Azure Blob Storage, as Azure displays this message:

[image: Azure portal message]
Latest Reply
Anonymous
Not applicable

Hey there. Thank you for your suggestion. I'll pass this up to the team.

User16857281869
by New Contributor II
  • 2322 Views
  • 1 reply
  • 1 kudos

Resolved! Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my structured streaming job?

It's usually one or more of the following reasons: 1) If you are streaming into a table, you should be using the .trigger option to specify the frequency of checkpointing. Otherwise, the job will call the storage API every 10 ms to log the transaction data...
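A minimal sketch of point 1, setting an explicit trigger interval so checkpoint and transaction-log writes happen once per interval rather than as often as possible; the source, paths, and table name are toy placeholders:

# Toy streaming source just to make the example self-contained.
df = spark.readStream.format("rate").load()

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/rate_demo")  # hypothetical path
   .trigger(processingTime="1 minute")  # far fewer storage API calls than the default
   .toTable("rate_demo"))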

Latest Reply
Hubert-Dudek
Esteemed Contributor III

Please mount cheaper storage (LRS) to a custom mount and put your checkpoints there; please clear data regularly; if you are using forEach/forEachBatch in a stream, it will save every DataFrame to DBFS; please remember not to use display() in production; if on th...

Greg_Galloway
by New Contributor III
  • 7956 Views
  • 4 replies
  • 3 kudos

Resolved! Use of private endpoints for storage in workspace with EnableNoPublicIP=Yes and VnetInjection=No

We know that Databricks with VNET injection (our own VNET) allows us to connect to ADLS Gen2 over private endpoints. This is what we typically do. We have a customer who created Databricks with EnableNoPublicIP=Yes (secure cluster connectivity) and Vn...

Latest Reply
User16871418122
Contributor III

A managed VNET is locked down and allows very limited config tuning, such as VNET peering, and even that is facilitated by Databricks and needs to be done from the Databricks UI. If they want more control over the VNET, they need to migrate to a VNET-injected workspace.

3 More Replies