Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

William_Scardua
by Valued Contributor
  • 4990 Views
  • 3 replies
  • 4 kudos

How do you structure and store your medallion architecture?

Hi guys, what do you suggest for creating a medallion architecture? How many data lake zones and which ones, how to store the data, which databases to use for storage, anything. I'm thinking of these zones: 1. landing zone, file storage in /landing_zone - databricks database.bro...

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi @William Scardua, I would highly recommend using Delta Live Tables (DLT) for your use case. Please check the docs with sample notebooks here: https://docs.databricks.com/workflows/delta-live-tables/index.html
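For context, a minimal sketch of what a two-layer (bronze/silver) DLT pipeline might look like in Python; the /landing_zone/events path, the JSON format, and the id column are assumptions for illustration, not from the thread:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested from the landing zone")
def bronze_events():
    # Auto Loader incrementally picks up new files dropped in the landing zone
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing_zone/events"))  # hypothetical path

@dlt.table(comment="Cleaned events for the silver layer")
def silver_events():
    # Basic cleanup: drop records missing an id
    return dlt.read_stream("bronze_events").where(col("id").isNotNull())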

2 More Replies
Chris_Shehu
by Valued Contributor III
  • 3788 Views
  • 1 reply
  • 5 kudos

Resolved! Getting errors while following Microsoft Databricks Best-Practices for DevOps Integration

I'm currently trying to follow the Software engineering best practices for notebooks - Azure Databricks guide, but I keep running into the following during step 4.5 (Run the test): ============================= test session starts =======================...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 5 kudos

Closing the loop on this in case anyone gets stuck in the same situation. You can see in the images that transforms_test.py shows a different icon than testdata.csv. This is because it was saved as a Jupyter notebook, not a .py file. When the ...
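For anyone following along, the test module needs to be a plain .py source file. A minimal sketch of such a file (the fixture and test names are hypothetical, not the ones from the guide):

# transforms_test.py -- a plain .py file, not a notebook export
import pytest
from pyspark.sql import SparkSession

@pytest.fixture
def spark():
    # Local or cluster-attached SparkSession for the tests
    return SparkSession.builder.getOrCreate()

def test_row_count(spark):
    df = spark.createDataFrame([(1,), (2,)], ["id"])
    assert df.count() == 2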

140015
by New Contributor III
  • 1235 Views
  • 1 reply
  • 0 kudos

Resolved! Is S3 dbfs mount faster than direct access?

Hi, is there any speed difference between a mounted S3 bucket and direct access when reading/writing Delta tables or other types of files? I tried to find something in the docs but didn't find anything.

Latest Reply
Vivian_Wilfred
Databricks Employee
  • 0 kudos

Hi @Jacek Dembowiak, behind the scenes, mounting an S3 bucket and reading from it works the same way as accessing it directly. Mounts are just metadata; the underlying access mechanism is the same for both scenarios you mentioned. Mounting the ...
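To make the equivalence concrete, a sketch of both access paths; the bucket name and table path are hypothetical, and credentials (e.g., an instance profile) are assumed to be configured already:

# Option 1: mount once (the mount itself is only metadata), then use the /mnt path
dbutils.fs.mount(source="s3a://my-bucket", mount_point="/mnt/my-bucket")
df_mounted = spark.read.format("delta").load("/mnt/my-bucket/tables/events")

# Option 2: read directly via the s3a URI -- same underlying access mechanism
df_direct = spark.read.format("delta").load("s3a://my-bucket/tables/events")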

Mado
by Valued Contributor II
  • 1862 Views
  • 2 replies
  • 3 kudos

How to apply Pandas functions on PySpark DataFrame?

Hi, I want to apply Pandas functions (like isna, concat, append, etc.) on a PySpark DataFrame in such a way that computations are done on a multi-node cluster. I don't want to convert the PySpark DataFrame into a Pandas DataFrame since, I think, only one node is...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

The best option is to use the pandas API on Spark; it is virtually interchangeable, just a different API over Spark DataFrames:

import pyspark.pandas as ps

psdf = ps.range(10)
sdf = psdf.to_spark().filter("id > 5")
sdf.show()

1 More Replies
AJDJ
by New Contributor III
  • 6298 Views
  • 9 replies
  • 4 kudos

Delta Lake Demo - Not working

Hi there, I imported the Delta Lake demo notebook from the Databricks link, and at command 12 it errors out. I tried other ways and paths but couldn't get past the error. Maybe the notebook is outdated? https://www.databricks.com/notebooks/Demo_Hub-Delta_La...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

Hi @AJ DJ, does @Hubert Dudek's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

8 More Replies
JoeS
by New Contributor III
  • 5963 Views
  • 1 reply
  • 1 kudos

When will GitHub Copilot be available in the Databricks IDE?

It's been quite difficult to stay in VSCode while developing data science experiments and tooling for Databricks. Our team would like to have GitHub Copilot for the Databricks IDE.

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Joe Shull, does @Kaniz Fatma's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

RJB
by New Contributor II
  • 10251 Views
  • 6 replies
  • 0 kudos

Resolved! How to pass outputs from a Python task to a notebook task

I am trying to create a job which has 2 tasks as follows: a Python task which accepts a date and an integer from the user and outputs a list of dates (say, a list of 5 dates in string format), and a notebook which runs once for each of the dates from the d...

Latest Reply
BilalAslamDbrx
Databricks Employee
  • 0 kudos

Just a note that this feature, Task Values, has been generally available for a while.
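For reference, a minimal sketch of passing values between tasks with dbutils.jobs.taskValues; the task key "generate_dates" and the date list are hypothetical:

# In the upstream Python task:
dates = ["2022-11-01", "2022-11-02", "2022-11-03"]
dbutils.jobs.taskValues.set(key="dates", value=dates)

# In the downstream notebook task:
dates = dbutils.jobs.taskValues.get(taskKey="generate_dates", key="dates", default=[])
for d in dates:
    print(d)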

5 More Replies
ShenghuaNi
by New Contributor II
  • 1516 Views
  • 1 reply
  • 0 kudos

$200 Voucher

Has anyone actually received the $200 voucher? I contacted training support but still did not get the voucher. Support just said they need to investigate, but never replied again. I don't know what happened, or whether this is just a false ad.

Latest Reply
ShenghuaNi
New Contributor II
  • 0 kudos

The training support sent me the voucher number.

hari
by Contributor
  • 19772 Views
  • 3 replies
  • 5 kudos

How to add a partition to an existing Delta table

We didn't need to set partitions for our Delta tables as we didn't have many performance concerns, and Delta Lake's out-of-the-box optimization worked great for us. But there is now a need to set a specific partition column for some tables to allow conc...
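For reference, one common approach is to rewrite the table with a partition spec, since Delta does not change partitioning in place; the table and column names here are hypothetical, sketched for illustration:

# Rewrite the table with the desired partition layout
df = spark.read.table("sales")
(df.write.format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")  # allows changing the partition layout
   .partitionBy("event_date")
   .saveAsTable("sales"))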

Latest Reply
hari
Contributor
  • 5 kudos

Updated the description

2 More Replies
Anonymous
by Not applicable
  • 789 Views
  • 0 replies
  • 1 kudos

Heads up! November Community Social! On November 17th we are hosting another Community Social - we're doing these monthly! We want to make sure ...

Heads up! November Community Social! On November 17th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have the chance to connect as a community often. Come network, talk data, and just get social...

Taha_Hussain
by Databricks Employee
  • 1437 Views
  • 0 replies
  • 8 kudos

Ask your technical questions at Databricks Office Hours. October 26 - 11:00 AM - 12:00 PM PT: Register Here. November 9 - 8:00 AM - 9:00 AM GMT: Register...

Ask your technical questions at Databricks Office Hours. October 26 - 11:00 AM - 12:00 PM PT: Register Here. November 9 - 8:00 AM - 9:00 AM GMT: Register Here (NEW EMEA Office Hours). Databricks Office Hours connects you directly with experts to answer all...

pen
by New Contributor II
  • 1742 Views
  • 2 replies
  • 2 kudos

PySpark errors when I pack the source zip package without a dir.

If I send the package made by zipfile via spark.submit.pyFiles, which I zip with this code:

import zipfile, os

def make_zip(source_dir, output_filename):
    with zipfile.ZipFile(output_filename, 'w') as zipf:
        pre_len = len(os.path....

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

I checked, and your code is OK. If you set source_dir and output_filename, please remember to start the paths with /dbfs. If you work on the Community Edition, you can run into problems accessing the underlying filesystem.
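For example, a hypothetical call following that advice, with both paths rooted under /dbfs so DBFS behaves like a local filesystem:

make_zip("/dbfs/tmp/my_package", "/dbfs/tmp/my_package.zip")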

1 More Replies
mghildiy
by New Contributor
  • 1188 Views
  • 1 reply
  • 1 kudos

Checking Spark performance locally

I am experimenting with Spark on my local machine. So, is there some tool/API available to check the performance of the code I write? For example, I write:

val startTime = System.nanoTime()
invoicesDF
  .select(
    count("*").as("Total Number Of Inv...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Please check the details about your code (tasks within jobs) in the Spark UI.
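Beyond the Spark UI, a rough wall-clock measurement can be taken around an action; here is a PySpark sketch (the original question uses Scala, and the DataFrame here is a stand-in). Transformations are lazy, so only timing an action measures real work:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # stand-in for invoicesDF

start = time.perf_counter()
total = df.count()  # the action triggers the actual computation
print(f"Elapsed: {time.perf_counter() - start:.3f}s for {total} rows")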

g96g
by New Contributor III
  • 4846 Views
  • 1 reply
  • 1 kudos

Resolved! How can I pass the df columns as a parameter?

I'm doing self-study and want to pass a df column name as a parameter. I have defined the widget column_name = dbutils.widgets.get('column_name'), which is executing successfully (giving me a column name). Then I'm reading the df and doing some transformation and ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

df2.select([column_name]).write

or

df2.select(column_name).write
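Putting the whole flow together, a sketch with hypothetical table and output names:

# Define the widget, read its value, and use it as a column name
dbutils.widgets.text("column_name", "id")
column_name = dbutils.widgets.get("column_name")

df2 = spark.read.table("my_table")
df2.select(column_name).write.format("delta").mode("overwrite").save("/tmp/out")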

Mado
by Valued Contributor II
  • 16728 Views
  • 2 replies
  • 6 kudos

Resolved! Difference between "spark.table" & "spark.read.table"?

Hi, I want to make a PySpark DataFrame from a table. I would like to ask about the difference between the following commands: spark.read.table(TableName) and spark.table(TableName). Both return a PySpark DataFrame and look similar. Thanks.

Latest Reply
Mado
Valued Contributor II
  • 6 kudos

Hi @Kaniz Fatma, I selected the answer from @Kedar Deshpande as the best answer.
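For anyone landing here later, the two calls behave the same; a quick sketch with a hypothetical table name:

# Both return the same DataFrame for the same table lookup
df1 = spark.table("sales")
df2 = spark.read.table("sales")
assert df1.schema == df2.schema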

1 More Replies
