Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

IonFreeman_Pace
by New Contributor III
  • 6467 Views
  • 4 replies
  • 1 kudos

Resolved! First notebook in ML course fails with wrong runtime

Help! I'm trying to run the first notebook in the Scalable MachIne LEarning (SMILE) course: https://github.com/databricks-academy/scalable-machine-learning-with-apache-spark-english/blob/published/ML%2000a%20-%20Spark%20Review.py It fails on the first...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

It means your cluster type has to be an ML runtime. When you create a cluster in Databricks, you can choose between different runtimes. These have different versions (Spark version), but also different types: for your case you need to select the ML menu o...

3 More Replies
Hoping
by New Contributor
  • 3358 Views
  • 0 replies
  • 0 kudos

Size of each partitioned file (partitioned by default)

When I try a DESCRIBE DETAIL I get the number of files the delta table is partitioned into. How can I check the size of each of the files that make up my entire table? Will I be able to query each partitioned file to understand how they have b...

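The question above asks how to see the size of each underlying data file of a Delta table, not just the file count that DESCRIBE DETAIL reports. On Databricks one would typically list the table's storage directory (e.g. with dbutils.fs.ls); as a minimal, hedged sketch, the same idea in plain Python for a locally accessible table path (Delta stores its data as Parquet files; the helper name is ours, not an API):

```python
import os

def list_file_sizes(table_path):
    """Walk a Delta table's directory and return {relative path: size in bytes}
    for each Parquet data file. Skips the _delta_log and other non-data files."""
    sizes = {}
    for root, _dirs, files in os.walk(table_path):
        for name in files:
            if name.endswith(".parquet"):  # Delta data files are Parquet
                full = os.path.join(root, name)
                sizes[os.path.relpath(full, table_path)] = os.path.getsize(full)
    return sizes
```

For a partitioned table the relative paths include the partition directories (e.g. `year=2023/part-....parquet`), so this also shows how rows were distributed across partitions.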
eric-cordeiro
by Databricks Partner
  • 2167 Views
  • 0 replies
  • 0 kudos

Insufficient Permission when writing to AWS Redshift

I'm trying to write a table in AWS Redshift using the following code: try: (df_source.write .format("redshift") .option("dbtable", f"{redshift_schema}.{table_name}") .option("tempdir", tempdir) .option("url", url) ...

pgruetter
by Contributor
  • 2643 Views
  • 1 replies
  • 0 kudos

Streaming problems after Vacuum

Hi all, To read from a large Delta table, I'm using readStream but with trigger(availableNow=True) as I only want to run it daily. This worked well for an initial load and then incremental loads after that. At some point though, I received an error fro...

param_sen
by New Contributor II
  • 14701 Views
  • 1 replies
  • 1 kudos

Maintain camelCase column names in the bronze layer, or is it advisable to rename columns?

I am utilizing the Databricks autoloader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a Medallion architecture. According to lakehouse principles, the bronze layer should store raw data. Hi dear community, I...

Data Engineering
dataengineering
delta_table
Latest Reply
Dribka
New Contributor III
  • 1 kudos

Hey @param_sen, Navigating the nuances of naming conventions, especially when dealing with different layers in a lakehouse architecture, can be a bit of a puzzle. Your considerations are on point. If consistency across layers is a priority and downst...

eimis_pacheco
by Contributor
  • 11703 Views
  • 3 replies
  • 1 kudos

Resolved! What are the best practices in bronze layer regarding the column data types?

Hi dear community, When I used to work in the Hadoop ecosystem with HDFS, the landing zone was our raw layer, and we used to use the AVRO format for the serialization of this raw data (for the schema evolution feature), only assigning names to columns but n...

Latest Reply
param_sen
New Contributor II
  • 1 kudos

Hi dear community, I am utilizing the Databricks autoloader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a Medallion architecture. According to lakehouse principles, the bronze layer should store raw data wi...

2 More Replies
Karo
by New Contributor
  • 1722 Views
  • 0 replies
  • 0 kudos

Function in Jupyter notebook 12x faster than in Python script

Hello dear community, I wrote some ETL functions, e.g. to count the sessions until a conversion (see below). Therefore I load the data and then execute several small functions for the feature generation. When I run the function feat_session_unitl_conver...

Erik
by Valued Contributor III
  • 4444 Views
  • 1 replies
  • 0 kudos

Run driver on spot instance

The traditional advice seems to be to run the driver "on demand", and optionally the workers on spot. And this is indeed what happens if one chooses to run with spot instances in Databricks. But I am interested in what happens if we run with a dr...

Latest Reply
Erik
Valued Contributor III
  • 0 kudos

Thanks for your answer @Retired_mod! Good overview, and I understand that "driver on-demand and the rest on spot" is good general advice. But I am still considering using spot instances for both, and I am left with two concrete questions: 1: Can w...

Faisal
by Contributor
  • 9606 Views
  • 2 replies
  • 1 kudos

Error while creating delta table with partitions

Hi All, I am unable to create a delta table with the partitioning option. Can someone please point out what I am missing and help me with an updated query? CREATE OR REPLACE TABLE invoice USING DELTA PARTITION BY (year(shp_dt), month(shp_dt)) LOCATION '/ta...

Latest Reply
Emil_Kaminski
Contributor II
  • 1 kudos

@Retired_mod Hi. Is that not exactly what I suggested before? Sorry for the stupid question, but I am learning the rules of earning kudos and getting solutions approved, so suggestions from your end would be appreciated. Thank you.

1 More Replies
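The DDL in the post above fails because it partitions directly on expressions like year(shp_dt); Delta's CREATE TABLE syntax uses PARTITIONED BY over plain columns, and date-part partitioning is usually achieved with generated columns. A hedged sketch of a corrected statement, written as a Python string as it might be submitted via spark.sql (shp_dt and invoice come from the post; the generated column names and the omission of the table's other columns are our assumptions):

```python
# Sketch of a corrected DDL for the post's error: partition on generated
# columns derived from shp_dt instead of on year()/month() expressions.
# Column list is abbreviated; the real table would declare all its columns.
ddl = """
CREATE OR REPLACE TABLE invoice (
  shp_dt DATE,
  shp_yr INT GENERATED ALWAYS AS (year(shp_dt)),
  shp_mo INT GENERATED ALWAYS AS (month(shp_dt))
)
USING DELTA
PARTITIONED BY (shp_yr, shp_mo)
"""
# On Databricks: spark.sql(ddl)
```

With generated columns, Delta populates shp_yr and shp_mo automatically on write, and queries filtering on year(shp_dt)/month(shp_dt) can still benefit from partition pruning.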
hold_my_samosa
by New Contributor II
  • 9438 Views
  • 1 replies
  • 0 kudos

Delta Partition File on Azure ADLS Gen2 Migration

Hello, I am working on a migration project and I am facing an issue while migrating delta tables from Azure ADLS Gen1 to Gen2. So, as per the Microsoft migration pre-requisites: file or directory names with only spaces or tabs, ending with a ., containing ...

Data Engineering
azure
datalake
delta
dtabricks
BWong
by New Contributor III
  • 10773 Views
  • 8 replies
  • 6 kudos

Resolved! Cannot spin up a cluster

Hi, When I try to spin up a cluster, it gives me a bootstrap timeout error: { "reason": { "code": "BOOTSTRAP_TIMEOUT", "parameters": { "databricks_error_message": "[id: InstanceId(i-00b2b7acdd82e5fde), status: INSTANCE_INITIALIZING, workerEnv...

Latest Reply
BWong
New Contributor III
  • 6 kudos

Thanks guys. It's indeed a network issue on the AWS side. It's resolved now

7 More Replies
geertvanhove
by New Contributor III
  • 8151 Views
  • 3 replies
  • 0 kudos

Transform a dataframe column into a concatenated string

Hello, I have a single-column dataframe and I want to transform the content into a string. E.g. a df with the values abc, def, xyz should become "abc, def, xyz". Thanks

Latest Reply
geertvanhove
New Contributor III
  • 0 kudos

sure: %python from pyspark.sql.functions import from_json, col, concat_ws; from pyspark.sql.types import *; schema = StructType([StructField('meterDateTime', StringType(), True), StructField('meterId', LongType(), True), StructField('meteringState', Strin...

2 More Replies
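In PySpark the usual approach for the question above is concat_ws over collect_list, e.g. something like df.agg(concat_ws(", ", collect_list("col"))) (a sketch, not verified here). The core joining step can be shown in plain Python, assuming the single column's values have already been collected into a list (the helper name is ours):

```python
def join_column_values(values, sep=", "):
    """Join collected single-column values into one separator-delimited string,
    mirroring what concat_ws(sep, collect_list(col)) would produce in Spark."""
    return sep.join(str(v) for v in values)
```

So for collected values ["abc", "def", "xyz"] this yields the requested "abc, def, xyz".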
Daniel3
by New Contributor II
  • 12705 Views
  • 2 replies
  • 0 kudos

Resolved! How to use a variable holding a set of values in spark.sql?

Hi, I have a set of values to be searched from a table, for which I was trying to assign them to a variable first and then use the variable in spark.sql, but I'm unable to fetch the records. Please see the image attached and correct my code...

Latest Reply
brockb
Databricks Employee
  • 0 kudos

Hi, One way to address the example in your screenshot is to combine a Python f-string with a Common Table Expression, as shown below. This assumes that in reality the two tables are different, unlike in the provided screens...

1 More Replies
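The f-string idea from the reply above can be sketched as building an IN (...) clause from a Python collection and passing the result to spark.sql. This is a minimal sketch with naive quoting (the function, table, and column names are illustrative, not from the post); for untrusted input, parameterized queries would be the safer route:

```python
def build_in_query(table, column, values):
    """Build a SELECT with an IN (...) clause from a Python collection.

    NOTE: simple single-quote doubling for illustration only; production code
    should prefer parameterized queries to avoid SQL injection.
    """
    quoted = ", ".join("'{}'".format(str(v).replace("'", "''")) for v in values)
    return f"SELECT * FROM {table} WHERE {column} IN ({quoted})"

# Usage on Databricks would look roughly like:
#   query = build_in_query("sales", "region", ["east", "west"])
#   spark.sql(query)
```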
erigaud
by Honored Contributor
  • 3138 Views
  • 3 replies
  • 1 kudos

Incorrect dropped rows count in DLT Event log

Hello, I'm using a DLT pipeline with expectations: expect_or_drop(...). To test it, I added files that contain records that should be dropped, and indeed when running the pipeline I can see some rows were dropped. However, when looking at the DLT Event lo...

Latest Reply
Priyanka_Biswas
Databricks Employee
  • 1 kudos

Hello @erigaud, The issue appears to be related to the details.flow_progress.data_quality.dropped_records field always being 0, despite records being dropped. This might be because the expect_or_drop operator isn't updating the dropped_records field ...

2 More Replies
ekar-databricks
by New Contributor II
  • 13846 Views
  • 3 replies
  • 0 kudos

BigQuery - Databricks integration issue.

I am trying to get BigQuery data into Databricks using notebooks, following the steps in https://docs.databricks.com/external-data/bigquery.html. I believe I am making some mistake with this step and getting the below error. I tried givi...

Latest Reply
Wundermobility
New Contributor II
  • 0 kudos

Hi! Did you get the problem solved? I am facing the same issue, please guide.

2 More Replies