Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

sarguido
by New Contributor II
  • 6496 Views
  • 5 replies
  • 2 kudos

Delta Live Tables: bulk import of historical data?

Hello! I'm very new to working with Delta Live Tables and I'm having some issues. I'm trying to import a large amount of historical data into DLT. However, letting the DLT pipeline run forever doesn't work with the database we're trying to import from...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Sarah Guido, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers y...

4 More Replies
bulbur
by New Contributor II
  • 3183 Views
  • 1 reply
  • 0 kudos

Use pandas in DLT pipeline

Hi, I am trying to work with pandas in a Delta Live Table. I have created some example code: import pandas as pd import pyspark.sql.functions as F pdf = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "...

Latest Reply
bulbur
New Contributor II
  • 0 kudos

I have taken the advice given by the documentation (However, you can include these functions outside of table or view function definitions because this code is run once during the graph initialization phase.) and moved the toPandas call to a function...
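For reference, a minimal sketch of the pattern described in the reply, assuming a DLT pipeline notebook where spark is available implicitly (table and column names are illustrative, not from the thread):

import dlt
import pandas as pd

# Runs once during graph initialization, outside any table definition.
pdf = pd.DataFrame({"A": ["foo", "bar"], "B": [1, 2]})
reference_df = spark.createDataFrame(pdf)

@dlt.table
def my_table():
    # Only Spark-native operations inside the table function.
    return reference_df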

Devsh_on_point
by New Contributor
  • 1139 Views
  • 1 reply
  • 1 kudos

Liquid Clustering with Partitioning

Hi Team, can we use partitioning and liquid clustering in conjunction? Essentially, partitioning the table first on a specific field and then applying liquid clustering (on other fields)? Alternatively, can we define the order priority of the cluster key ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @Devsh_on_point, no, you can't have both partitioning and liquid clustering on a table. You can treat liquid clustering as a more performant replacement for partitioning. And yes, you are correct, the order of cluster columns doesn't matter: "Databricks recomm...
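A minimal sketch of that recommendation, run from a Python notebook (the table and column names are illustrative):

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_clustered (
        sale_id BIGINT,
        region STRING,
        sale_date DATE
    )
    CLUSTER BY (region, sale_date)  -- cluster column order does not matter
""")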

vannipart
by New Contributor III
  • 2111 Views
  • 1 reply
  • 1 kudos

Resolved! SparkOutOfMemoryError when merging data into a table that already has data

Hello, there is an issue with merging data from a DataFrame into a table (2024, Databricks): Job aborted due to stage failure: Task 17 in stage 1770.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1770.0 (TID 1669) (1x.xx.xx.xx executor 8):...

karthika
by New Contributor II
  • 1496 Views
  • 1 reply
  • 0 kudos

Resolved! Databricks associate certification

I encountered this experience while attempting my 1st Databricks certification. Abruptly, the proctor asked me to show my desk, and after I showed it, he/she asked multiple times. My test got paused multiple times even when I was looking at my screen. I want to ...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

@Cert-TeamOPS @Cert-Team Please help this person. For now, @karthika, use this for filing a ticket with our support team. Please allow the support team 24-48 hours for a resolution. In the meantime, you can review the following documentation: Room req...

hari-prasad
by Valued Contributor II
  • 8574 Views
  • 8 replies
  • 2 kudos

Spark reads GZ file as corrupted data when the file extension has .GZ in upper case

If the file is renamed to file_name.sv.gz (lower-case extension) it works fine; with file_name.sv.GZ (upper-case extension) the data is read as corrupted, meaning Spark simply reads the compressed file as-is.

Labels: Data Engineering, gzip files, spark-csv, spark.read.csv
Latest Reply
hari-prasad
Valued Contributor II
  • 2 kudos

Recently I restarted looking at a solution for this issue, and I found out we can add a few exceptions to allow "GZ" in the Hadoop library, as GzipCodec is invoked from there.
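As a hedged workaround sketch based on the observation in the original post (lower-case .gz reads fine), one could rename the files before reading; the directory path is hypothetical:

# Rename .GZ files to .gz so the extension is recognized by GzipCodec.
src_dir = "dbfs:/mnt/raw/incoming/"  # hypothetical location

for f in dbutils.fs.ls(src_dir):
    if f.name.endswith(".GZ"):
        dbutils.fs.mv(f.path, f.path[:-3] + ".gz")

df = spark.read.csv(src_dir, header=True)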

7 More Replies
vjani
by New Contributor III
  • 2623 Views
  • 4 replies
  • 5 kudos

Resolved! Global init script not running

Hello Databricks Community, I am trying to connect Databricks with Datadog and have added the Datadog agent script as a global init script, but it did not work. Just to check whether the init script is working or not, I have added the below two lines of code to the global init...

Latest Reply
vjani
New Contributor III
  • 5 kudos

Thanks Slash for the reply. That seems to be the reason. I was following https://docs.datadoghq.com/integrations/databricks/?tab=driveronly and missed that configuration.

3 More Replies
anand_k
by New Contributor II
  • 979 Views
  • 1 reply
  • 1 kudos

Variant Support in SQL Alchemy

Databricks now supports the VARIANT data type, which works well in the UI and within Spark environments. However, when working with SQLAlchemy, the VARIANT type doesn't seem to be fully implemented in the latest databricks-sql-connector[sqlalchemy]. ...

Latest Reply
Witold
Databricks Partner
  • 1 kudos

This is actually an open source project. Looking at the code, it seems that VARIANT is not yet supported. Depending on your knowledge of the code base, you could create your own PR, or just open an issue there and wait for the devs to add support.
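Until then, a hedged workaround sketch is to cast the VARIANT column server-side so SQLAlchemy only sees a plain string; the connection URL, table, and column names are illustrative:

from sqlalchemy import create_engine, text

engine = create_engine(
    "databricks://token:<token>@<host>?http_path=<http-path>&catalog=main&schema=default"
)

with engine.connect() as conn:
    # CAST(... AS STRING) turns the VARIANT value into JSON text.
    rows = conn.execute(
        text("SELECT id, CAST(payload AS STRING) AS payload FROM events")
    ).fetchall()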

RobCox
by New Contributor II
  • 1251 Views
  • 1 reply
  • 1 kudos

Unable to ANALYZE external Delta tables due to "failed to initialize filesystem"

Hello, I've recently noticed we've never been using ANALYZE TABLE, after doing z-ordering / liquid clustering investigations and noticing that the query plans for our Delta tables were not considering these paths. I'm trying to execute the following command...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 1 kudos

Hi @RobCox, This might be due to incorrect configuration settings or insufficient permissions. Ensure that the fs.azure.account.key configuration is accurate and that the service principal or identity running the command has the necessary permissions...
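A hedged sketch of that configuration check, with a hypothetical storage account and secret scope:

# Make sure the session can authenticate to the external storage first.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)

spark.sql("ANALYZE TABLE my_catalog.my_schema.my_table COMPUTE STATISTICS FOR ALL COLUMNS")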

jenshumrich
by Contributor
  • 2925 Views
  • 4 replies
  • 3 kudos

Databricks resets notebook all the time

Whenever I run my script it resets the notebook state: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached. at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1467)" T...

Latest Reply
jenshumrich
Contributor
  • 3 kudos

To get closer to the error: there seems to be some mystical size limit.

3 More Replies
reachrishav
by New Contributor II
  • 3443 Views
  • 2 replies
  • 0 kudos

XML to Parquet files

I have a requirement where I need to ingest large XML files and flatten the data before saving it as Parquet files. I have created a Python function to flatten the complex types (array & struct) from the ingested XML dataframe. I'm using the spark-xm...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @reachrishav, since DBR 14.3 there is native support for reading and writing XML files. Maybe check if it works faster than the library that you've used: Read and write XML files | Databricks on AWS. And you've mentioned that you wrote a Python function to fl...
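A minimal sketch of the native reader for this use case (the rowTag value and paths are illustrative):

# Read XML natively (DBR 14.3+), then write Parquet.
df = (spark.read
      .format("xml")
      .option("rowTag", "record")
      .load("dbfs:/mnt/raw/input.xml"))

df.write.mode("overwrite").parquet("dbfs:/mnt/curated/output_parquet")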

1 More Replies
YS1
by Contributor
  • 1607 Views
  • 2 replies
  • 0 kudos

DLT - Importing Python Package

Hello, I'm creating a DLT pipeline where I read a Kafka stream, perform transformations using UDFs, and save the data in multiple tables. When I define the functions directly in the same notebook, the code works fine. However, if I move the code into ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @YS1, have you added the Python file in the pipeline settings, in the list of source code?
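If registering the file as pipeline source code is not an option, a hedged alternative sketch is to put the helper module on sys.path inside the pipeline notebook (the path, module, and table names are hypothetical):

import sys
sys.path.append("/Workspace/Users/<user>/dlt_helpers")  # hypothetical directory

import dlt
import my_udfs  # hypothetical module defining the UDFs

@dlt.table
def cleaned_events():
    return my_udfs.transform(spark.readStream.table("raw_events"))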

1 More Replies
skolukmar
by New Contributor
  • 1411 Views
  • 2 replies
  • 0 kudos

Delta Live Tables: control microbatch size

A Delta Live Tables pipeline reads a Delta table on Databricks. Is it possible to limit the size of a microbatch during data transformation? I am thinking about a solution used by Spark Structured Streaming that enables control of batch size using: .optio...

Latest Reply
lprevost
Contributor III
  • 0 kudos

One other thought -- if you are considering using the pandas_udf API, there is a way to control batch size there: see the pandas_udf guide and note the comments there about Arrow batch size params.
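A hedged sketch of that Arrow batch-size control (the limit value is illustrative):

# Spark feeds pandas_udfs in Arrow batches of at most this many rows.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Each call receives at most ~5000 rows.
    return v + 1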

1 More Replies
gpierard
by New Contributor III
  • 24044 Views
  • 3 replies
  • 1 kudos

Resolved! how to list all spark session config variables

In Databricks I can set a config variable at session level, but it is not found in the context variables: spark.conf.set(f"dataset.bookstore", '123') #dataset_bookstore spark.conf.get(f"dataset.bookstore") #123 scf = spark.sparkContext.getConf() allc =...

Latest Reply
RyanHager
Contributor
  • 1 kudos

A while back I think I found a way to get Python to list all the config values, but I was not able to re-create it. Just make one of your notebook code sections Scala (first line %scala) and use the second line: (spark.conf.getAll).foreach(println)
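A Python alternative sketch that should list the same session-level configs, using the SQL SET command:

spark.conf.set("dataset.bookstore", "123")

# SET returns every session config as (key, value) rows.
for row in spark.sql("SET").collect():
    print(row.key, "=", row.value)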

2 More Replies
Twilight
by Contributor
  • 1754 Views
  • 2 replies
  • 3 kudos

web terminal accessing /Workspace/Users under tmux

I found this old post (https://community.databricks.com/t5/data-engineering/databricks-cluster-web-terminal-different-permissions-with-tmux/td-p/26461) that was never really answered. I am having the same problem. If I am in the raw terminal, I can a...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 3 kudos

Hi @Twilight, To resolve this, ensure the `tmux` session runs under the same user context as the raw terminal, verify environment variables are set correctly, initialize `tmux` with the same shell and environment settings, check for any ACLs on the `...

1 More Replies