Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

gb_dbx
by New Contributor II
  • 632 Views
  • 3 replies
  • 2 kudos

Does Databricks plan to create a Python API for the COPY INTO Spark SQL statement in the future?

Hi, I am wondering if Databricks has planned to create a Python API for Spark SQL's COPY INTO statement? In my company we created some kind of Python wrapper around the SQL COPY INTO statement, but it has lots of design issues and is hard to maintain. I ...

Latest Reply
gb_dbx
New Contributor II
  • 2 kudos

Okay, maybe I should take a look at Auto Loader then. I didn't know Auto Loader could basically do the same as COPY INTO; I originally thought it was only used for streaming and not batch ingestion. And Auto Loader has a dedicated Python API then? And ...

2 More Replies
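
A minimal Python sketch of the Auto Loader approach suggested in the replies, assuming hypothetical paths and target table name (Auto Loader's cloudFiles source with an available-now trigger gives batch-style ingestion comparable to COPY INTO):

    # All paths and the target table name below are placeholders.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/_schemas/events")  # schema tracking
        .load("/mnt/landing/events/")
    )

    (
        df.writeStream
        .option("checkpointLocation", "/mnt/_checkpoints/events")  # ingestion bookkeeping
        .trigger(availableNow=True)  # process pending files, then stop (batch-style)
        .toTable("bronze.events")
    )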
biafch
by Contributor
  • 654 Views
  • 2 replies
  • 2 kudos

How to load a JSON file in PySpark with a colon character in the folder name

Hi, I have a folder that contains subfolders that have JSON files. My subfolders look like this: 2024-08-12T09:34:37:452Z, 2024-08-12T09:25:45:185Z. I attach these subfolder names to a variable called FolderName and then try to read my JSON file like this: d...

Latest Reply
szymon_dybczak
Contributor III
  • 2 kudos

Hi @biafch, I've tried to replicate your example and it worked for me. But it seems this is a common problem and some object stores may not support it: [HADOOP-14217] Object Storage: support colon in object path - ASF JIRA (apache.org). Which object...

1 More Reply
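
Since some object stores reject colons in object paths (see HADOOP-14217 above), one pragmatic workaround is to normalize the folder names before reading. A minimal sketch, assuming a hypothetical base path and Databricks dbutils:

    # Replace colons in subfolder names, then read as usual.
    base = "dbfs:/mnt/raw/"  # hypothetical base path

    for entry in dbutils.fs.ls(base):
        if ":" in entry.name:
            dbutils.fs.mv(entry.path, base + entry.name.replace(":", "-"), recurse=True)

    df = spark.read.json(base + "2024-08-12T09-34-37-452Z/")  # renamed folder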
xhead
by New Contributor II
  • 17452 Views
  • 5 replies
  • 2 kudos

Does "databricks bundle deploy" clean up old files?

I'm looking at this page (Databricks Asset Bundles development work tasks) in the Databricks documentation. When repo assets are deployed to a Databricks workspace, it is not clear whether "databricks bundle deploy" will remove files from the target wo...

Data Engineering
bundle
cli
deploy
Latest Reply
xhead
New Contributor II
  • 2 kudos

One further question: the purpose of “databricks bundle destroy” is to remove all previously deployed jobs, pipelines, and artifacts that are defined in the bundle configuration files. Which bundle configuration files? The ones in the repo? Or are ther...

4 More Replies
sarguido
by New Contributor II
  • 3386 Views
  • 5 replies
  • 2 kudos

Delta Live Tables: bulk import of historical data?

Hello! I'm very new to working with Delta Live Tables and I'm having some issues. I'm trying to import a large amount of historical data into DLT. However, letting the DLT pipeline run forever doesn't work with the database we're trying to import from...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Sarah Guido, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers y...

4 More Replies
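
One common pattern for this kind of backfill (an assumption, not something confirmed in the thread) is to load the historical extract as a one-shot batch table in the pipeline, separate from the continuously running source. A minimal sketch with hypothetical paths:

    import dlt

    @dlt.table(name="events_history", comment="One-shot batch backfill")
    def events_history():
        # A batch read runs to completion on each pipeline update instead
        # of tailing the upstream database forever.
        return spark.read.format("parquet").load("/mnt/backfill/events/")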
bulbur
by New Contributor II
  • 974 Views
  • 1 reply
  • 0 kudos

Use pandas in DLT pipeline

Hi, I am trying to work with pandas in a Delta Live Table. I have created some example code: import pandas as pd import pyspark.sql.functions as F pdf = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "...

Latest Reply
bulbur
New Contributor II
  • 0 kudos

I have taken the advice given by the documentation (However, you can include these functions outside of table or view function definitions because this code is run once during the graph initialization phase.) and moved the toPandas call to a function...

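
The documentation quote above boils down to: do the pandas work at module level, during graph initialization, and return only a Spark DataFrame from the table function. A minimal sketch with illustrative names:

    import dlt
    import pandas as pd

    # Runs once during graph initialization, outside any table/view function.
    pdf = pd.DataFrame({"A": ["foo", "bar"], "B": [1, 2]})
    sdf = spark.createDataFrame(pdf)

    @dlt.table(name="from_pandas")  # illustrative table name
    def from_pandas():
        return sdf  # only Spark-side logic inside the table function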
Devsh_on_point
by New Contributor
  • 374 Views
  • 1 reply
  • 1 kudos

Liquid Clustering with Partitioning

Hi Team, can we use partitioning and liquid clustering in conjunction? Essentially, partitioning the table first on a specific field and then applying liquid clustering (on other fields)? Alternatively, can we define the order priority of the cluster key ...

Latest Reply
szymon_dybczak
Contributor III
  • 1 kudos

Hi @Devsh_on_point, no, you can't have both partitioning and liquid clustering on a table. You can treat liquid clustering as a more performant replacement for partitioning. And yes, you are correct, the order of cluster columns doesn't matter: "Databricks recomm...

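
For reference, liquid clustering is declared instead of PARTITIONED BY, and multiple keys go into a single CLUSTER BY clause. A minimal sketch with hypothetical table and column names:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales (
            region STRING,
            sale_date DATE,
            amount DOUBLE
        )
        CLUSTER BY (region, sale_date)  -- instead of PARTITIONED BY
    """)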
Prashanth24
by New Contributor III
  • 2342 Views
  • 5 replies
  • 1 kudos

Resolved! Difference between Liquid clustering and Z-ordering

I am trying to understand the difference between liquid clustering and Z-ordering. As per my understanding, both store the clustered information in ZCubes, which are 100 GB in size. Liquid clustering maintains the ZCube id in the transaction log, so when opti...

Latest Reply
Brahmareddy
Honored Contributor
  • 1 kudos

Hi Prashanth, Liquid Clustering only reorganizes the parts of the data that aren't already clustered, which makes it more efficient. Z-Ordering, on the other hand, reorganizes the entire table or partition every time, which is more resource-intensive.

4 More Replies
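
The operational difference shows up in how OPTIMIZE is invoked; table and column names below are hypothetical:

    # Z-ordering: columns are named on every OPTIMIZE run, and eligible
    # files are rewritten each time.
    spark.sql("OPTIMIZE sales_zorder ZORDER BY (region, sale_date)")

    # Liquid clustering: keys were declared at table creation, so a plain
    # OPTIMIZE incrementally clusters only data that isn't clustered yet.
    spark.sql("OPTIMIZE sales_liquid")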
vannipart
by New Contributor III
  • 790 Views
  • 1 reply
  • 1 kudos

Resolved! SparkOutOfMemoryError when merging data into a table that already has data

Hello, there is an issue with merging data from a DataFrame into a table that already has data (2024, Databricks). Job aborted due to stage failure: Task 17 in stage 1770.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1770.0 (TID 1669) (1x.xx.xx.xx executor 8):...

karthika
by New Contributor II
  • 903 Views
  • 1 reply
  • 0 kudos

Resolved! Databricks associate certification

I encountered this experience while attempting my first Databricks certification. Abruptly, the proctor asked me to show my desk; after I showed it, he/she asked multiple times. My test got paused multiple times even when I was looking at my screen. I want to ...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

@Cert-TeamOPS @Cert-Team, please help this person. For now, @karthika, use this for filing a ticket with our support team. Please allow the support team 24-48 hours for a resolution. In the meantime, you can review the following documentation: Room req...

hprasad
by New Contributor III
  • 5065 Views
  • 8 replies
  • 2 kudos

Spark reads a GZ file as corrupted data when the file extension has .GZ in upper case

If the file is renamed to file_name.sv.gz (lower-case extension) it works fine; with file_name.sv.GZ (upper-case extension) the data is read as corrupted, meaning Spark simply reads the compressed file as-is.

Data Engineering
gzip files
spark-csv
spark.read.csv
Latest Reply
hprasad
New Contributor III
  • 2 kudos

Recently I restarted looking at a solution for this issue. I found out we can add a few exceptions to allow "GZ" in the Hadoop library, since GzipCodec is invoked from there.

7 More Replies
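
Until such an exception lands in the Hadoop codec lookup, a pragmatic workaround (a sketch under assumed paths, not a confirmed fix from the thread) is to lower-case the extension before reading, since GzipCodec matches the ".gz" suffix case-sensitively:

    src_dir = "dbfs:/mnt/raw/"  # hypothetical directory

    # Rename *.GZ to *.gz so the gzip codec is inferred on read.
    for f in dbutils.fs.ls(src_dir):
        if f.name.endswith(".GZ"):
            dbutils.fs.mv(f.path, f.path[: -len(".GZ")] + ".gz")

    df = spark.read.csv(src_dir, header=True)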
NathanE
by New Contributor II
  • 4693 Views
  • 5 replies
  • 10 kudos

Java 21 support with Databricks JDBC driver

Hello, I was wondering if there is any timeline for Java 21 support in the Databricks JDBC driver (current version is 2.34). One of the required changes is to update the Arrow dependency to version 13.0 (current version is 9.0.0). The current worka...

Data Engineering
driver
java21
JDBC
Latest Reply
151640
New Contributor III
  • 10 kudos

Regarding EnableArrow: it is not discussed in the PDF Simba distributes with the driver; it is buried in the release notes.txt file. It may be required to avoid corrupted character values being returned from the driver. This can occur with JRE 8. Hop...

4 More Replies
vjani
by New Contributor III
  • 964 Views
  • 3 replies
  • 5 kudos

Resolved! Global init script not running

Hello Databricks Community, I am trying to connect Databricks with Datadog and have added the Datadog agent script as a global init script, but it did not work. Just to check whether the init script is working or not, I added the below two lines of code to the global init...

Latest Reply
vjani
New Contributor III
  • 5 kudos

Thanks, Slash, for the reply. That seems to be the reason. I was following https://docs.datadoghq.com/integrations/databricks/?tab=driveronly and missed that configuration.

2 More Replies
Zeruno
by New Contributor II
  • 284 Views
  • 0 replies
  • 0 kudos

UDFs with modular code - INVALID_ARGUMENT

I am migrating a massive codebase to PySpark on Azure Databricks, using DLT pipelines. It is very important that the code be modular; that is, I am looking to make use of UDFs, for the time being, that use modules and classes. I am receiving the following...

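
No replies yet; one pattern worth trying for modular UDF code (an assumption, not a confirmed fix for this INVALID_ARGUMENT) is to import helper modules inside the UDF body so each executor resolves them itself. my_utils below is a hypothetical module shipped with the pipeline:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    @udf(StringType())
    def normalize(value):
        import my_utils  # hypothetical helper module, resolved on executors
        return my_utils.normalize(value)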
anand_k
by New Contributor II
  • 371 Views
  • 1 reply
  • 1 kudos

Variant Support in SQLAlchemy

Databricks now supports the VARIANT data type, which works well in the UI and within Spark environments. However, when working with SQLAlchemy, the VARIANT type doesn't seem to be fully implemented in the latest databricks-sql-connector[sqlalchemy]. ...

Latest Reply
Witold
Honored Contributor
  • 1 kudos

This is actually an open source project. Looking at the code, it seems that VARIANT is not yet supported. Depending on your knowledge of the code base, you could create your own PR, or just open an issue there and wait for the devs to add support.

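
Until the dialect supports it natively, a stopgap some projects use (an assumption, not part of the connector's documented API) is a custom SQLAlchemy type that emits VARIANT in DDL:

    from sqlalchemy.types import UserDefinedType

    class Variant(UserDefinedType):
        # Hypothetical stopgap until the dialect supports VARIANT natively.
        cache_ok = True

        def get_col_spec(self, **kw):
            return "VARIANT"  # emitted in CREATE TABLE DDL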
turtleXturtle
by New Contributor II
  • 463 Views
  • 0 replies
  • 1 kudos

Delta sharing speed

Hi - I am comparing the performance of Delta Shared tables, and the speed is 10X slower than when querying locally. Scenario: I am using a 2XS serverless SQL warehouse and have a table with 15M rows and 10 columns, using the below query: select date, co...

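
For a comparison outside the SQL warehouse, the open delta-sharing Python connector can read the same share; a minimal sketch with a hypothetical profile file and share coordinates:

    import delta_sharing

    # Profile file from the share provider, plus share.schema.table coordinates.
    table_url = "/path/to/config.share#my_share.my_schema.my_table"

    pdf = delta_sharing.load_as_pandas(table_url)  # loads the table into pandas
    print(pdf.head())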
