Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

palak231
by New Contributor
  • 1444 Views
  • 0 replies
  • 0 kudos

A/B Testing: A/B testing is the process of comparing two variations, or two versions, of the same item and offering the better of the two...

A/B Testing: A/B testing is the process of comparing two variations, or two versions, of the same item and offering the better of the two. Before doing A/B testing you need to focus on the one problem that you want to resolve and ...

Rajendra
by New Contributor II
  • 1300 Views
  • 0 replies
  • 2 kudos

Does Databricks support writing data in Iceberg format?

As I understand, Databricks supports conversion from Iceberg format to Delta using the command below: CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses the Iceberg manifest for metadata. However...
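For readability, the command from the excerpt as a minimal PySpark sketch (the storage path is the placeholder from the post; `spark` is the ambient SparkSession in a Databricks notebook):

# Convert an existing Iceberg table in place to Delta, reusing the
# Iceberg manifest for file metadata instead of rescanning the files.
spark.sql(
    "CONVERT TO DELTA "
    "iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`"
)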

jakubk
by Contributor
  • 1034 Views
  • 0 replies
  • 0 kudos

Databricks Spark SQL: custom table-valued function + struct really slow (minutes for a single row)

I'm using Azure Databricks. I have a custom table-valued function which takes a URL as a parameter and outputs a single-row table with certain elements from the URL extracted/labelled (I get search activity URLs, and when in a specific format I can retri...
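A minimal sketch of the kind of URL-element extraction described, using Spark's built-in parse_url function rather than a row-at-a-time custom function; the sample URL and column names are illustrative assumptions, not the poster's code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("https://example.com/results?q=databricks",)], ["url"]
)
# parse_url runs inside the JVM, avoiding per-row UDF overhead.
labelled = df.select(
    "url",
    F.expr("parse_url(url, 'HOST')").alias("host"),
    F.expr("parse_url(url, 'PATH')").alias("path"),
    F.expr("parse_url(url, 'QUERY', 'q')").alias("search_term"),
)
labelled.show(truncate=False)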

sage5616
by Valued Contributor
  • 19152 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
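A hedged sketch of what such a single-node jobs cluster spec could look like in a Jobs API payload; the node type and runtime version are illustrative assumptions:

# Single-node cluster: zero workers, the driver does all the work.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}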

2 More Replies
Ross
by New Contributor II
  • 3743 Views
  • 4 replies
  • 0 kudos

Failed to install cluster-scoped SparkR library

Attempting to install SparkR on the cluster; successfully installed other packages such as tidyverse via CRAN. The error is copied below; any help you can provide is greatly appreciated! Databricks Runtime 10.4 LTS. Library installation attempted on ...

Latest Reply
Vivian_Wilfred
Databricks Employee
  • 0 kudos

Hi @Ross Hamilton, I believe SparkR comes built in with Databricks RStudio and you don't have to install it explicitly. You can import it directly with library(SparkR), and from your comment above it works for you. The error message you see could be re...

3 More Replies
ftc
by New Contributor II
  • 3305 Views
  • 3 replies
  • 0 kudos

Resolved! Multi-hop architecture for ingesting data via HTTP API

I'd like to know the design pattern for ingesting data via HTTP API requests. The pattern needs to use a multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply
artsheiko
Databricks Employee
  • 0 kudos

API -> Cloud Storage -> Delta is the more suitable approach. Auto Loader helps ensure no data is lost (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference ev...
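A minimal sketch of the suggested pattern, assuming illustrative paths and table names (`spark` is the ambient SparkSession in a Databricks notebook): JSON landed in cloud storage is picked up incrementally by Auto Loader and appended to a bronze Delta table.

# Discovered files are tracked in the checkpoint (RocksDB), giving
# exactly-once ingestion; schema inference state lives in schemaLocation.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "abfss://lake@account.dfs.core.windows.net/_schemas/api")
    .load("abfss://lake@account.dfs.core.windows.net/landing/api/")
    .writeStream
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_checkpoints/api_bronze")
    .trigger(availableNow=True)
    .toTable("bronze.api_events"))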

2 More Replies
ASN
by New Contributor II
  • 15302 Views
  • 5 replies
  • 2 kudos

Python read CSV - don't treat a comma as a separator when it's within quotes, even if the quotes are not immediately adjacent to the separator

I have data like the below, and when reading it as CSV I don't want a comma to be treated as a separator when it's within quotes, even if the quotes are not immediately adjacent to the separator (like record #2). Records 1 and 3 are fine if we use the separator, but it fails on the 2nd record...

Input and expected Output
Latest Reply
Pholo
Contributor
  • 2 kudos

Hi, I think you can use this option for the CSV reader: spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv") - especially the unescapedQuoteHandling. You can search for the other options at this l...
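The suggested reader as a minimal runnable sketch (the file name is the placeholder from the reply):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# BACK_TO_DELIMITER treats an unescaped quote as part of the value and
# keeps accumulating until the next delimiter, so embedded commas survive.
df = (spark.read
    .options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER")
    .csv("your_file.csv"))
df.show(truncate=False)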

4 More Replies
Rahul_Samant
by Contributor
  • 5334 Views
  • 4 replies
  • 1 kudos

Resolved! Spark SQL Connector

I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, and the expectation is that the connector is pre-installed as per the documentation. Using the below code to fetch...

Latest Reply
artsheiko
Databricks Employee
  • 1 kudos

It seems that .option("databaseName", "test") is redundant here, as you need to include the database name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL.
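A hedged sketch of a read with the database name carried in the JDBC URL; the server, table, and secret names are illustrative assumptions (`spark` and `dbutils` are ambient in a Databricks notebook):

df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")  # Apache Spark connector for SQL Server
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=test")
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load())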

3 More Replies
mick042
by New Contributor III
  • 1577 Views
  • 1 reply
  • 0 kudos

Does Spark utilise a temporary stage when writing to Snowflake? How does that work?

Folks, when I want to push data to Snowflake I need to use a stage for files before copying the data over. However, when I utilise the net.snowflake.spark.snowflake.Utils library and do a spark.write as in... spark.read.format("csv").option("header", ...

Latest Reply
mick042
New Contributor III
  • 0 kudos

Yes, it uses a temporary stage. I should have just looked in the Snowflake query history.
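A hedged sketch of a typical Spark-to-Snowflake write with this connector; the connection options are illustrative assumptions. The connector lands the data in a temporary internal stage and runs COPY INTO behind the scenes, which is what shows up in Snowflake's query history.

# `spark` and `dbutils` are ambient in a Databricks notebook.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": dbutils.secrets.get("my-scope", "sf-password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}
(df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "MY_TABLE")
    .mode("append")
    .save())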

165036
by New Contributor III
  • 2268 Views
  • 3 replies
  • 1 kudos

Resolved! Error message when editing schedule cron expression on job

When attempting to edit the schedule cron expression on one of our jobs we receive the following error message: Cluster validation error: Validation failed for spark_conf, spark.databricks.acl.dfAclsEnabled must be false (is "true"). The spark.databric...

Latest Reply
165036
New Contributor III
  • 1 kudos

FYI this was a temporary Databricks bug. Seems to be resolved now.

2 More Replies
Anonymous
by Not applicable
  • 907 Views
  • 0 replies
  • 4 kudos

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have...

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have the chance to connect as a community often. Come network, talk data, and just get social! Join us for our August ...

AP
by New Contributor III
  • 4690 Views
  • 5 replies
  • 3 kudos

Resolved! AutoOptimize, OPTIMIZE command and VACUUM command: order and production implementation best practices

So Databricks gives us a great toolkit in the form of the OPTIMIZE and VACUUM commands. But in terms of operationalizing them, I am really confused about the best practice. Should we enable "optimized writes" by setting the following at a workspace level? spark.conf.set...
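A hedged sketch of the settings under discussion, with an illustrative table name; the conf keys are the documented Delta optimized-write and auto-compaction settings (`spark` is the ambient SparkSession):

# Session-level equivalents of the workspace setting being asked about.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Periodic maintenance, typically run from a scheduled job:
spark.sql("OPTIMIZE my_db.my_table ZORDER BY (event_date)")
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")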

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know! Thanks to @Werner Stinckens for jumping in, as always!

4 More Replies
Jayesh
by New Contributor III
  • 2971 Views
  • 5 replies
  • 3 kudos

Resolved! How can we copy data from Databricks SQL using a notebook?

Hi team, we have a scenario where we have to connect to Databricks SQL instance 1 from another Databricks instance 2 using a notebook or Azure Data Factory. Can you please help?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Thanks for jumping in to help, @Arvind Ravish, @Hubert Dudek, and @Artem Sheiko!

4 More Replies
