Data Engineering

Forum Posts

MrsBaker
by New Contributor II
  • 651 Views
  • 1 reply
  • 1 kudos

display() not updating after 1000 rows

Hello folks! I am calling display() on a streaming query sourced from a Delta table. The output from display() shows the new rows added to the source table. But as soon as the output hits 1000 rows, it is not updated anymore. As a r...

Latest Reply
MrsBaker
New Contributor II
  • 1 kudos

An aggregate function followed by the timestamp field sorted in descending order did the trick:

streaming_df.groupBy("field1", "time_field").max("field2").orderBy(col("time_field").desc()).display()
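
A minimal sketch of that pattern, assuming a Delta source table and the hypothetical column names field1, time_field, and field2 from the snippet above:

from pyspark.sql.functions import col

# Read the Delta source table as a stream (table name is a placeholder)
streaming_df = spark.readStream.format("delta").table("source_table")

# Aggregate first, then sort by the timestamp descending so the newest
# groups stay visible within display()'s 1000-row output limit
(streaming_df
    .groupBy("field1", "time_field")
    .max("field2")
    .orderBy(col("time_field").desc())
    .display())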

KateK
by New Contributor II
  • 677 Views
  • 3 replies
  • 1 kudos

How do you correctly access the spark context in DLT pipelines?

I have some code that uses RDDs, with the sc.parallelize() and rdd.toDF() methods to get a DataFrame back out. The code works in a regular notebook (and if I run the notebook as a job) but fails if I do the same thing in a DLT pipeline. The error mess...

Latest Reply
KateK
New Contributor II
  • 1 kudos

Thanks for your help Alex, I ended up re-writing my code with Spark UDFs -- maybe there is a better solution with only the DataFrame API but I couldn't find it. To summarize my problem: I was trying to un-nest a large json blob (the fake data in my f...
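
As an illustration only (not the poster's actual code), un-nesting a JSON column with a Spark UDF might look like the sketch below; the table, column, and field names are hypothetical. The point is that it stays on the DataFrame API and avoids sc.parallelize(), which is what fails inside a DLT pipeline:

import json
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType

# Hypothetical: each row holds a JSON blob with a nested list under "items"
@udf(returnType=ArrayType(StringType()))
def extract_item_names(blob):
    return [item["name"] for item in json.loads(blob).get("items", [])]

df = spark.table("raw_json_table")  # hypothetical source table
unnested = df.withColumn("item_name", explode(extract_item_names("json_blob")))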

2 More Replies
palak231
by New Contributor
  • 240 Views
  • 0 replies
  • 0 kudos

A/B Testing: A/B testing is the process of comparing two variations, or versions, of the same item and choosing the better of the ...

A/B Testing: A/B testing is the process of comparing two variations, or versions, of the same item and choosing the better of the two. Before doing A/B testing you need to focus on the one problem that you want to resolve, and ...

Anonymous
by Not applicable
  • 392 Views
  • 1 reply
  • 4 kudos

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have...

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have the chance to connect as a community often. Come network, talk data, and just get social! Join us for our August ...

Latest Reply
Kaniz
Community Manager
  • 4 kudos

Wow! Super Exciting.

Rajendra
by New Contributor II
  • 774 Views
  • 0 replies
  • 2 kudos

Does Databricks support writing data in Iceberg format?

As I understand it, Databricks supports conversion from Iceberg format to Delta using the command below:

CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses Iceberg manifest for metadata

However...

jakubk
by Contributor
  • 358 Views
  • 0 replies
  • 0 kudos

Databricks Spark SQL custom table-valued function + struct really slow (minutes for a single row)

I'm using Azure Databricks. I have a custom table-valued function which takes a URL as a parameter and outputs a single-row table with certain elements from the URL extracted/labelled (I get search activity URLs, and when in a specific format I can retri...

sage5616
by Valued Contributor
  • 8432 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
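
For context, a rough PySpark sketch of the Avro-to-Parquet workload described in the question; the paths and view name are hypothetical:

# Read a batch of Avro files (path is a placeholder)
df = spark.read.format("avro").load("/mnt/input/avro/")

# Rewrite them as Parquet
df.write.mode("overwrite").parquet("/mnt/output/parquet/")

# Create or replace a persistent view over the Parquet output
spark.sql("""
    CREATE OR REPLACE VIEW input_batch_view AS
    SELECT * FROM parquet.`/mnt/output/parquet/`
""")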

2 More Replies
Ross
by New Contributor II
  • 1425 Views
  • 4 replies
  • 0 kudos

Failed to install cluster scoped SparkR library

Attempting to install SparkR on the cluster; other packages such as tidyverse installed successfully via CRAN. The error is copied below, any help you can provide is greatly appreciated! Databricks Runtime 10.4 LTS. Library installation attempted on ...

Latest Reply
Vivian_Wilfred
Honored Contributor
  • 0 kudos

Hi @Ross Hamilton, I believe SparkR comes built in with Databricks RStudio and you don't have to install it explicitly. You can directly import it with library(SparkR), and it works for you from your above comment. The error message you see could be re...

3 More Replies
Anonymous
by Not applicable
  • 282 Views
  • 1 reply
  • 1 kudos

The Next Databricks Office Hours: Our next Office Hours session is scheduled for February 23, 2022 - 8:00 am PDT. Do you have questions about how to set ...

The Next Databricks Office Hours: Our next Office Hours session is scheduled for February 23, 2022 - 8:00 am PDT. Do you have questions about how to set up or use Databricks? Do you want to get best practices for deploying your use case or tips on data a...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Great!

ftc
by New Contributor II
  • 1213 Views
  • 3 replies
  • 0 kudos

Resolved! Multi-hop architecture for ingesting data via HTTP API

I'd like to know what the design pattern is for ingesting data via HTTP API requests. The pattern needs to use a multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply
artsheiko
Valued Contributor II
  • 0 kudos

The API -> Cloud Storage -> Delta is the more suitable approach. Auto Loader helps not to lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference ev...
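
A minimal Auto Loader sketch of that flow, assuming the API responses have already been landed as JSON files in cloud storage; the paths and table name are placeholders:

# Incrementally ingest the landed JSON files into a bronze Delta table
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/api_bronze/schema")
    .load("/mnt/landing/api_json/")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/api_bronze")
    .trigger(availableNow=True)
    .toTable("bronze_api_events"))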

2 More Replies
ASN
by New Contributor II
  • 5738 Views
  • 5 replies
  • 2 kudos

Python read CSV - don't consider a comma when it's within quotes, even if the quotes are not immediately next to the separator

I have data like below, and when reading it as CSV I don't want to treat a comma as a separator when it's within quotes, even if the quotes are not immediately next to the separator (like record #2). Records 1 and 3 are fine if we use the separator, but it fails on the 2nd record...

Input and expected Output
Latest Reply
Pholo
Contributor
  • 2 kudos

Hi, I think you can use this option for the CSV reader:

spark.read.options(header = True, sep = ",", unescapedQuoteHandling = "BACK_TO_DELIMITER").csv("your_file.csv")

especially the unescapedQuoteHandling. You can search for the other options at this l...

4 More Replies
Rahul_Samant
by Contributor
  • 1516 Views
  • 4 replies
  • 1 kudos

Resolved! Spark SQL Connector

I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, and the expectation is that the connector is pre-installed as per the documentation. Using the below code to fetch...

Latest Reply
artsheiko
Valued Contributor II
  • 1 kudos

It seems that .option("databaseName", "test") is redundant here, as you need to include the database name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL
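
For illustration, a read through that connector typically embeds the database name in the JDBC URL rather than passing it as a separate option; the server name, table, and secret scope below are placeholders:

df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=test")
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))          # placeholder secret
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))  # placeholder secret
    .load())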

3 More Replies
Anonymous
by Not applicable
  • 493 Views
  • 1 reply
  • 3 kudos

March Madness + Data: Here at Databricks we like to use (you guessed it) data in our daily lives. Today kicks off a series called Databrags ...

March Madness + Data: Here at Databricks we like to use (you guessed it) data in our daily lives. Today kicks off a series called Databrags. Databrags are glimpses into how Bricksters and community folks like you use data to solve everyday problems, e...

Latest Reply
Kaniz
Community Manager
  • 3 kudos

@Lindsay Olson, Awesome!

mick042
by New Contributor III
  • 460 Views
  • 1 reply
  • 0 kudos

Does Spark utilise a temporary stage when writing to Snowflake? How does that work?

Folks, when I want to push data to Snowflake I need to use a stage for files before copying data over. However, when I utilise the net.snowflake.spark.snowflake.Utils library and do a spark.write as in... spark.read.format("csv") .option("header", ...

Latest Reply
mick042
New Contributor III
  • 0 kudos

Yes, it uses a temporary stage. I should have just looked in the Snowflake history.
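
For reference, a typical write through the Snowflake Spark connector looks like the sketch below; the connector stages the data in a temporary internal stage and then copies it into the target table behind the scenes. All connection values are placeholders:

# Hypothetical connection options for the Snowflake connector
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Write a DataFrame to a Snowflake table (df and target_table are placeholders)
(df.write
    .format("snowflake")  # alias for net.snowflake.spark.snowflake
    .options(**sf_options)
    .option("dbtable", "target_table")
    .mode("append")
    .save())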
