Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have an Avro schema for my Kafka topic, and that schema has defaults. I would like to exclude the defaulted columns from Databricks and just let them default to an empty array. Sample Avro below, trying not to provide the UserFields because I can't...
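A hedged sketch of one possible workaround, assuming a PySpark DataFrame and the UserFields name from the question (the sample DataFrame below is a made-up placeholder): Avro defaults are applied by readers rather than writers, so instead of omitting the column you can populate it explicitly with an empty array before serializing with to_avro.

```python
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import to_avro

# Placeholder DataFrame standing in for the real Kafka payload.
df = spark.createDataFrame([("k1", "some-value")], ["key", "payload"])

# Supply the defaulted field explicitly as an empty array before serializing,
# since Avro defaults are applied by readers, not writers.
df_out = df.withColumn("UserFields", F.expr("cast(array() as array<string>)"))

payload = df_out.select(
    to_avro(F.struct(*[F.col(c) for c in df_out.columns])).alias("value")
)
```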
Hello, I need to add custom metadata to an Avro file. The Avro file contains data. We have tried to use "option" within the write function, but it is not picked up and no error is generated: df.write.format("avro").option("avro.codec", "snappy").option...
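Spark's Avro writer options don't obviously expose arbitrary file-level metadata, so one possible workaround (a sketch, not a confirmed Databricks feature) is to write the file with the fastavro library, which accepts a metadata argument. The schema, records, and output path below are placeholders, and fastavro would need to be installed on the cluster (e.g. %pip install fastavro).

```python
import fastavro

# Placeholder schema and records standing in for the real data; the key point
# is the `metadata` argument, which fastavro writes into the Avro file header.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "string"}],
})
records = [{"id": "1"}, {"id": "2"}]

with open("/dbfs/tmp/events_with_metadata.avro", "wb") as out:
    fastavro.writer(
        out,
        schema,
        records,
        codec="deflate",  # "snappy" also works if python-snappy is installed
        metadata={"source_system": "kafka", "ingested_by": "databricks-job-42"},
    )
```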
I am trying to write a DataFrame to a Kafka topic with an Avro schema for the key and value using a schema registry URL. The to_avro function is not writing to the topic and is throwing an exception with code 40403 or something similar. Is there an alternate way to do this...
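One hedged alternative, sketched below with made-up registry, subject, topic, and column names: fetch the registered schema yourself over the Schema Registry REST API and pass it to to_avro explicitly, then write the result to Kafka. Note the caveat in the comments: plain to_avro output does not include the Confluent wire-format header (magic byte plus schema ID), so consumers that rely on Confluent deserializers will not decode it as-is.

```python
import requests
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import to_avro

# Placeholder data; the real DataFrame must match the registered value schema.
df = spark.createDataFrame([("k1", 42)], ["key", "amount"])

# Pull the latest registered value schema from the registry (URL and subject
# name are placeholders).
registry = "https://my-schema-registry:8081"
value_schema = requests.get(
    f"{registry}/subjects/my-topic-value/versions/latest"
).json()["schema"]

# Serialize the value columns with the fetched schema. The key is written as a
# plain string here; it could be handled the same way with its own subject.
out = df.select(
    F.col("key").cast("string").alias("key"),
    to_avro(F.struct("amount"), value_schema).alias("value"),
)

# Caveat: this produces "raw" Avro bytes without the Confluent wire-format
# header, so it is readable with from_avro but not with Confluent's
# KafkaAvroDeserializer.
(out.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "my-topic")
    .save())
```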
Hi all, I am getting data from Event Hub capture in Avro format and using Auto Loader to process it. I got to the point where I can read the Avro by casting the Body into a string. Now I want to deserialize the Body column so it will be in table forma...
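A minimal sketch of the Auto Loader read described above, with placeholder capture and schema-location paths:

```python
from pyspark.sql import functions as F

raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "avro")
       .option("cloudFiles.schemaLocation", "/tmp/schemas/eventhub_capture")
       .load("/mnt/eventhub-capture/"))

# Event Hub capture stores the payload in a binary Body column; cast it to a
# string so it can be parsed further.
decoded = raw.withColumn("Body", F.col("Body").cast("string"))
```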
If you still want to go with the above approach and don't want to provide the schema manually, then you can fetch a tiny batch with one record and store its schema in a variable using .schema. Once done, you can add a new Body column by providin...
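A sketch of that idea, continuing from the decoded stream in the previous snippet (paths are placeholders): read one record as a batch, infer the JSON schema of the payload, then apply it to the streaming Body column with from_json.

```python
from pyspark.sql import functions as F

# Infer the payload schema from a single sample record read as a batch.
sample = (spark.read.format("avro")
          .load("/mnt/eventhub-capture/")
          .select(F.col("Body").cast("string").alias("Body"))
          .limit(1))

body_schema = spark.read.json(sample.rdd.map(lambda r: r.Body)).schema

# Apply the inferred schema to the streaming column and flatten it into
# top-level fields.
parsed = decoded.withColumn("Body", F.from_json("Body", body_schema))
flattened = parsed.select("Body.*")
```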
I am loading Avro files into Delta tables. I am doing this for multiple tables, and some files are big (2-3 GB) while most of them are small (a few MB). I am using Auto Loader to load the data into the Delta tables. My question is: what is the ...
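For context, one common shape for such a pipeline on recent runtimes (not an answer to the truncated question, just a sketch with placeholder paths, a placeholder table name, and a batch-size cap chosen to illustrate handling the larger files):

```python
(spark.readStream
     .format("cloudFiles")
     .option("cloudFiles.format", "avro")
     .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
     .option("cloudFiles.maxBytesPerTrigger", "1g")  # cap micro-batch size for the 2-3 GB files
     .load("/mnt/landing/orders/")
     .writeStream
     .option("checkpointLocation", "/tmp/checkpoints/orders")
     .trigger(availableNow=True)  # process the backlog in batches, then stop
     .toTable("bronze.orders"))
```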
Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...
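For reference, the task being sized might look roughly like the sketch below (input/output paths and the view name are placeholders):

```python
# Read the batch of Avro files, rewrite them as Parquet, and (re)create a
# persistent view over the Parquet output.
df = spark.read.format("avro").load("/mnt/input/batch/*.avro")

(df.write
   .mode("overwrite")
   .parquet("/mnt/output/batch_parquet"))

spark.sql("""
  CREATE OR REPLACE VIEW analytics.batch_view AS
  SELECT * FROM parquet.`/mnt/output/batch_parquet`
""")
```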
If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.
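If you go that route, a single-node jobs cluster can be described roughly as below when submitting through the Jobs API or SDK (the runtime version and Azure node type are assumptions, not recommendations):

```python
# Sketch of a single-node job cluster spec.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```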
I know that most of the time the Parquet format is great for different workloads, but I still see Avro files in use. In what type of scenario would Avro be a better choice than Parquet?
Hi, I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here. I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file: spark.read.format("c...
Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options.
Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...
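A sketch of what that looks like in a Python notebook, assuming an ADLS Gen1 account accessed with a service principal (the credential values and path are placeholders, and the Hadoop configuration is reached through the underlying JVM SparkContext):

```python
# Set the ADLS credentials on the Hadoop configuration so the spark-avro
# reader picks them up, then read the Avro file.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("dfs.adls.oauth2.client.id", "<application-id>")
hconf.set("dfs.adls.oauth2.credential", "<service-credential>")
hconf.set("dfs.adls.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = (spark.read
      .format("com.databricks.spark.avro")  # just "avro" on newer runtimes
      .load("adl://<datalake-account>.azuredatalakestore.net/path/to/file.avro"))
```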