Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

raduq
by Contributor
  • 31999 Views
  • 13 replies
  • 12 kudos

How to efficiently process a 50Gb JSON file and store it in Delta?

Hi, I'm a fairly new user and I am using Azure Databricks to process a ~50 GB JSON file containing real estate data. I uploaded the JSON file to Azure Data Lake Gen2 storage and read it into a DataFrame: df = spark.read.option('multiline', '...

Latest Reply
Renzer
New Contributor II
  • 12 kudos

The Spark connector is super slow. I found that loading the JSON into Azure Cosmos DB and then writing queries to pull sections of the data out was 25x faster, because Cosmos DB indexes the JSON. You can stream-read data from Cosmos DB. You can find Python code sn...

12 More Replies
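For readers landing on this thread, a minimal sketch of the plain Spark route, assuming the multiline read from the original post; the paths and partition count are illustrative, not from the thread. Note that a single multiline JSON file is not splittable, so the initial read is parsed by one task, which is why repartitioning before the write helps.

# A minimal sketch, assuming a single multiline JSON file in ADLS Gen2.
# Paths and the partition count are illustrative assumptions.
df = (spark.read
      .option("multiline", "true")
      .json("abfss://container@account.dfs.core.windows.net/raw/real_estate.json"))

# A single multiline file is parsed by one task, so repartition before writing
# to spread the Delta write across the cluster.
(df.repartition(64)
   .write
   .format("delta")
   .mode("overwrite")
   .save("abfss://container@account.dfs.core.windows.net/delta/real_estate"))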
PK225
by New Contributor III
  • 1189 Views
  • 2 replies
  • 1 kudos
Latest Reply
Vartika
Moderator
  • 1 kudos

Hi @Pavan Kumar, hope you are well. Just wanted to see if you were able to find an answer to your question, and if so, would you like to mark an answer as best? It would be really helpful for the other members too. Cheers!

1 More Replies
konda1
by New Contributor
  • 765 Views
  • 0 replies
  • 0 kudos

Getting an "Executor lost" stage failure error when writing a DataFrame to a Delta table or any file format like Parquet, CSV, or Avro

We are working on a multiline, nested (multilevel) JSON file. The file is read and flattened using PySpark, and the DataFrame shows its data with the display() method. When saving that same DataFrame, it gives an executor lost failure error. For some files it is givi...

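No replies yet; as a hedged note, one common mitigation for executor-lost failures on write is to spread the flattened DataFrame across more, smaller tasks before saving. The DataFrame name, partition count, and path below are illustrative assumptions:

# Smaller tasks reduce per-executor memory pressure during the write.
df_flat = df_flat.repartition(200)

(df_flat.write
        .format("delta")
        .mode("overwrite")
        .save("/mnt/datalake/silver/flattened"))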
kk007
by New Contributor III
  • 2503 Views
  • 4 replies
  • 4 kudos

Photon engine throws error "JSON document exceeded maximum allowed size 400.0 MiB"

I am reading an 83 MB JSON file using spark.read.json(storage_path). When I display the data it seems to display fine, but when I run a count, it complains about the file size being more than 400 MB, which is not true. Photon JSON reader erro...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Kamal Kumar: The error message suggests that the JSON document size is exceeding the maximum allowed size of 400 MB. This could be caused by one or more documents in your JSON file being larger than this limit. It is not a bug, but a limitation set ...

3 More Replies
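A hedged way to check the claim in the reply above is to measure document sizes directly by reading the file as raw text; storage_path is from the original post, everything else is illustrative:

from pyspark.sql import functions as F

# Each line of a JSON-lines file is one document; the max line length shows
# whether any single document approaches the per-document limit.
raw = spark.read.text(storage_path)
raw.agg(F.max(F.length("value")).alias("max_doc_chars")).show()

# If the file is one big multiline document, read it whole to see its true size.
whole = spark.read.option("wholetext", "true").text(storage_path)
whole.select(F.length("value").alias("file_chars")).show()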
AmineHY
by Contributor
  • 7685 Views
  • 7 replies
  • 9 kudos

Resolved! How to read JSON files embedded in a list of lists?

Hello, I am trying to read this JSON file but didn't succeed. You can see the head of the file: JSON inside a list of lists. Any idea how to read this file?

Latest Reply
AmineHY
Contributor
  • 9 kudos

Here is my solution; I am sure it can be optimized:

import json

data = []
with open(path_to_json_file, 'r') as f:
    data.extend(json.load(f))

df = spark.createDataFrame(data[0], schema=schema)

6 More Replies
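A small refinement of the accepted answer above, hedged: data[0] keeps only the first inner list, so if the file really is a list of lists, flattening one level captures every record. path_to_json_file and schema are carried over from the reply:

import json

with open(path_to_json_file, 'r') as f:
    nested = json.load(f)

# Flatten one level: [[r1, r2], [r3]] -> [r1, r2, r3]
records = [row for inner in nested for row in inner]
df = spark.createDataFrame(records, schema=schema)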
Aran_Oribu
by New Contributor II
  • 3461 Views
  • 5 replies
  • 2 kudos

Resolved! Create and update a csv/json file in ADLSG2 with Eventhub in Databricks streaming

Hello, this is my first post here and I am a total beginner with Databricks and Spark. Working on an IoT cloud project with Azure, I'm looking to set up continuous stream processing of data. A current architecture already exists thanks to Stream Ana...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

So the Event Hub creates files (JSON/CSV) on ADLS. You can read those files into Databricks with the spark.read.csv/json methods. If you want to read many files in one go, you can use wildcards, e.g. spark.read.json("/mnt/datalake/bronze/directory/*/*...

4 More Replies
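For the continuous version of what the reply above describes, a sketch using Databricks Auto Loader, which picks up new files as they land in the lake; all paths are illustrative assumptions:

# Auto Loader incrementally discovers new JSON files under the source directory.
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/datalake/_schemas/iot")
          .load("/mnt/datalake/bronze/directory/"))

# Continuously append the parsed records to a Delta table.
(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/datalake/_checkpoints/iot")
       .start("/mnt/datalake/silver/iot"))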
BeginnerBob
by New Contributor III
  • 16673 Views
  • 4 replies
  • 2 kudos

Flatten a complex JSON file and load into a delta table

Hi, I am loading a JSON file into Databricks by simply doing the following:

from pyspark.sql.functions import *
from pyspark.sql.types import *

bronze_path = "wasbs://....../140477.json"
df_incremental = spark.read.option("multiline","true").json(bronze_pat...

Latest Reply
Vidula
Honored Contributor
  • 2 kudos

Hi @Lloyd Vickery, does @Werner Stinckens' response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

3 More Replies
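As a hedged sketch of the flattening step itself, assuming the document carries an array column, here called "items" with nested fields (all column and table names are illustrative, not from the post):

from pyspark.sql import functions as F

# explode() turns each array element into its own row; dotted paths pull
# nested struct fields up into flat columns.
df_flat = (df_incremental
           .withColumn("item", F.explode("items"))
           .select(F.col("id"),
                   F.col("item.name").alias("item_name"),
                   F.col("item.value").alias("item_value")))

df_flat.write.format("delta").mode("overwrite").saveAsTable("bronze_flat")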
laus
by New Contributor III
  • 7006 Views
  • 7 replies
  • 3 kudos

Resolved! How to load a json file in pyspark with colon character in file name

Hi, I'm trying to load this JSON file, which contains the colon character in its name: file_name.2022-03-05_11:30:00.json, but I get the error in the screenshot below saying that there is a relative path in an absolute URL. Any idea how to read this file...

Latest Reply
Noopur_Nigam
Valued Contributor II
  • 3 kudos

Hi @Laura Blancarte, I hope that @Pearl Ubaru's answer helped you in resolving your issue. Please let us know if you need more help on this.

6 More Replies
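One hedged workaround for the colon problem: Hadoop URI parsing chokes on the colon, but the local /dbfs FUSE mount treats the name as a plain POSIX path, so renaming the file there may sidestep the error. The paths below are illustrative assumptions:

import os

# Rename via the local FUSE mount, where the colon is just a filename character.
os.rename("/dbfs/mnt/landing/file_name.2022-03-05_11:30:00.json",
          "/dbfs/mnt/landing/file_name.2022-03-05_11-30-00.json")

df = spark.read.json("dbfs:/mnt/landing/file_name.2022-03-05_11-30-00.json")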
Orianh
by Valued Contributor II
  • 6617 Views
  • 7 replies
  • 3 kudos

Resolved! Read JSON with backslash.

Hello guys, I'm trying to read a JSON file which contains backslashes, and I failed to read it via PySpark. I tried a lot of options but haven't solved this yet. I thought to read all the JSON as text and replace all "\" with "/", but PySpark fails to read it as te...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members.

6 More Replies
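Since the thread was resolved without the final code being posted, here is a hedged sketch of the approach the poster described (read as text, replace the backslashes, then parse); the path and schema are illustrative assumptions:

from pyspark.sql import functions as F

# Read raw lines, swap backslashes for forward slashes, then parse the cleaned
# JSON strings against an explicit schema.
raw = spark.read.text("dbfs:/mnt/landing/data_with_backslashes.json")
cleaned = raw.select(F.regexp_replace("value", r"\\", "/").alias("value"))
df = (cleaned
      .select(F.from_json("value", schema).alias("parsed"))
      .select("parsed.*"))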