Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Lonnie
by New Contributor

Recommended Redshift-2-Delta Migration Path

Hello All! My team is previewing Databricks and is contemplating the steps to take to perform one-time migrations of datasets from Redshift to Delta. Based on our understanding of the tool, here are our initial thoughts: Export data from Redshift-2-S...

  • 2005 Views
  • 0 replies
  • 0 kudos
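
One plausible shape of that one-time migration, sketched here under the assumption that each Redshift table is first exported (e.g. via UNLOAD) to S3 as Parquet; the bucket, prefix, and target table names are hypothetical:

```python
# Hypothetical S3 export location and target table name.
# Assumes the Redshift table was already UNLOADed to S3 as Parquet.
source_path = "s3://my-redshift-export-bucket/unload/orders/"
target_table = "bronze.orders"

df = spark.read.parquet(source_path)

(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable(target_table))
```
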
Surendra
by New Contributor III

Resolved! Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore. I would like to understand why write performance is different in both cases.

Problem statement: Source file format: .tar.gz. Avg size: 10 MB. Number of tar.gz files: 1000. Each tar.gz file contains around 20,000 CSV files. Requirement: Untar the tar.gz files and write the CSV files to blob storage / intermediate storage layer for further...

  • 8486 Views
  • 3 replies
  • 6 kudos
Latest Reply
Surendra
New Contributor III

@Hubert Dudek Thanks for your suggestions. After creating a storage account in the same region as Databricks, I can see that performance is as expected. Now it is clear that the issue was the /mnt/ location being in a different region than Databricks. I would ...

  • 6 kudos
2 More Replies
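
For context, a minimal sketch of mounting an Azure Blob container so that /dbfs/mnt/ writes land in a storage account you control (and can therefore place in the same region as the workspace); the storage account, container, and secret scope names are hypothetical:

```python
# Hypothetical storage account, container, and secret scope names.
storage_account = "mystorageacct"   # create this in the same region as the workspace
container = "landing"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/landing",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)
```
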
MartinB
by Contributor III

Does partition pruning / partition elimination not work for folder partitioned JSON files? (Spark 3.1.2)

Imagine the following setup: I have log files stored as JSON files, partitioned by year, month, day and hour in physical folders: /logs |-- year=2020 |-- year=2021 `-- year=2022 |-- month=01 `-- month=02 |-- day=01 |-- day=.....

  • 22224 Views
  • 16 replies
  • 3 kudos
Latest Reply
MartinB
Contributor III

@Kaniz Fatma, could you maybe involve a Databricks expert?

  • 3 kudos
15 More Replies
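
A minimal sketch of how partition pruning is normally exercised for folder-partitioned JSON, assuming the layout described in the post (/logs/year=.../month=.../day=.../hour=...); reading from the partition root lets Spark discover the partition columns, and filters on them should show up as PartitionFilters in the physical plan:

```python
from pyspark.sql.functions import col

# Read from the partition root so year/month/day/hour are discovered as columns.
logs = spark.read.json("/logs")

# Filters on partition columns can be pushed down to directory pruning;
# inspect the physical plan's PartitionFilters to confirm.
feb_2021 = logs.filter((col("year") == 2021) & (col("month") == 2))
feb_2021.explain(True)
```
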
al_joe
by Contributor

Where / how does DBFS store files?

I tried to use %fs head to print the contents of a CSV file used in a training: %fs head "/mnt/path/file.csv" but got an error saying cannot head a directory!? Then I did %fs ls on the same CSV file and got a list of 4 files under a directory named as a ...

  • 3810 Views
  • 2 replies
  • 0 kudos
Latest Reply
User16753725182
Databricks Employee

Hi @Al Jo, are you still seeing the error while printing the contents of the CSV file?

  • 0 kudos
1 More Reply
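
What typically explains this: a path written by Spark is a directory of part files, so %fs head on the directory fails. A minimal sketch of inspecting it with dbutils.fs, reusing the hypothetical path from the post:

```python
# The "CSV file" is actually a directory containing part files plus commit markers.
files = dbutils.fs.ls("/mnt/path/file.csv")
for f in files:
    print(f.path, f.size)

# Head one of the part files instead of the directory itself.
part = [f.path for f in files if f.name.startswith("part-")][0]
print(dbutils.fs.head(part, 1024))
```
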
irfanaziz
by Contributor II

Resolved! What is the difference between passing the schema in the options and using the .schema() function in PySpark for a CSV file?

I have observed a very strange behavior with some of our integration pipelines. This week one of the CSV files was getting broken when read with the read function given below: def ReadCSV(files,schema_struct,header,delimiter,timestampformat,encode="utf8...

  • 3077 Views
  • 3 replies
  • 1 kudos
Latest Reply
jose_gonzalez
Databricks Employee

Hi @nafri A, what is the error you are getting? Can you share it, please? Like @Hubert Dudek mentioned, both will call the same APIs.

  • 1 kudos
2 More Replies
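
For reference, a minimal sketch of the two ways of supplying the schema that the thread compares; as the reply notes, both end up calling the same reader API, so behavioral differences usually come from other options (header, delimiter, mode), not from where the schema is passed. The schema and path below are hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema_struct = StructType([
    StructField("id", StringType(), True),
    StructField("created_at", TimestampType(), True),
])

# Variant 1: schema supplied through the .schema() builder method.
df1 = (spark.read
       .schema(schema_struct)
       .option("header", "true")
       .csv("/mnt/data/input/*.csv"))

# Variant 2: schema supplied as an argument to csv(); equivalent under the hood.
df2 = spark.read.csv("/mnt/data/input/*.csv", schema=schema_struct, header=True)
```
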
CleverAnjos
by New Contributor III

Resolved! Best way of loading several csv files in a table

What would be the best way of loading several files like these into a single table to be consumed? https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv https://s3.amazonaws...

  • 5662 Views
  • 5 replies
  • 3 kudos
Latest Reply
CleverAnjos
New Contributor III

Thanks Kaniz, I already have the files. I was discussing the best way to load them.

  • 3 kudos
4 More Replies
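
A minimal sketch of one way to load several similarly-shaped CSV files into a single table, assuming the files have already been copied somewhere the workspace can read; the paths and table name are hypothetical:

```python
paths = [
    "/mnt/nyc-tlc/yellow_tripdata_2019-10.csv",   # hypothetical copies of the files
    "/mnt/nyc-tlc/yellow_tripdata_2019-11.csv",
    "/mnt/nyc-tlc/yellow_tripdata_2019-12.csv",
]

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(paths))                                # csv() accepts a list of paths or a glob

df.write.format("delta").mode("overwrite").saveAsTable("nyc_taxi.yellow_trips")
```
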
Hola1801
by New Contributor

Resolved! Float value changes when loading with Spark? Full path?

Hello, I have created my table in Databricks and at this point everything is perfect: I get the same values as in my CSV. For my column "Exposure" I have: 0 0,00 1 0,00 2 0,00 3 0,00 4 0,00 ... But when I load my fi...

  • 1885 Views
  • 3 replies
  • 3 kudos
Latest Reply
jose_gonzalez
Databricks Employee

Hi @Anis Ben Salem, how do you read your CSV file? Do you use pandas or PySpark APIs? Also, how did you create your table? Could you share more details on the code you are trying to run?

  • 3 kudos
2 More Replies
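
The symptom described (values like 0,00 changing on load) is typical of a comma decimal separator being parsed into a numeric type. A minimal, defensive sketch that reads the affected column as text and normalizes it; the delimiter and path are hypothetical:

```python
from pyspark.sql.functions import regexp_replace, col

raw = (spark.read
       .option("header", "true")
       .option("delimiter", ";")          # hypothetical; comma-decimal files often use ';'
       .csv("/mnt/data/exposures.csv"))   # hypothetical path; columns stay strings by default

clean = raw.withColumn(
    "Exposure",
    regexp_replace(col("Exposure"), ",", ".").cast("double"),
)
```
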
Gapy
by New Contributor II

Auto Loader Schema-Inference and Evolution for parquet files

Dear all, will (and when will) Auto Loader also support schema inference and evolution for Parquet files? At this point it is only supported for JSON and CSV, if I am not mistaken. Thanks and regards, Gapy

  • 1486 Views
  • 1 reply
  • 1 kudos
Latest Reply
Sandeep
Contributor III

@Gasper Zerak, this will be available in the near future (DBR 10.3 or later). Unfortunately, we don't have an SLA at this moment.

  • 1 kudos
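
For reference, once Parquet support arrived in later runtimes (the reply above points at DBR 10.3 or later), the pattern looks much like the JSON/CSV one; a minimal sketch with hypothetical source, schema, and checkpoint locations:

```python
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing_schema")
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
          .load("/mnt/landing/parquet/"))

(stream.writeStream
 .option("checkpointLocation", "/mnt/checkpoints/landing")
 .trigger(once=True)
 .toTable("bronze.landing"))
```
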
dataslicer
by Contributor

Resolved! Unable to save Spark Dataframe to driver node's local file system as CSV file

Running Azure Databricks Enterprise DBR 8.3 ML on a single node, with a Python notebook. I have 2 small Spark dataframes that I am able to source via credential passthrough, reading from ADLS Gen2 via the `abfss://` method, and display the full content ...

  • 9361 Views
  • 4 replies
  • 4 kudos
Latest Reply
Dan_Z
Databricks Employee

Modern Spark operates by a design choice to separate storage and compute. So saving a CSV to the driver's local disk doesn't make sense for a few reasons: the worker nodes don't have access to the driver's disk. They would need to send the data over to...

  • 4 kudos
3 More Replies
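
A minimal sketch of the two approaches that usually make sense for small dataframes given that separation of storage and compute: collect to the driver with pandas, or write through DBFS and copy; small_df stands in for one of the small dataframes from the post and the paths are hypothetical:

```python
# Option 1: small dataframe -> pandas on the driver, then a plain local write.
small_df.toPandas().to_csv("/tmp/output.csv", index=False)

# Option 2: let Spark write to DBFS, then copy the single part file where it is needed.
(small_df.coalesce(1)
 .write.mode("overwrite")
 .option("header", "true")
 .csv("dbfs:/tmp/output_dir"))

part = [f.path for f in dbutils.fs.ls("dbfs:/tmp/output_dir") if f.name.startswith("part-")][0]
dbutils.fs.cp(part, "dbfs:/FileStore/output.csv")
```
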
halfwind22
by New Contributor III

Resolved! Unable to write CSV files to Azure Blob using pandas to_csv()

I am using a Python function to read some data from a GET endpoint and write it as a CSV file to an Azure Blob location. My GET endpoint takes 2 query parameters, param1 and param2. So initially, I have a dataframe paramDf that has two columns param1 and ...

  • 10475 Views
  • 9 replies
  • 10 kudos
Latest Reply
halfwind22
New Contributor III

@Hubert Dudek I can't issue a Spark command on an executor node; it throws an error because foreach distributes the processing.

  • 10 kudos
8 More Replies
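
Since Spark and dbutils calls are not available inside foreach on executors, one workable shape is a driver-side loop that calls the GET endpoint per parameter row and writes each result with pandas through the DBFS FUSE path of a mounted container; the endpoint URL and mount path are hypothetical:

```python
import pandas as pd
import requests

# Driver-side loop: fine for a modest number of parameter combinations.
for row in paramDf.collect():              # paramDf from the post: columns param1, param2
    resp = requests.get(
        "https://example.com/api/data",    # hypothetical endpoint
        params={"param1": row["param1"], "param2": row["param2"]},
        timeout=60,
    )
    pdf = pd.DataFrame(resp.json())
    # /dbfs/... is the FUSE view of a mounted Azure Blob container.
    pdf.to_csv(f"/dbfs/mnt/landing/data_{row['param1']}_{row['param2']}.csv", index=False)
```
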
tarente
by New Contributor III

Resolved! How to create a CSV using a Scala notebook that has " in some columns?

In a project we use Azure Databricks to create CSV files to be loaded into ThoughtSpot. Below is a sample of the code I use to write the file: val fileRepartition = 1 val fileFormat = "csv" val fileSaveMode = "overwrite" var fileOptions = Map ( ...

  • 1458 Views
  • 2 replies
  • 3 kudos
Latest Reply
tarente
New Contributor III

Hi Shan, thanks for the link. I now know more options for creating different CSV files. I have not yet solved the problem, but that is related to the destination application (ThoughtSpot) not being able to load the data in the CSV file correctly. Rega...

  • 3 kudos
1 More Reply
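
The DataFrameWriter options involved are the same from Scala and Python; a minimal PySpark sketch of the quoting/escaping options that usually matter when field values themselves contain ", with a hypothetical dataframe and output path:

```python
(df.repartition(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .option("quote", '"')        # character used to quote fields
 .option("escape", '"')       # embedded quotes are escaped by doubling them
 .option("quoteAll", "true")  # quote every field, which some downstream loaders prefer
 .csv("/mnt/exports/thoughtspot/"))
```
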
Rodrigo_Brandet
by New Contributor

Resolved! Upload CSV files on Databricks by code (not UI)

Hello everyone. I have a process on Databricks where I need to upload a CSV file manually every day. I would like to know if there is a way to import this data (as pandas in Python, for example) without needing to upload this file manually every day util...

  • 3745 Views
  • 3 replies
  • 4 kudos
Latest Reply
-werners-
Esteemed Contributor III

Auto Loader is indeed a valid option, or use some kind of ETL tool which fetches the file and puts it somewhere on your cloud provider, like Azure Data Factory or AWS Glue, etc.

  • 4 kudos
2 More Replies
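
A minimal sketch of the Auto Loader variant of that suggestion, assuming the daily file is dropped into a cloud storage folder the workspace can read; the paths and table name are hypothetical, and a scheduled job can run this instead of a manual upload:

```python
# Incrementally pick up new CSV files from a landing folder and append them to a Delta table.
(spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "csv")
 .option("header", "true")
 .option("cloudFiles.schemaLocation", "/mnt/checkpoints/daily_csv_schema")
 .load("/mnt/landing/daily_csv/")
 .writeStream
 .option("checkpointLocation", "/mnt/checkpoints/daily_csv")
 .trigger(once=True)
 .toTable("bronze.daily_csv"))
```
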
User16869509994
by New Contributor II
  • 1144 Views
  • 1 reply
  • 1 kudos
Latest Reply
aladda
Databricks Employee

Can you provide an example of what exactly you mean? If the reference is to how "Repos" shows up in the UI, that's more of a UX convenience. Repos as such are designed to be a container for version-controlled notebooks that live in the Git reposi...

  • 1 kudos
User16783853501
by Databricks Employee

What types of files does Auto Loader support for streaming ingestion? I see good support for CSV and JSON; how can I ingest files like XML, Avro, Parquet, etc.? Would XML rely on Spark-XML?


  • 1092 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II

Please raise a feature request via the Ideas Portal for XML support in Auto Loader. As a workaround, you could look at reading this with wholeTextFiles (which loads the data into a PairRDD with one record per input file) and parsing it with from_xml from ...

  • 0 kudos
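
Avro and Parquet are handled natively by the batch readers (and, in later runtimes, by Auto Loader); for XML the usual route until native support is the spark-xml library, which must be installed on the cluster. A minimal sketch with a hypothetical path and row tag:

```python
# Requires the com.databricks:spark-xml library to be attached to the cluster.
orders = (spark.read
          .format("xml")
          .option("rowTag", "order")           # hypothetical XML row element
          .load("/mnt/landing/xml/orders/"))   # hypothetical path

orders.write.format("delta").mode("append").saveAsTable("bronze.orders_xml")
```
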