Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Raagavi
by New Contributor
  • 2409 Views
  • 1 reply
  • 1 kudos

Is there a way to read the CSV files automatically from on-premises network locations and write back to the same from Databricks?

Latest Reply
Debayan
Databricks Employee
  • 1 kudos

Hi @Raagavi Rajagopal, you can access files on mounted object storage (just an example); please refer to https://docs.databricks.com/files/index.html#access-files-on-mounted-object-storage. And in DBFS, CSV files can be read and writ...
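A rough sketch of what that reply describes in a notebook (the mount point and file names below are placeholders, not from the thread; the mount must point at storage the on-prem network can sync with):

    # Read a CSV through the mount, transform it, and write the result back
    # through the same mount.
    df = spark.read.option("header", True).csv("/mnt/onprem_share/input.csv")
    df.write.mode("overwrite").option("header", True).csv("/mnt/onprem_share/output")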

Dave_Nithio
by Contributor
  • 2054 Views
  • 3 replies
  • 0 kudos

Resolved! Data Engineering with Databricks Module 6.3L Error: Autoload CSV

I am currently taking the Data Engineering with Databricks course and have run into an error. I have also attempted this with my own data and had a similar error. In the lab, we are using autoloader to read a spark stream of csv files saved in the DB...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

As a small aside, you don't need the third argument in the StructFields.
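A minimal illustration of that aside (field names are hypothetical, not the lab's actual schema): StructField's third argument, nullable, defaults to True, so it can be omitted.

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # These two schemas are equivalent: nullable defaults to True.
    schema_explicit = StructType([
        StructField("name", StringType(), True),
        StructField("price", DoubleType(), True),
    ])
    schema_short = StructType([
        StructField("name", StringType()),
        StructField("price", DoubleType()),
    ])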

2 More Replies
plynton
by New Contributor II
  • 1632 Views
  • 1 reply
  • 2 kudos

Resolved! Dataframe to update subset of fields in table...

I have a table that I'll update with multiple inputs (csv). Is there a simple way to update my target when the source fields won't be a 1:1 match? Another challenge I've run into is that my sources don't have a header field, though I guess I could ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

Read your CSV as a dataframe and then do the update using MERGE (upsert).
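A sketch of that suggestion (table and column names are made up for illustration; since the source CSV has no header, names are assigned positionally):

    from delta.tables import DeltaTable

    # Source CSV has no header row, so assign column names positionally.
    src = (spark.read.csv("dbfs:/uploads/updates.csv", header=False)
           .toDF("id", "amount"))

    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
     .merge(src.alias("s"), "t.id = s.id")
     .whenMatchedUpdate(set={"amount": "s.amount"})  # update only a subset of fields
     .whenNotMatchedInsertAll()
     .execute())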

tanjil
by New Contributor III
  • 7010 Views
  • 6 replies
  • 6 kudos

Resolved! Read and transform CSVs in parallel.

I need to read and transform several CSV files and then append them to a single data frame. I am able to do this in Databricks using simple for loops, but I would like to speed this up. Below is the rough structure of my code: for filepath in all_file...
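One common way to get that speed-up (a sketch assuming the files share a schema; the directory is hypothetical, and this is not necessarily the fix the thread settled on): pass all the paths to a single spark.read call and let Spark parallelize the work instead of looping in Python.

    # Collect the CSV paths, then read them in one call instead of a driver-side loop.
    all_files = [f.path for f in dbutils.fs.ls("dbfs:/data/incoming/")
                 if f.path.endswith(".csv")]
    df = (spark.read
          .option("header", True)
          .csv(all_files))   # Spark distributes the reads across the cluster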

Latest Reply
Vidula
Honored Contributor
  • 6 kudos

Hi @tanjil, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

5 More Replies
data_boy_2022
by New Contributor III
  • 9703 Views
  • 7 replies
  • 3 kudos

Data ingest of csv files from S3 using Autoloader is slow

I have 150k small CSV files (~50 MB) stored in S3 which I want to load into a Delta table. All the CSV files are stored in the following structure in S3:
bucket/folder/name_00000000_00000100.csv
bucket/folder/name_00000100_00000200.csv
This is the code I use ...

(Attachments: Cluster Metrics, SparkUI_DAG, SparkUI_Job)
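For context, a sketch of an Auto Loader stream tuned for many small files (the option values, paths, and table name are illustrative assumptions, not the thread's confirmed fix):

    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "s3://bucket/_schemas")   # hypothetical path
          .option("cloudFiles.useNotifications", "true")   # avoid re-listing 150k objects
          .option("cloudFiles.maxFilesPerTrigger", 10000)  # bigger micro-batches
          .load("s3://bucket/folder/"))

    (df.writeStream
     .option("checkpointLocation", "s3://bucket/_checkpoints/ingest")    # hypothetical path
     .trigger(availableNow=True)
     .toTable("bronze_events"))   # hypothetical target table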
Latest Reply
Vidula
Honored Contributor
  • 3 kudos

Hi @Jan R, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

6 More Replies
Data_Engineer3
by Contributor III
  • 5692 Views
  • 4 replies
  • 1 kudos

Unable to read data from Elasticsearch with spark in Databricks.

When I try to read data from Elasticsearch with Spark SQL, it throws an error like: RuntimeException: Error while encoding: java.lang.RuntimeException: scala.collection.convert.Wrappers$JListWrapper is not a valid external type for schema of string...
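A hedged sketch, assuming the elasticsearch-hadoop connector is installed on the cluster: this error typically means a field Elasticsearch returns as an array is being read against a string schema, which the connector's es.read.field.as.array.include option can address (the host, index, and field names below are hypothetical).

    df = (spark.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "es-host:9200")                # hypothetical host
          .option("es.read.field.as.array.include", "tags")  # hypothetical array field
          .load("my-index"))                                 # hypothetical index
    df.printSchema()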

Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi there @KARTHICK N, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

3 More Replies
DipakBachhav
by New Contributor III
  • 10166 Views
  • 5 replies
  • 1 kudos

How to store SQL query result data on a local disk?

I am a newbie to Databricks and am trying to write results into an Excel/CSV file using the command below, but I am getting "'DataFrame' object has no attribute 'to_csv'" errors while executing. I am using a notebook to execute my SQL queries and now want to s...
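For anyone hitting the same error, a minimal sketch (the query and paths are placeholders): to_csv is a pandas method, not a Spark DataFrame method, so either convert to pandas first or let Spark write the CSV.

    # Run the SQL query and capture the result as a Spark DataFrame.
    df = spark.sql("SELECT * FROM my_table")   # placeholder query

    # Option 1: convert to pandas; the /dbfs/ prefix exposes DBFS as a local path.
    df.toPandas().to_csv("/dbfs/FileStore/result.csv", index=False)

    # Option 2: stay in Spark and write a CSV directory.
    df.write.option("header", True).mode("overwrite").csv("dbfs:/FileStore/result_dir")

Files written under /FileStore can typically be downloaded from the workspace afterwards.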

Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi there @Dipak Bachhav, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

4 More Replies
parthibsg
by New Contributor II
  • 1410 Views
  • 1 reply
  • 2 kudos

When to use Dataframes API over Spark SQL

Hello Experts, I am new to Databricks. Building data pipelines, I have both batch and streaming data. Should I use the DataFrames API to read the CSV files, convert to Parquet format, and then do the transformation? Or write to a table using CSV and then use Spark SQL...

Latest Reply
Debayan
Databricks Employee
  • 2 kudos

Hi Rathinam, it would be better to understand the pipeline more in this situation. Writing to a table using CSV and then using Spark SQL will be faster in a few cases than the other approach.
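For orientation, both routes run on the same engine; a minimal sketch of each (paths and names are hypothetical):

    # DataFrame API: read the CSV, transform, write Parquet.
    df = spark.read.option("header", True).csv("dbfs:/raw/events.csv")
    df.filter("amount > 0").write.mode("overwrite").parquet("dbfs:/silver/events")

    # Spark SQL over the same data: a registered view, with the same Catalyst
    # optimizer underneath, so performance is usually comparable.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT * FROM events WHERE amount > 0").show()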

BradSheridan
by Valued Contributor
  • 4125 Views
  • 9 replies
  • 4 kudos

Resolved! How to use cloudFiles to completely overwrite the target

Hey there Community!! I have a client that will produce a CSV file daily that needs to be moved from Bronze -> Silver. Unfortunately, this source file will always be a full set of data....not incremental. I was thinking of using AutoLoader/cloudFil...

Latest Reply
BradSheridan
Valued Contributor
  • 4 kudos

I "up voted'" all of @werners suggestions b/c they are all very valid ways of addressing my need (the true power/flexibility of the Databricks UDAP!!!). However, turns out I'm going to end up getting incremental data afterall :). So now the flow wi...

8 More Replies
ASN
by New Contributor II
  • 13163 Views
  • 5 replies
  • 2 kudos

Python Read CSV - Don't consider comma when it's within the quotes, even if the quotes are not immediate to the separator

I have data like the below, and when reading it as CSV I don't want a comma to be treated as a separator when it's within quotes, even if the quotes are not immediately next to the separator (like record #2). Records 1 and 3 are fine if we use the separator, but it fails on the 2nd record...

(Attachment: input and expected output)
Latest Reply
Pholo
Contributor
  • 2 kudos

Hi, I think you can use these options for the CSV reader: spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv") - especially the unescapedQuoteHandling. You can search for the other options at this l...
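A runnable version of that suggestion (the file path is a placeholder):

    # unescapedQuoteHandling tells the CSV parser how to treat unescaped quotes
    # inside values; BACK_TO_DELIMITER accumulates characters until the next
    # delimiter, so embedded commas stay inside the field.
    df = (spark.read
          .options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER")
          .csv("dbfs:/FileStore/your_file.csv"))
    df.show(truncate=False)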

4 More Replies
Herkimer
by New Contributor II
  • 1048 Views
  • 0 replies
  • 0 kudos

intermittent connection error

I am running dbsqlcli on Windows 10. I have put together the attached cmd file to pull the identity-column data from a series of our tables into individual CSVs, so I can upload them to a PostgreSQL DB to do a comparison of each table to those in the ...

Shay
by New Contributor III
  • 6769 Views
  • 8 replies
  • 6 kudos

Resolved! How do you Upload TXT and CSV files into Shared Workspace in Databricks?

I'm trying to upload the needed files under the right directory of the project so that it works. The files are zipped first, as that is an accepted format. I have a Python project which requires the TXT- and CSV-format files, as they are called and used via .py files ...

Latest Reply
-werners-
Esteemed Contributor III
  • 6 kudos

@Shay Alam, can you share the code with which you read the files? Apparently Python interprets the file format as a language, so it seems some options are not filled in correctly.

7 More Replies
Ambi
by New Contributor III
  • 4654 Views
  • 5 replies
  • 8 kudos

Resolved! Access azure storage account from databricks notebook using pyspark or SQL

I have a storage account - Azure Blob Storage. There I had a container, and inside the container a CSV file. I couldn't read the file using the access key and storage account name. Any idea how to read the file using PySpark/SQL? Thanks in advance.

Latest Reply
Atanu
Databricks Employee
  • 8 kudos

@Ambiga D, you need to mount the storage; you can follow https://docs.databricks.com/data/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-to-dbfs. Thanks.
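A sketch of that mount flow (the account, container, and secret names are placeholders; keep the access key in a secret scope rather than in plain text):

    dbutils.fs.mount(
        source="wasbs://mycontainer@myaccount.blob.core.windows.net",
        mount_point="/mnt/mydata",
        extra_configs={
            "fs.azure.account.key.myaccount.blob.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-key")
        })

    # Once mounted, the container reads like any other DBFS path.
    df = spark.read.option("header", True).csv("/mnt/mydata/file.csv")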

4 More Replies
klllmmm
by New Contributor II
  • 4310 Views
  • 3 replies
  • 1 kudos

Error as no such file when reading CSV file using pandas

I'm trying to read a CSV file saved in DBFS using the pandas read_csv function, but it gives a 'No such file' error.
%fs ls /FileStore/tables/
df = pd.read_csv('/dbfs/FileStore/tables/CREDIT_1.CSV')
df = pd.read_csv('/dbfs:/FileStore/tables/CREDIT_1.CSV')
...

Latest Reply
klllmmm
New Contributor II
  • 1 kudos

Thanks to @Werner Stinckens for the answer. I understood that I have to use Spark to read data from clusters.
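To spell out the resolution for later readers (a sketch; the path matches the one in the question): pandas needs the local-filesystem view of DBFS, while Spark takes the dbfs: URI directly.

    import pandas as pd

    # pandas: prefix with /dbfs and drop the 'dbfs:' scheme; this works on clusters
    # where the /dbfs FUSE mount is available.
    pdf = pd.read_csv("/dbfs/FileStore/tables/CREDIT_1.CSV")

    # Spark: read the dbfs: URI directly; this also scales beyond driver memory.
    sdf = spark.read.option("header", True).csv("dbfs:/FileStore/tables/CREDIT_1.CSV")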

2 More Replies