Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

nikhilkumawat
by New Contributor III
  • 10976 Views
  • 10 replies
  • 8 kudos

Resolved! Get file information while using "Trigger jobs when new files arrive" https://docs.databricks.com/workflows/jobs/file-arrival-triggers.html

I am currently trying to use the "Trigger jobs when new files arrive" feature in one of my projects. I have an S3 bucket in which files arrive on random days, so I created a job and set the trigger to the "file arrival" type. And within the no...

Latest Reply
elguitar
New Contributor III
  • 8 kudos

I spent some time configuring a setup similar to this. Unfortunately, there's no simple way to do this. There's only the {{job.trigger.file_arrival.location}} parameter, but that is pretty much useless, since it points to the directory that we are watchi...

9 More Replies
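A minimal sketch of one workaround for the limitation described in that reply, assuming the watched location is wired into the task as a job parameter named trigger_location (set to {{job.trigger.file_arrival.location}}); all paths and table names are placeholders. Because the trigger only exposes the directory, Auto Loader can be used inside the triggered job to work out which files are actually new:

# Hypothetical job parameter bound to {{job.trigger.file_arrival.location}}
trigger_location = dbutils.widgets.get("trigger_location")

# Auto Loader tracks already-processed files, so each triggered run
# only picks up files that arrived since the previous run.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/arrivals")   # placeholder path
      .load(trigger_location))

(df.writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/arrivals")         # placeholder path
   .trigger(availableNow=True)
   .toTable("bronze.arrivals"))                                            # placeholder table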
Bilal1
by New Contributor III
  • 28155 Views
  • 7 replies
  • 2 kudos

Resolved! Simply writing a dataframe to a CSV file (non-partitioned)

When writing a DataFrame to a CSV file in PySpark, a folder is created and a partitioned CSV file is created inside it. I then have to rename this file in order to distribute it to my end user. Is there any way I can simply write my data to a CSV file, with the name ...

Latest Reply
chris0706
New Contributor II
  • 2 kudos

I know this post is a little old, but ChatGPT actually put together a very clean and straightforward solution for me (in Scala):

// Set the temporary output directory and the desired final file path
val tempDir = "/tmp/your_file_name"
val finalOutputP...

6 More Replies
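For readers who want the full pattern, here is a PySpark sketch of the same idea (write to a temporary directory, then move the single part file to the final name); the paths are placeholders and dbutils is assumed to be available:

temp_dir = "dbfs:/tmp/your_file_name"                      # placeholder temporary directory
final_path = "dbfs:/FileStore/your_file_name.csv"          # placeholder final file name

# coalesce(1) forces a single part file inside the output folder
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(temp_dir)

# Move the lone part file to the desired name, then drop the Spark output folder
part_file = [f.path for f in dbutils.fs.ls(temp_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(temp_dir, recurse=True)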
gianni77
by New Contributor
  • 56176 Views
  • 13 replies
  • 4 kudos

How can I export a result of a SQL query from a Databricks notebook?

The "Download CSV" button in the notebook seems to work only for results <=1000 entries. How can I export larger result-sets as CSV?

Latest Reply
igorstar
New Contributor III
  • 4 kudos

If you have a large dataset, you might want to export it to a bucket in Parquet format from your notebook:

%python
df = spark.sql("select * from your_table_name")
df.write.parquet(your_s3_path)

12 More Replies
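If CSV output is required rather than Parquet, the same approach works with the CSV writer; a sketch, with the table name and bucket path as placeholders:

df = spark.sql("select * from your_table_name")
# Spark writes a folder of CSV part files, which works for arbitrarily large results
df.write.option("header", "true").mode("overwrite").csv("s3://your-bucket/exports/your_table_name/")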
sp1
by New Contributor II
  • 13548 Views
  • 5 replies
  • 4 kudos

Resolved! Pass date value as parameter in Databricks SQL notebook

I want to pass yesterday's date (in the example, 20230115*.csv) in the CSV file path. I don't know how to create a parameter and use it here.

CREATE OR REPLACE TEMPORARY VIEW abc_delivery_log
USING CSV
OPTIONS ( header="true", delimiter=",", inferSchema="true", pat...

Latest Reply
Asifpanjwani
New Contributor II
  • 4 kudos

@Retired_mod @sp1 @Chaitanya_Raju @daniel_sahal Hi everyone, I need the same scenario in SQL code, because my DBR cluster does not allow me to run Python code. Error: Unsupported cell during execution. SQL warehouses only support executing SQL cells. I appr...

4 More Replies
yubin-apollo
by New Contributor II
  • 3067 Views
  • 4 replies
  • 0 kudos

COPY INTO skipRows FORMAT_OPTIONS does not work

Based on the COPY INTO documentation, it seems I can use `skipRows` to skip the first `n` rows. I am trying to load a CSV file where I need to skip the first few rows in the file. I have tried various combinations, e.g. setting the header parameter on or ...

Latest Reply
karthik-kobai
New Contributor II
  • 0 kudos

@yubin-apollo: My bad - I had the skipRows in the COPY_OPTIONS and not in the FORMAT_OPTIONS. It works, please ignore my previous comment. Thanks

3 More Replies
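To make the resolution above concrete, here is a sketch of COPY INTO with skipRows placed in FORMAT_OPTIONS rather than COPY_OPTIONS, wrapped in spark.sql so it can run from a Python cell; the target table, source path, and row count are placeholders:

spark.sql("""
  COPY INTO my_catalog.my_schema.my_table              -- placeholder target table
  FROM 's3://your-bucket/landing/'                     -- placeholder source path
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('skipRows' = '2', 'header' = 'true') -- skipRows belongs here, not in COPY_OPTIONS
  COPY_OPTIONS ('mergeSchema' = 'true')
""")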
prapot
by New Contributor II
  • 8231 Views
  • 2 replies
  • 3 kudos

Resolved! How to write a Spark DataFrame to a CSV file without .CRC files in Azure Databricks?

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Spark Read CSV File
val df = spark.read.option("header", true).csv("address.csv")

// Write DataFrame to address directory
df.write...

Latest Reply
Nw2this
New Contributor II
  • 3 kudos

Will your csv have the name prefix 'part-' or can you name it whatever you like?

1 More Replies
Michael42
by New Contributor III
  • 17042 Views
  • 4 replies
  • 7 kudos

Resolved! Want to load a high volume of CSV rows in the fastest way possible (in excess of 5 billion rows). I want the best approach, in terms of speed, for loading into the bronze table.

My source can only deliver CSV format (pipe delimited). My source has the ability to generate multiple CSV files and transfer them to a single upload folder. All rows must go to the same target bronze Delta table. I do not care about the order in which ...

Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Michael Popp, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

3 More Replies
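One common pattern for this kind of bulk load, sketched below: read the whole upload folder in parallel with an explicit schema (avoiding an inference pass over billions of rows) and append to the bronze Delta table. The path, column names, and table name are placeholders:

from pyspark.sql.types import StructType, StructField, StringType

# Declare the real schema up front instead of inferring it from billions of rows
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
])

(spark.read
   .option("sep", "|")              # pipe-delimited source
   .option("header", "true")
   .schema(schema)
   .csv("s3://your-bucket/upload_folder/")
   .write.format("delta")
   .mode("append")
   .saveAsTable("bronze.my_table"))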
System1999
by New Contributor III
  • 5434 Views
  • 7 replies
  • 0 kudos

My 'Data' menu item shows 'No Options' for Databases. How can I fix this?

Hi, I'm new to Databricks and I've signed up for the Community edition. First, I've noticed that I cannot return to a previously created cluster, as I get a message telling me that restarting a cluster is not available to me. OK, inconvenient, but I...

Latest Reply
System1999
New Contributor III
  • 0 kudos

Hi @Suteja Kanuri, I get the error message under Data before I've created a cluster. Then I still get it when I've created a cluster and a notebook (having attached the notebook to the cluster). Thanks.

6 More Replies
Tim_T
by New Contributor
  • 1014 Views
  • 0 replies
  • 0 kudos

Are training/ecommerce data tables available as CSVs?

The course "Apache Spark™ Programming with Databricks" requires data sources such as training/ecommerce/events/events.parquet. Are these available as CSV files? My company's Databricks configuration does not allow me to mount to such repositories, bu...

MRTN
by New Contributor III
  • 10592 Views
  • 4 replies
  • 3 kudos

Load CSV files with slightly different schemas

I have a set of CSV files generated by a system, where the schema has evolved over the years. Some columns have been added, and at least one column has been renamed in newer files. Is there any way to elegantly load these files into a dataframe? I ha...

Latest Reply
MRTN
New Contributor III
  • 3 kudos

For reference, for anybody struggling with the same issues: all online examples using Auto Loader are written as one block statement of the form:

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location di...

3 More Replies
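A sketch of the multi-line Auto Loader form with schema evolution enabled, which is one way to handle columns that were added over time (a renamed column still needs to be reconciled afterwards, e.g. by coalescing the old and new names); all paths and names are placeholders:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/evolving_csv")  # placeholder
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")              # pick up newly added columns
      .option("header", "true")
      .load("dbfs:/mnt/landing/evolving_csv/"))                               # placeholder source

(df.writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/evolving_csv")        # placeholder
   .option("mergeSchema", "true")                                             # let the Delta sink evolve too
   .trigger(availableNow=True)
   .toTable("bronze.evolving_csv"))                                           # placeholder table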
Tracy_
by New Contributor II
  • 10198 Views
  • 5 replies
  • 0 kudos

Incorrect reading of CSV format with inferSchema

Hi all, there is a CSV with a column ID (format: 8 digits with a "D" at the end). When trying to read the CSV with .option("inferSchema", "true"), it returns the ID as a double and trims the "D". Is there any idea (apart from inferSchema=False) to get the correct ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @tracy ng, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your...

4 More Replies
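A sketch of the usual fix, declaring the ID column explicitly as a string so values like "12345678D" survive; the other column and the path are placeholders:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("ID", StringType()),       # keep the trailing "D" instead of inferring a double
    StructField("amount", DoubleType()),   # placeholder for the remaining columns
])

df = spark.read.option("header", "true").schema(schema).csv("/mnt/data/ids.csv")  # placeholder path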
tech2cloud
by New Contributor II
  • 2231 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader streamReader does not include the partition column as part of output.

I have a folder structure at the source such as:
/transaction/date_=2023-01-20/hr_=02/tras01.csv
/transaction/date_=2023-01-20/hr_=03/tras02.csv
where 'date_' and 'hr_' are my partitions and present in the dataset as well. But the streamReader does not read th...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Ravi Vishwakarma, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

1 More Replies
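A sketch of one way to surface the Hive-style partition directories as columns, using Auto Loader's cloudFiles.partitionColumns option; the schema location is a placeholder:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/transaction")  # placeholder
      .option("cloudFiles.partitionColumns", "date_,hr_")                    # expose the directory partitions
      .option("header", "true")
      .load("/transaction/"))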
Chhaya
by New Contributor III
  • 4136 Views
  • 6 replies
  • 2 kudos

Using Great Expectations with Auto Loader

Hi everyone, I have implemented a data pipeline using Auto Loader (bronze --> silver --> gold). While doing this I want to perform some data quality checks, and for that I'm using the Great Expectations library. However, I'm stuck with the below error when trying...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Chhaya Vishwakarma, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your fe...

5 More Replies
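The error itself is cut off above, but a common stumbling block is that Great Expectations cannot validate a streaming DataFrame directly; one workaround is to validate each micro-batch inside foreachBatch. A sketch under that assumption, using the legacy SparkDFDataset API; the table, column, and path names are placeholders:

from great_expectations.dataset import SparkDFDataset

def validate_and_write(batch_df, batch_id):
    # Validate the static micro-batch, then write it on to the silver table
    ge_batch = SparkDFDataset(batch_df)
    result = ge_batch.expect_column_values_to_not_be_null("id")   # placeholder expectation
    if not result.success:
        raise ValueError(f"Data quality check failed for batch {batch_id}")
    batch_df.write.format("delta").mode("append").saveAsTable("silver.my_table")

(spark.readStream.table("bronze.my_table")                        # placeholder source table
   .writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/silver_my_table")
   .foreachBatch(validate_and_write)
   .trigger(availableNow=True)
   .start())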
kll
by New Contributor III
  • 5638 Views
  • 1 replies
  • 1 kudos

Resolved! OSError: Invalid argument when attempting to save a pandas dataframe to csv

I am attempting to save a pandas DataFrame as a CSV to a directory I created in the Databricks workspace, or in the `cwd`.

import pandas as pd
import os

df.to_csv("data.csv", index=False)
df.to_csv(str(os.getcwd()) + "/data.csv", index=False)
...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 1 kudos

Hi @Keval Shah, you can save your DataFrame to CSV in DBFS storage. Please refer to the code below, which might help you:

df = pd.read_csv(StringIO(data), sep=',')
# print(df)
df.to_csv('/dbfs/FileStore/ajay/file1.txt')

rammy
by Contributor III
  • 2447 Views
  • 2 replies
  • 3 kudos

How can we save a data frame in Docx format using PySpark?

I am trying to save a data frame into a document, but it returns the below error: java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.htm #f_d...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi, you cannot do it from PySpark, but you can try to use pandas to save to Excel. There is no docx data source.

1 More Replies
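A sketch of that suggestion: since there is no "docx" Spark data source, a small DataFrame can be converted to pandas and written to Excel instead. This assumes the openpyxl package is installed on the cluster; the paths are placeholders:

pdf = df.toPandas()                          # only sensible for data that fits in driver memory
local_path = "/tmp/report.xlsx"
pdf.to_excel(local_path, index=False)        # requires openpyxl
# Copy from the driver's local disk to DBFS so the file is easy to download or share
dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/exports/report.xlsx")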