Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

nikhilkumawat
by New Contributor III
  • 10976 Views
  • 10 replies
  • 8 kudos

Resolved! Get file information while using "Trigger jobs when new files arrive" https://docs.databricks.com/workflows/jobs/file-arrival-triggers.html

I am currently trying to use the "Trigger jobs when new files arrive" feature in one of my projects. I have an S3 bucket in which files arrive on random days, so I created a job and set the trigger to the "file arrival" type. And within the no...

Latest Reply
elguitar
New Contributor III
  • 8 kudos

I spent some time configuring a setup similar to this. Unfortunately, there's no simple way to do this. There's only the {{job.trigger.file_arrival.location}} parameter, but that is pretty much useless, since it points to the directory that we are watchi...

9 More Replies
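A minimal sketch of one workaround for the limitation described in that reply, assuming the watched location is wired into the task as a job parameter named trigger_location (set to {{job.trigger.file_arrival.location}}); all paths and table names are placeholders. Because the trigger only exposes the directory, Auto Loader can be used inside the triggered job to work out which files are actually new:

# Hypothetical job parameter bound to {{job.trigger.file_arrival.location}}
trigger_location = dbutils.widgets.get("trigger_location")

# Auto Loader tracks already-processed files, so each triggered run
# only picks up files that arrived since the previous run.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/arrivals")   # placeholder path
      .load(trigger_location))

(df.writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/arrivals")         # placeholder path
   .trigger(availableNow=True)
   .toTable("bronze.arrivals"))                                            # placeholder table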
Bilal1
by New Contributor III
  • 28155 Views
  • 7 replies
  • 2 kudos

Resolved! Simply writing a dataframe to a CSV file (non-partitioned)

When writing a DataFrame to a CSV file in PySpark, a folder is created and a partitioned CSV file is created inside it. I then have to rename this file in order to distribute it to my end user. Is there any way I can simply write my data to a CSV file, with the name ...

Latest Reply
chris0706
New Contributor II
  • 2 kudos

I know this post is a little old, but ChatGPT actually put together a very clean and straightforward solution for me (in Scala):

// Set the temporary output directory and the desired final file path
val tempDir = "/tmp/your_file_name"
val finalOutputP...

6 More Replies
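For readers who want the full pattern, here is a PySpark sketch of the same idea (write to a temporary directory, then move the single part file to the final name); the paths are placeholders and dbutils is assumed to be available:

temp_dir = "dbfs:/tmp/your_file_name"                      # placeholder temporary directory
final_path = "dbfs:/FileStore/your_file_name.csv"          # placeholder final file name

# coalesce(1) forces a single part file inside the output folder
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(temp_dir)

# Move the lone part file to the desired name, then drop the Spark output folder
part_file = [f.path for f in dbutils.fs.ls(temp_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(temp_dir, recurse=True)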
gianni77
by New Contributor
  • 56176 Views
  • 13 replies
  • 4 kudos

How can I export a result of a SQL query from a Databricks notebook?

The "Download CSV" button in the notebook seems to work only for results <=1000 entries. How can I export larger result-sets as CSV?

Latest Reply
igorstar
New Contributor III
  • 4 kudos

If you have a large dataset, you might want to export it to a bucket in Parquet format from your notebook:

%python
df = spark.sql("select * from your_table_name")
df.write.parquet(your_s3_path)

12 More Replies
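If CSV output is required rather than Parquet, the same approach works with the CSV writer; a sketch, with the table name and bucket path as placeholders:

df = spark.sql("select * from your_table_name")
# Spark writes a folder of CSV part files, which works for arbitrarily large results
df.write.option("header", "true").mode("overwrite").csv("s3://your-bucket/exports/your_table_name/")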
sp1
by New Contributor II
  • 13548 Views
  • 5 replies
  • 4 kudos

Resolved! Pass date value as parameter in Databricks SQL notebook

I want to pass yesterday's date (in the example, 20230115*.csv) in the CSV file path. I don't know how to create a parameter and use it here.

CREATE OR REPLACE TEMPORARY VIEW abc_delivery_log
USING CSV
OPTIONS ( header="true", delimiter=",", inferSchema="true", pat...

Latest Reply
Asifpanjwani
New Contributor II
  • 4 kudos

@Retired_mod @sp1 @Chaitanya_Raju @daniel_sahal Hi everyone, I need the same scenario in SQL code, because my DBR cluster does not allow me to run Python code. Error: Unsupported cell during execution. SQL warehouses only support executing SQL cells. I appr...

4 More Replies
yubin-apollo
by New Contributor II
  • 3067 Views
  • 4 replies
  • 0 kudos

COPY INTO skipRows FORMAT_OPTIONS does not work

Based on the COPY INTO documentation, it seems I can use `skipRows` to skip the first `n` rows. I am trying to load a CSV file where I need to skip the first few rows in the file. I have tried various combinations, e.g. setting the header parameter on or ...

Latest Reply
karthik-kobai
New Contributor II
  • 0 kudos

@yubin-apollo: My bad - I had the skipRows in the COPY_OPTIONS and not in the FORMAT_OPTIONS. It works, please ignore my previous comment. Thanks

3 More Replies
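To make the resolution above concrete, here is a sketch of COPY INTO with skipRows placed in FORMAT_OPTIONS rather than COPY_OPTIONS, wrapped in spark.sql so it can run from a Python cell; the target table, source path, and row count are placeholders:

spark.sql("""
  COPY INTO my_catalog.my_schema.my_table              -- placeholder target table
  FROM 's3://your-bucket/landing/'                     -- placeholder source path
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('skipRows' = '2', 'header' = 'true') -- skipRows belongs here, not in COPY_OPTIONS
  COPY_OPTIONS ('mergeSchema' = 'true')
""")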
prapot
by New Contributor II
  • 8231 Views
  • 2 replies
  • 3 kudos

Resolved! How to write a Spark DataFrame to a CSV file without .CRC files in Azure Databricks?

val spark: SparkSession = SparkSession.builder()
  .master("local[3]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// Spark Read CSV File
val df = spark.read.option("header", true).csv("address.csv")

// Write DataFrame to address directory
df.write...

Latest Reply
Nw2this
New Contributor II
  • 3 kudos

Will your csv have the name prefix 'part-' or can you name it whatever you like?

1 More Replies
Michael42
by New Contributor III
  • 17042 Views
  • 4 replies
  • 7 kudos

Resolved! Want to load a high volume of CSV rows in the fastest way possible (in excess of 5 billion rows). I want the best approach, in terms of speed, for loading into the bronze table.

My source can only deliver CSV format (pipe delimited). My source has the ability to generate multiple CSV files and transfer them to a single upload folder. All rows must go to the same target bronze Delta table. I do not care about the order in which ...

Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Michael Popp, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

3 More Replies
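One common pattern for this kind of bulk load, sketched below: read the whole upload folder in parallel with an explicit schema (avoiding an inference pass over billions of rows) and append to the bronze Delta table. The path, column names, and table name are placeholders:

from pyspark.sql.types import StructType, StructField, StringType

# Declare the real schema up front instead of inferring it from billions of rows
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
])

(spark.read
   .option("sep", "|")              # pipe-delimited source
   .option("header", "true")
   .schema(schema)
   .csv("s3://your-bucket/upload_folder/")
   .write.format("delta")
   .mode("append")
   .saveAsTable("bronze.my_table"))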
System1999
by New Contributor III
  • 5434 Views
  • 7 replies
  • 0 kudos

My 'Data' menu item shows 'No Options' for Databases. How can I fix this?

Hi, I'm new to Databricks and I've signed up for the Community edition. First, I've noticed that I cannot return to a previously created cluster, as I get a message telling me that restarting a cluster is not available to me. OK, inconvenient, but I...

Latest Reply
System1999
New Contributor III
  • 0 kudos

Hi @Suteja Kanuri, I get the error message under Data before I've created a cluster. Then I still get it when I've created a cluster and a notebook (having attached the notebook to the cluster). Thanks.

6 More Replies
Tim_T
by New Contributor
  • 1014 Views
  • 0 replies
  • 0 kudos

Are training/ecommerce data tables available as CSVs?

The course "Apache Spark™ Programming with Databricks" requires data sources such as training/ecommerce/events/events.parquet. Are these available as CSV files? My company's Databricks configuration does not allow me to mount to such repositories, bu...

MRTN
by New Contributor III
  • 10592 Views
  • 4 replies
  • 3 kudos

Load CSV files with slightly different schemas

I have a set of CSV files generated by a system, where the schema has evolved over the years. Some columns have been added, and at least one column has been renamed in newer files. Is there any way to elegantly load these files into a dataframe? I ha...

Latest Reply
MRTN
New Contributor III
  • 3 kudos

For reference, for anybody struggling with the same issues: all online examples using Auto Loader are written as one block statement of the form:

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location di...

3 More Replies
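A sketch of the multi-line Auto Loader form with schema evolution enabled, which is one way to handle columns that were added over time (a renamed column still needs to be reconciled afterwards, e.g. by coalescing the old and new names); all paths and names are placeholders:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/evolving_csv")  # placeholder
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")              # pick up newly added columns
      .option("header", "true")
      .load("dbfs:/mnt/landing/evolving_csv/"))                               # placeholder source

(df.writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/evolving_csv")        # placeholder
   .option("mergeSchema", "true")                                             # let the Delta sink evolve too
   .trigger(availableNow=True)
   .toTable("bronze.evolving_csv"))                                           # placeholder table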
Tracy_
by New Contributor II
  • 10198 Views
  • 5 replies
  • 0 kudos

Incorrect reading of CSV format with inferSchema

Hi all, there is a CSV with a column ID (format: 8 digits with a "D" at the end). When trying to read the CSV with .option("inferSchema", "true"), it returns the ID as a double and trims the "D". Is there any idea (apart from inferSchema=False) to get the correct ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @tracy ng, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your...

4 More Replies
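A sketch of the usual fix, declaring the ID column explicitly as a string so values like "12345678D" survive; the other column and the path are placeholders:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("ID", StringType()),       # keep the trailing "D" instead of inferring a double
    StructField("amount", DoubleType()),   # placeholder for the remaining columns
])

df = spark.read.option("header", "true").schema(schema).csv("/mnt/data/ids.csv")  # placeholder path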
tech2cloud
by New Contributor II
  • 2231 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader streamReader does not include the partition column as part of output.

I have a folder structure at the source such as:
/transaction/date_=2023-01-20/hr_=02/tras01.csv
/transaction/date_=2023-01-20/hr_=03/tras02.csv
where 'date_' and 'hr_' are my partitions and present in the dataset as well. But the streamReader does not read th...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Ravi Vishwakarma, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...

1 More Replies
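A sketch of one way to surface the Hive-style partition directories as columns, using Auto Loader's cloudFiles.partitionColumns option; the schema location is a placeholder:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/transaction")  # placeholder
      .option("cloudFiles.partitionColumns", "date_,hr_")                    # expose the directory partitions
      .option("header", "true")
      .load("/transaction/"))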
Chhaya
by New Contributor III
  • 4136 Views
  • 6 replies
  • 2 kudos

Using Great Expectations with Auto Loader

Hi everyone, I have implemented a data pipeline using Auto Loader (bronze --> silver --> gold). While doing this I want to perform some data quality checks, and for that I'm using the Great Expectations library. However, I'm stuck with the below error when trying...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Chhaya Vishwakarma, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your fe...

5 More Replies
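The error itself is cut off above, but a common stumbling block is that Great Expectations cannot validate a streaming DataFrame directly; one workaround is to validate each micro-batch inside foreachBatch. A sketch under that assumption, using the legacy SparkDFDataset API; the table, column, and path names are placeholders:

from great_expectations.dataset import SparkDFDataset

def validate_and_write(batch_df, batch_id):
    # Validate the static micro-batch, then write it on to the silver table
    ge_batch = SparkDFDataset(batch_df)
    result = ge_batch.expect_column_values_to_not_be_null("id")   # placeholder expectation
    if not result.success:
        raise ValueError(f"Data quality check failed for batch {batch_id}")
    batch_df.write.format("delta").mode("append").saveAsTable("silver.my_table")

(spark.readStream.table("bronze.my_table")                        # placeholder source table
   .writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/silver_my_table")
   .foreachBatch(validate_and_write)
   .trigger(availableNow=True)
   .start())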
kll
by New Contributor III
  • 5638 Views
  • 1 replies
  • 1 kudos

Resolved! OSError: Invalid argument when attempting to save a pandas dataframe to csv

I am attempting to save a pandas DataFrame as a CSV to a directory I created in the Databricks workspace, or in the `cwd`.

import pandas as pd
import os

df.to_csv("data.csv", index=False)
df.to_csv(str(os.getcwd()) + "/data.csv", index=False)
...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 1 kudos

Hi @Keval Shah, you can save your DataFrame to CSV in DBFS storage. Please refer to the code below, which might help you:

df = pd.read_csv(StringIO(data), sep=',')
# print(df)
df.to_csv('/dbfs/FileStore/ajay/file1.txt')

rammy
by Contributor III
  • 2447 Views
  • 2 replies
  • 3 kudos

How can we save a data frame in Docx format using PySpark?

I am trying to save a data frame into a document, but it returns the below error: java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.htm #f_d...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi, you cannot do it from PySpark, but you can try to use pandas to save to Excel. There is no docx data source.

1 More Replies
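A sketch of that suggestion: since there is no "docx" Spark data source, a small DataFrame can be converted to pandas and written to Excel instead. This assumes the openpyxl package is installed on the cluster; the paths are placeholders:

pdf = df.toPandas()                          # only sensible for data that fits in driver memory
local_path = "/tmp/report.xlsx"
pdf.to_excel(local_path, index=False)        # requires openpyxl
# Copy from the driver's local disk to DBFS so the file is easy to download or share
dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/exports/report.xlsx")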