Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Kaijser (New Contributor II)
  • 1914 Views
  • 1 reply
  • 2 kudos

Installing private python Azure DevOps repository without revealing personal access token in pyproject.toml

I want to install a .whl file on my Databricks cluster which includes a private Azure DevOps repository as a dependency in its pyproject.toml file, i.e.:[project] name = "test" description = "test_description." version = "0.1.0" authors = [ { name ...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Aaron Kaijser, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer. Thanks.

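One commonly suggested pattern (a hedged sketch, not a confirmed fix from this thread): keep the token out of pyproject.toml entirely and inject the private index at install time from a secret scope. The scope name, key, and feed URL below are hypothetical placeholders.

```python
# Sketch only: the secret scope "devops"/key "pat" and the feed URL are
# hypothetical; the point is never to hardcode the PAT in pyproject.toml.
import os
import subprocess

pat = dbutils.secrets.get(scope="devops", key="pat")  # available in notebooks
os.environ["PIP_EXTRA_INDEX_URL"] = (
    f"https://build:{pat}@pkgs.dev.azure.com/<org>/_packaging/<feed>/pypi/simple/"
)
# pip now resolves the private dependency from the extra index, so the
# wheel's pyproject.toml can list it by name alone.
subprocess.check_call(
    ["pip", "install", "/dbfs/FileStore/wheels/test-0.1.0-py3-none-any.whl"]
)
```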
by Ajay-Pandey (Esteemed Contributor III)
  • 12155 Views
  • 7 replies
  • 11 kudos

Resolved! Unzip Files

Hi all, I am trying to unzip a file in Databricks but am facing an issue. Please help me if you have any docs or code to share.

Latest Reply
vivek_rawat
New Contributor III
  • 11 kudos

Hey Ajay, you can follow this module to unzip your zip file. To give you a brief idea: it will unzip your file directly into your driver node's storage. So if your compressed data is inside DBFS, then you first have to move that to the driver node and...

6 More Replies
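A minimal sketch of the flow described in the reply, with hypothetical paths: copy the archive from DBFS to the driver's local disk, extract it there, then copy the results back.

```python
# Hypothetical paths; zipfile only sees the driver's local filesystem,
# so stage the archive out of DBFS via the /dbfs FUSE mount first.
import shutil
import zipfile

shutil.copy("/dbfs/mnt/raw/archive.zip", "/tmp/archive.zip")  # DBFS -> driver
with zipfile.ZipFile("/tmp/archive.zip") as zf:
    zf.extractall("/tmp/archive")                             # extract locally
shutil.copytree("/tmp/archive", "/dbfs/mnt/raw/archive")      # driver -> DBFS
```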
by f2008700 (New Contributor III)
  • 15665 Views
  • 6 replies
  • 7 kudos

Configuring average parquet file size

I have S3 as a data source containing a sample TPC dataset (10G, 100G). I want to convert that into Parquet files with an average size of about ~256 MiB. What configuration parameter can I use to set that? I also need the data to be partitioned. And withi...

Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Vikas Goel, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if their suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it can be helpful to o...

5 More Replies
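For reference, one commonly used approach (a hedged sketch; Spark has no direct "average file size" setting for plain Parquet writes): estimate rows per 256 MiB from a profile of the data and cap output files with maxRecordsPerFile. Paths and the partition column are hypothetical.

```python
# Hypothetical paths/column; assumes profiling showed ~1 KiB per row on
# disk, so ~262,144 rows lands near the 256 MiB target per file.
rows_per_file = (256 * 1024 * 1024) // 1024

df = spark.read.parquet("s3://bucket/tpc-input/")
(df.write
   .option("maxRecordsPerFile", rows_per_file)  # caps each output file
   .partitionBy("order_date")                   # hypothetical partition column
   .mode("overwrite")
   .parquet("s3://bucket/tpc-parquet/"))
```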
by Enzo_Bahrami (New Contributor III)
  • 3154 Views
  • 2 replies
  • 0 kudos

Resolved! Input File Path from Autoloader in Delta Live Tables

Hello everyone! I was wondering if there is any way to get the subdirectories in which the file resides while loading using Autoloader with DLT. For example: def customer(): return (  spark.readStream.format('cloudfiles')    .option('clou...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Parsa Bahraminejad, we haven't heard from you since the last response from @Vigneshraja Palaniraj, and I was checking back to see if their suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it can be...

1 More Replies
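The usual answer here is the file-metadata column; a hedged sketch (table name, format, and paths are hypothetical):

```python
# _metadata.file_path exposes the full source path of each ingested file;
# subdirectories can be parsed out of it downstream.
import dlt
from pyspark.sql.functions import col

@dlt.table
def customer():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")       # hypothetical format
        .load("/mnt/landing/customers/")           # hypothetical path
        .withColumn("source_file", col("_metadata.file_path"))
    )
```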
by cnjrules (New Contributor III)
  • 3095 Views
  • 3 replies
  • 0 kudos

Resolved! Reference file name when using COPY INTO?

When using the COPY INTO statement, is it possible to reference the current file name in the SELECT statement? A generic example is shown below; hoping I can log the file name in the target table. COPY INTO my_table FROM (SELECT key, index, textData, ...

Latest Reply
cnjrules
New Contributor III
  • 0 kudos

Found the info I was looking for on the page below: https://docs.databricks.com/ingestion/file-metadata-column.html

2 More Replies
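Based on that docs page, a hedged sketch of what the statement can look like (table, path, and columns are hypothetical):

```python
# _metadata is selectable inside COPY INTO on recent runtimes, which
# lets the source file name land in the target table.
spark.sql("""
  COPY INTO my_table
  FROM (
    SELECT key, index, textData,
           _metadata.file_path AS source_file
    FROM 's3://bucket/raw/'
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true')
""")
```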
by nyehia (Contributor)
  • 13789 Views
  • 19 replies
  • 1 kudos

Can not access SQL files in the Shared workspace

Hey, we have an issue: we can access the SQL files whenever the notebook is in the repo path, but whenever the CICD pipeline imports the repo notebooks and SQL files to the shared workspace, we can list the SQL files but cannot read them. We cha...

Latest Reply
karthik_p
Esteemed Contributor
  • 1 kudos

@Nermin Yehia, yes, since you are moving the files to a different location manually, just set Can Manage permissions on the target and that should take care of everything.

18 More Replies
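One way to script that grant (a hedged sketch, not taken from the thread, using the Databricks Permissions REST API; the host, token, group, and directory_id are hypothetical placeholders):

```python
# Grants CAN_MANAGE on the imported directory so the copied SQL files
# stay readable; directory_id comes from the Workspace get-status API.
import requests

host = "https://<workspace>.azuredatabricks.net"
directory_id = "1234567890"  # hypothetical

resp = requests.patch(
    f"{host}/api/2.0/permissions/directories/{directory_id}",
    headers={"Authorization": "Bearer <token>"},
    json={"access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE"}
    ]},
)
resp.raise_for_status()
```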
by Praveen (New Contributor II)
  • 9569 Views
  • 8 replies
  • 1 kudos

Resolved! Pass Typesafe config file to the Spark Submit Job

Hello everyone! I am trying to pass a Typesafe config file to the spark-submit task and print the details in the config file. Code: import org.slf4j.{Logger, LoggerFactory} import com.typesafe.config.{Config, ConfigFactory} import org.apache.spa...

Latest Reply
source2sea
Contributor
  • 1 kudos

I've experienced similar issues; please help to answer how to get this working. I've tried using the below as either a /dbfs/mnt/blah path or a dbfs:/mnt/blah path, in either spark_submit_task or spark_jar_task (via cluster spark_conf for Java options); no su...

7 More Replies
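One pattern that is often suggested for this (a hedged sketch, not a confirmed fix from the thread): distribute the .conf file alongside the job and point Typesafe at it through the driver's Java options. All paths and the class name below are hypothetical.

```python
# Hypothetical spark_submit_task payload for the Jobs API: ship the
# config with --files and set -Dconfig.file on the driver.
spark_submit_task = {
    "spark_submit_task": {
        "parameters": [
            "--files", "dbfs:/mnt/conf/app.conf",
            "--conf",
            "spark.driver.extraJavaOptions=-Dconfig.file=/dbfs/mnt/conf/app.conf",
            "--class", "com.example.Main",          # hypothetical main class
            "dbfs:/mnt/jars/app-assembly.jar",      # hypothetical jar path
        ]
    }
}
```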
by andrew0117 (Contributor)
  • 5402 Views
  • 4 replies
  • 0 kudos

Resolved! partition on a csv file

When I use SQL code like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake contai...

Latest Reply
pvignesh92
Honored Contributor
  • 0 kudos

Hi @andrew li, when you specify a path with the LOCATION keyword, Spark will consider that to be an EXTERNAL table. So when you drop the table, your underlying data, if any, will not be cleared. So in your case, as this is an external table, your folder s...

3 More Replies
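A short illustration of the point in the reply (hypothetical path):

```python
# LOCATION makes the table external: DROP TABLE removes only the
# metastore entry, leaving the CSV folder in the lake untouched.
spark.sql("""
  CREATE TABLE myTable (column1 STRING, column2 STRING)
  USING CSV
  OPTIONS ('delimiter' = ',', 'header' = 'true')
  LOCATION 'abfss://container@account.dfs.core.windows.net/path/to/csv'
""")

spark.sql("DROP TABLE myTable")  # files under LOCATION remain
```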
by andrew0117 (Contributor)
  • 5180 Views
  • 6 replies
  • 2 kudos

Index a DataFrame from a CSV file based on the file's original order (not based on any specific column, but on the entire row) using Spark

How to guarantee the index always follows the file's original order, no matter what? Currently, I'm using val df = spark.read.options(Map("header" -> "true", "inferSchema" -> "true")).csv("filePath").withColumn("index", monotonically_increasing...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

monotonically_increasing_id will not, as it only guarantees that every partition gets separate IDs. What is the whole code? Do you load a directory with a lot of CSVs? What does "original order" mean? Are the CSVs ordered by file creation date, by file name? o...

5 More Replies
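For a single file, one hedged workaround (shown in PySpark; the path is hypothetical) is to force one partition so zipWithIndex assigns ids in the order rows are read:

```python
# Only safe for a single CSV: coalesce(1) yields one deterministic scan
# order, and zipWithIndex numbers rows in that order.
from pyspark.sql import Row

indexed = (spark.read
           .options(header="true", inferSchema="true")
           .csv("dbfs:/mnt/data/file.csv")      # hypothetical path
           .coalesce(1)
           .rdd
           .zipWithIndex())

df = indexed.map(lambda p: Row(**p[0].asDict(), index=p[1])).toDF()
```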
by Sagacious (New Contributor II)
  • 15999 Views
  • 5 replies
  • 0 kudos

How to upload large files to Databricks, and how to unzip files successfully?

I have two JSON files, one ~3 GB and one ~5 GB. I am unable to upload them to Databricks Community Edition as they exceed the max allowed uploadable file size (~2 GB). If I zip them I am able to upload them, but I am also having issues figuring out ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Sage Olson, hope everything is going great. Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we...

4 More Replies
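One workaround that is sometimes suggested (a hedged sketch, not from the thread): split each JSON into chunks under the upload limit, upload the parts, then reassemble them on the driver. Paths are hypothetical, and the driver needs enough local disk for the result.

```python
# Locally, before upload (shell):  split -b 1900m big.json big.json.part-
# After uploading the parts to DBFS, stitch them back together:
import glob
import shutil

with open("/tmp/big.json", "wb") as out:
    for part in sorted(glob.glob("/dbfs/FileStore/parts/big.json.part-*")):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)

shutil.copy("/tmp/big.json", "/dbfs/FileStore/big.json")  # back to DBFS
```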
by Tonny_Stark (New Contributor III)
  • 3227 Views
  • 3 replies
  • 0 kudos

FileNotFoundError: [Errno 2] No such file or directory:

I have the following error in Databricks when I want to unzip files: FileNotFoundError: [Errno 2] No such file or directory. But the file is there; I already tried several ways and nothing works. I have tried modifying by placing /dbfs/mnt/dbfs/mnt/d...

Latest Reply
karthik_p
Esteemed Contributor
  • 0 kudos

@Alfredo Vallejos, then your file is a tar.gz file, right? Have you tried the tar command instead of unzip?

2 More Replies
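A minimal sketch of that suggestion in Python (paths are hypothetical): tar-aware tooling plus the /dbfs FUSE path.

```python
# tarfile handles .tar.gz archives that zip tools reject; extract to
# driver-local storage first, then move results where needed.
import tarfile

with tarfile.open("/dbfs/mnt/raw/data.tar.gz", mode="r:gz") as tf:
    tf.extractall("/tmp/data")
```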
by Larrio (New Contributor III)
  • 7029 Views
  • 6 replies
  • 3 kudos

Autoloader - understanding missing file after schema update.

Hello, concerning Autoloader (based on https://docs.databricks.com/ingestion/auto-loader/schema.html), so far what I understand is that when it detects a schema update, the stream fails and I have to rerun it to make it work; that's OK. But once I rerun it, ...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Lucien Arrio, hope all is well! Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thank...

5 More Replies
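For context, a hedged sketch of the relevant options from the linked docs (paths are hypothetical): with addNewColumns the stream still stops on a schema change, but on restart it resumes from the checkpoint, so files seen during the failed batch are reprocessed rather than skipped.

```python
# Hypothetical paths; schemaLocation tracks inferred schema versions,
# and the checkpoint guarantees no file is lost across the restart.
(spark.readStream.format("cloudFiles")
   .option("cloudFiles.format", "csv")
   .option("cloudFiles.schemaLocation", "/mnt/chk/schema")
   .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
   .load("/mnt/landing/")
   .writeStream
   .option("checkpointLocation", "/mnt/chk/stream")
   .start("/mnt/bronze/events"))
```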
by uv (New Contributor II)
  • 5995 Views
  • 3 replies
  • 2 kudos

Parquet to csv delta file

Hi Team, I have a parquet file in an S3 bucket which is a Delta file. I am able to read it, but I am unable to write it as a CSV file. Getting the following error when I am trying to write: A transaction log for Databricks Delta was found at `s3://path/a...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @yuvesh kotiala, hope all is well! Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

2 More Replies
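That error usually means the folder is a Delta table being read as plain Parquet; a hedged sketch of the usual fix (paths are hypothetical):

```python
# Read with the delta reader instead of .parquet(), then write CSV.
df = spark.read.format("delta").load("s3://bucket/delta-table/")

(df.coalesce(1)                       # optional: single CSV output file
   .write.mode("overwrite")
   .option("header", "true")
   .csv("s3://bucket/csv-out/"))
```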
by Akshith_Rajesh (New Contributor III)
  • 2462 Views
  • 4 replies
  • 1 kudos

Does Databricks lock the file in ADLS Gen2 before writing (append) to it? If yes, how can we detect that the file is locked?

I have a requirement: I am running 2 notebooks in parallel, and I want to overwrite the file in parallel. If 2 notebooks try to overwrite the file at the same time, will I lose data because of overwriting the file at the same time? I want to overwr...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Rajesh Akshith, hope all is well! Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

3 More Replies
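For what it's worth, a hedged sketch of the approach usually recommended instead of file locks (not a confirmed answer from the thread; the path is hypothetical): write to a Delta table, whose optimistic concurrency control makes concurrent appends safe.

```python
# Concurrent appends to the same Delta table do not conflict, unlike
# two writers overwriting the same raw file in ADLS Gen2.
(df.write.format("delta")
   .mode("append")
   .save("abfss://container@account.dfs.core.windows.net/shared/table"))
```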
by Galdino (New Contributor II)
  • 4833 Views
  • 3 replies
  • 1 kudos

How to read JSON from BytesIO with PySpark?

I want to read a JSON from an IO variable using PySpark. My code using pandas: io = BytesIO(); ftp.retrbinary('RETR ' + file_name, io.write); io.seek(0); # with pandas: df = pd.read_json(io). What I tried using PySpark, but it doesn't work: io = BytesIO() ftp.retrbinary('...

Latest Reply
Erik_L
Contributor II
  • 1 kudos

Just use pandas and follow with spark.createDataFrame(df)

2 More Replies
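Expanding the reply into a runnable shape (ftp and file_name are from the question; the rest is a straightforward pandas-to-Spark handoff):

```python
# pandas accepts the in-memory file object directly; Spark then takes
# the pandas DataFrame via createDataFrame.
from io import BytesIO
import pandas as pd

io = BytesIO()
ftp.retrbinary("RETR " + file_name, io.write)  # ftp/file_name per the question
io.seek(0)

pdf = pd.read_json(io)
df = spark.createDataFrame(pdf)
```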