Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

draculla1208
by New Contributor
  • 1259 Views
  • 0 replies
  • 0 kudos

Able to read .hdf files but not able to write to .hdf files from worker nodes and save to dbfs

I have a set of .hdf files that I want to distribute and read on worker nodes in a Databricks environment using PySpark. I am able to read the .hdf files on the worker nodes and get the data from them. The next requirement is that now each worker node ...

THIAM_HUATTAN
by Valued Contributor
  • 4045 Views
  • 3 replies
  • 3 kudos

Using R, how do we write csv file to say dbfs:/tmp?

Let us say I already have the data 'TotalData':
write.csv(TotalData, file='/tmp/TotalData.csv', row.names = FALSE)
I do not see any error from the above. When I list files below:
%fs ls /tmp
I do not see any files written there. Why?

Latest Reply
Cedric
Databricks Employee
  • 3 kudos

Hi Thiam, Thank you for reaching out to us. In this case it seems that you have written a file to the OS /tmp and tried to fetch the same folder in DBFS.
Written >> /tmp/TotalData.csv
Reading >> /dbfs/tmp/TotalData.csv
Please try to execute write.csv wit...
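The distinction Cedric describes can be sketched in Python: a dbfs:/ URI corresponds to a path under the /dbfs FUSE mount, while a bare /tmp/... path is the driver's local OS filesystem. The helper below is hypothetical (not a Databricks API), just to illustrate the mapping:

```python
# Hypothetical helper illustrating the OS-path vs. DBFS-path mix-up:
# a file written to the OS path /tmp/... is only visible through DBFS
# if it was written under the /dbfs FUSE mount in the first place.

def dbfs_to_local(path: str) -> str:
    """Translate a dbfs:/ URI to its /dbfs FUSE mount path."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):]
    return path  # already a local OS path

print(dbfs_to_local("dbfs:/tmp/TotalData.csv"))  # /dbfs/tmp/TotalData.csv
```

In R, writing to /dbfs/tmp/TotalData.csv instead of /tmp/TotalData.csv should therefore make the file visible to %fs ls /tmp.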

2 More Replies
Erik
by Valued Contributor III
  • 2822 Views
  • 2 replies
  • 2 kudos

Resolved! Can we have the powerbi connector step into "hive_metastore" automatically?

We are distributing .pbids files providing the connection info to Databricks. They contain options passed to the "Databricks.Catalogs" function implementing the connection to Databricks. It is my understanding that Databricks has made this together wi...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Erik Parmann​, does @Hubert Dudek​'s response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

1 More Replies
data_boy_2022
by New Contributor III
  • 14205 Views
  • 7 replies
  • 3 kudos

Data ingest of csv files from S3 using Autoloader is slow

I have 150k small csv files (~50Mb) stored in S3 which I want to load into a delta table. All CSV files are stored in the following structure in S3:
bucket/folder/name_00000000_00000100.csv
bucket/folder/name_00000100_00000200.csv
This is the code I use ...

Latest Reply
Vidula
Honored Contributor
  • 3 kudos

Hi @Jan R​, hope all is well! Just wanted to check in if you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

6 More Replies
weldermartins
by Honored Contributor
  • 3933 Views
  • 5 replies
  • 13 kudos

Hello everyone, I have a directory with 40 files. File names are divided into prefixes. I need to rename the prefix k3241 according to the name in the...

Hello everyone, I have a directory with 40 files. File names are divided into prefixes. I need to rename the prefix k3241 according to the name in the last prefix. I even managed to insert the csv extension at the end of the file, but renaming files ba...

Latest Reply
Anonymous
Not applicable
  • 13 kudos

Hi @welder martins​, how are you doing? Thank you for posting that question. We are glad you could resolve the issue. Would you want to mark an answer as the best solution? Cheers
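As a rough illustration of the renaming task (assuming names like k3241.DATA.csv, which is a guess at the actual scheme), plain Python can strip the fixed prefix; on DBFS the same loop would use dbutils.fs.mv instead of os.rename:

```python
import os
import tempfile

def rename_by_last_prefix(directory: str) -> None:
    """Strip the 'k3241.' prefix, keeping the rest of the name
    (an assumed naming scheme; adjust the split to the real file names)."""
    for fname in os.listdir(directory):
        if fname.startswith("k3241."):
            os.rename(os.path.join(directory, fname),
                      os.path.join(directory, fname[len("k3241."):]))

# Tiny demonstration in a temporary directory
d = tempfile.mkdtemp()
open(os.path.join(d, "k3241.DATA.csv"), "w").close()
rename_by_last_prefix(d)
print(os.listdir(d))  # ['DATA.csv']
```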

4 More Replies
Constantine
by Contributor III
  • 1775 Views
  • 2 replies
  • 3 kudos

Resolved! Can't view files of different types in databricks

I am reading a Kafka input using Spark Streaming on Databricks and trying to deserialize it. The input is in the form of Thrift. I want to create a file in .thrift format to provide the schema but am unable to do it. Even if I create the file locally and...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi @John Constantine​, just checking if you still need help. If you do, please share as many details and logs as possible, so we will be able to help better.

1 More Replies
StephanieAlba
by Databricks Employee
  • 2778 Views
  • 1 reply
  • 6 kudos

Resolved! Is it possible to use Autoloader with a daily update file structure?

We get new files from a third-party each day. The files could be the same or different. However, each day all csv files arrive in the same dated folder. Is it possible to use Autoloader on this structure? We want each csv file to be a table that gets ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 6 kudos

@Stephanie Rivera​, you can use pathGlobFilter, but you will need a separate Autoloader stream for each type of file.
df_alert = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "binaryFile") \
  .option("pathGlobFilter", "alert.csv") \
  .load...
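pathGlobFilter selects file names with glob-style patterns. Python's fnmatch has similar semantics, so it can be used to preview which names a pattern would select (a rough sketch of the matching idea, not the Autoloader code path itself):

```python
from fnmatch import fnmatch

# Preview which file names a glob pattern would select.
files = ["alert.csv", "alert_2022.csv", "metrics.csv"]
pattern = "alert*.csv"
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # ['alert.csv', 'alert_2022.csv']
```

With one stream per pattern (alert*.csv, metrics*.csv, ...), each file type lands in its own table, as the reply suggests.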

Direo
by Contributor II
  • 12839 Views
  • 2 replies
  • 3 kudos
Latest Reply
User16873043212
New Contributor III
  • 3 kudos

@Direo Direo​, yeah, this is a location inside your DBFS. You have full control over it; Databricks does not delete anything you keep in this location.

1 More Replies
tomsyouruncle
by New Contributor III
  • 20033 Views
  • 14 replies
  • 3 kudos

How do I enable support for arbitrary files in Databricks Repos? Public Preview feature doesn't appear in admin console.

"Arbitrary files in Databricks Repos", allowing not just notebooks to be added to repos, is in Public Preview. I've tried to activate it following the instructions in the above link but the option doesn't appear in Admin Console. Minimum requirements...

Latest Reply
kahing_cheung
Databricks Employee
  • 3 kudos

What environment is your deployment in?

13 More Replies
al_joe
by Contributor
  • 4139 Views
  • 2 replies
  • 0 kudos

Where / how does DBFS store files?

I tried to use %fs head to print the contents of a CSV file used in a training:
%fs head "/mnt/path/file.csv"
but got an error saying cannot head a directory!? Then I did %fs ls on the same CSV file and got a list of 4 files under a directory named as a ...

Latest Reply
User16753725182
Databricks Employee
  • 0 kudos

Hi @Al Jo​, are you still seeing the error while printing the contents of the CSV file?
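What the poster saw is expected behavior: Spark writes a "file.csv" output path as a directory of part files, which is why %fs head fails and %fs ls shows several files. A pure-Python sketch of that layout, simulated in a temp directory rather than DBFS:

```python
import csv
import glob
import os
import tempfile

# Simulate a Spark-style output "file": a directory containing part files.
out_dir = os.path.join(tempfile.mkdtemp(), "file.csv")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["a", "1"], ["b", "2"]])

# Reading the "file" means concatenating its part files.
rows = []
for part in sorted(glob.glob(os.path.join(out_dir, "part-*.csv"))):
    with open(part, newline="") as f:
        rows.extend(csv.reader(f))
print(rows)  # [['a', '1'], ['b', '2']]
```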

1 More Replies
BenzDriver
by New Contributor II
  • 2455 Views
  • 2 replies
  • 1 kudos

Resolved! SQL command FSCK is not found

Hello there, I currently have the problem of deleted files still being in the transaction log when trying to query a delta table. What I found was this statement:
%sql FSCK REPAIR TABLE table_name [DRY RUN]
But using it returned the following error: Error in ...

Latest Reply
RKNutalapati
Valued Contributor
  • 1 kudos

Remove the square brackets and try executing the command:
%sql
FSCK REPAIR TABLE table_name DRY RUN

1 More Replies
MichaelO
by New Contributor III
  • 13008 Views
  • 2 replies
  • 2 kudos

Resolved! Transfer files saved in filestore to either the workspace or to a repo

I built a machine learning model:
lr = LinearRegression()
lr.fit(X_train, y_train)
which I can save to the filestore by:
filename = "/dbfs/FileStore/lr_model.pkl"
with open(filename, 'wb') as f:
    pickle.dump(lr, f)
Ideally, I wanted to save the model ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

The Workspace and Repos are not fully available via DBFS, as they have separate access rights. It is better to use MLflow for your models, as it is like Git but for ML. I think using MLOps you can then put your model into Git as well.
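The pickle pattern from the question works for any path the driver can reach; below is a minimal, self-contained sketch with a stand-in model object and a temp directory in place of /dbfs/FileStore:

```python
import os
import pickle
import tempfile

model = {"coef": [0.5, 1.5]}  # stand-in for the fitted LinearRegression

# On Databricks this path would be e.g. /dbfs/FileStore/lr_model.pkl
filename = os.path.join(tempfile.mkdtemp(), "lr_model.pkl")
with open(filename, "wb") as f:
    pickle.dump(model, f)

with open(filename, "rb") as f:
    restored = pickle.load(f)
print(restored)  # {'coef': [0.5, 1.5]}
```

As Hubert notes, for real models MLflow (e.g. mlflow.sklearn.log_model) is typically the more idiomatic route on Databricks, since it versions the model and its environment.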

1 More Replies
CleverAnjos
by New Contributor III
  • 6245 Views
  • 5 replies
  • 3 kudos

Resolved! Best way of loading several csv files in a table

What would be the best way of loading several files like these into a single table to be consumed?
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv
https://s3.amazonaws...

Latest Reply
CleverAnjos
New Contributor III
  • 3 kudos

Thanks Kaniz, I already have the files. I was asking about the best way to load them.
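In Spark, spark.read.csv accepts a list of paths (or a glob pattern), so all monthly files can be loaded into one DataFrame and written out as a single table. The underlying idea can be sketched in plain Python with the csv module, with temp files standing in for the S3 objects:

```python
import csv
import glob
import os
import tempfile

# Create two small monthly files standing in for the S3 objects.
d = tempfile.mkdtemp()
for month, row in [("10", ["trip1"]), ("11", ["trip2"])]:
    path = os.path.join(d, f"yellow_tripdata_2019-{month}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([row])

# One "table" from many files: iterate a glob and collect all rows.
all_rows = []
for path in sorted(glob.glob(os.path.join(d, "yellow_tripdata_*.csv"))):
    with open(path, newline="") as f:
        all_rows.extend(csv.reader(f))
print(all_rows)  # [['trip1'], ['trip2']]
```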

4 More Replies
wyzer
by Contributor II
  • 3480 Views
  • 2 replies
  • 4 kudos

Resolved! How to show the properties of the folders/files from DBFS?

Hello, how do we show the properties of the folders/files from DBFS? Currently I am using this command:
display(dbutils.fs.ls("dbfs:/"))
But it only shows: path, name, size. How to show these properties?: CreatedBy (Name), CreatedOn (Date), ModifiedBy (Name), Modi...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

The only idea is to use the %sh magic command, but there is no owner name (just root).
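DBFS does not appear to track CreatedBy/ModifiedBy users, so those properties are not available from dbutils.fs.ls. Through the /dbfs FUSE mount, however, standard os.stat gives at least size and modification time; sketched here on a temporary file rather than a real DBFS path:

```python
import datetime
import os
import tempfile

# On Databricks the path would be e.g. /dbfs/tmp/example.txt;
# a temp file is used here so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("hello")

st = os.stat(path)
modified = datetime.datetime.fromtimestamp(st.st_mtime)
print(st.st_size)  # 5 (bytes)
print(modified)    # last modification time as a datetime
```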

1 More Replies