Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ougagagoubu
by New Contributor
  • 956 Views
  • 0 replies
  • 0 kudos

File bug in DBFS? Cannot remove or create a file (table) in the Apache Spark (TM) SQL for Data Analysts Coursera course from Unit 6.2 onwards.

Hello, as the title suggests, I'm not able to remove a file via the shell (%sh rm -f "path"), nor continue the notebooks from 6.2 onwards (6.3, etc.) inside Databricks. I'm using the Databricks Community Edition. While the error message is clear: "...

hoopla
by New Contributor II
  • 5271 Views
  • 3 replies
  • 1 kudos

Unable to copy multiple files from file:/tmp to dbfs:/tmp

I am downloading multiple files by web scraping, and by default they are stored in /tmp. I can copy a single file by providing the filename and path (%fs cp file:/tmp/2020-12-14_listings.csv.gz dbfs:/tmp), but when I try to copy multiple files I get an ...

Latest Reply
hoopla
New Contributor II
  • 1 kudos

Thanks Deepak. This is what I suspected. Hopefully the wildcard feature will be available in the future. Thanks.
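
For anyone landing here: %fs cp does not expand wildcards, but a small loop over dbutils.fs.ls achieves the same effect. A minimal sketch, assuming a Databricks notebook where dbutils is available (the pattern is hypothetical):

```python
import fnmatch

src_dir = "file:/tmp"
dst_dir = "dbfs:/tmp"
pattern = "*.csv.gz"  # hypothetical; adjust to your files

# Copy every local file whose name matches the pattern into DBFS
for info in dbutils.fs.ls(src_dir):
    if fnmatch.fnmatch(info.name, pattern):
        dbutils.fs.cp(info.path, dst_dir + "/" + info.name)
```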

2 More Replies
User16826992724
by New Contributor III
  • 992 Views
  • 1 replies
  • 2 kudos
Latest Reply
User16826992724
New Contributor III
  • 2 kudos

Just like B-tree indices in the traditional EDW world, Z-order indexing can be used on high-cardinality columns such as primary key columns, and for high-cardinality joins such as fact-to-dimension table joins. Z-order indexes can be created only on the ...
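
A minimal sketch of applying Z-ordering, assuming a Delta table named events (hypothetical) that is frequently filtered or joined on the high-cardinality column event_id:

```python
# Rewrites the table's files so rows with nearby event_id values are
# co-located, letting the Delta reader skip files on selective queries
spark.sql("OPTIMIZE events ZORDER BY (event_id)")
```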

User16826992724
by New Contributor III
  • 829 Views
  • 1 replies
  • 4 kudos
Latest Reply
User16826992724
New Contributor III
  • 4 kudos

There are various methods, like using uuid, monotonically_increasing_id(), using row_number() OVER (ORDER BY NULL) AS SK, or using md5() or sha() hashing functions, etc. A detailed discussion of the various options and their pros/cons can be found in this youtu...
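
A minimal sketch of two of the approaches mentioned above, assuming a hypothetical DataFrame df with hypothetical business key columns:

```python
from pyspark.sql import functions as F

# Option 1: unique but non-contiguous 64-bit IDs, computed fully in parallel
df_sk = df.withColumn("sk", F.monotonically_increasing_id())

# Option 2: deterministic hash of the business key columns (names hypothetical)
df_sk2 = df.withColumn("sk", F.md5(F.concat_ws("||", "col_a", "col_b")))
```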

morganmazouchi
by New Contributor III
  • 5412 Views
  • 7 replies
  • 4 kudos
Latest Reply
Sebastian
Contributor
  • 4 kudos

One way to manage this is to give users only "Can Restart" permission on the cluster, and then use an init script to install libraries at startup, so that users won't install libraries on the fly.
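
A minimal sketch of that init-script approach, assuming DBFS-hosted init scripts; the package and paths below are hypothetical:

```python
# Write an init script that pip-installs pinned libraries at cluster startup,
# then reference dbfs:/databricks/init/install-libs.sh in the cluster config
script = """#!/bin/bash
/databricks/python/bin/pip install requests==2.31.0
"""
dbutils.fs.put("dbfs:/databricks/init/install-libs.sh", script, overwrite=True)
```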

6 More Replies
BeardyMan
by New Contributor III
  • 3987 Views
  • 9 replies
  • 3 kudos

Resolved! MLFlow Serve Logging

When using Azure Databricks and serving a model, we have received requests to capture additional logging. In some instances, they would like to capture input and output or even some of the steps from a pipeline. Is there any way we can extend the lo...

Latest Reply
Dan_Z
Honored Contributor
  • 3 kudos

Another word from a Databricks employee: "You can use the custom model approach, but configuring it is painful. Plus you have to embed every loggable model in the custom model. Another, less intrusive solution would be to have a proxy server do the loggi...
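
A minimal sketch of the custom-model approach referenced above: an mlflow.pyfunc wrapper that records inputs and outputs around the real model. The logging destination here (stdout) is just a placeholder:

```python
import mlflow.pyfunc

class LoggingModel(mlflow.pyfunc.PythonModel):
    """Wraps an existing model and logs every request/response pair."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        print(f"serving input: {model_input}")   # capture the request
        output = self.model.predict(model_input)
        print(f"serving output: {output}")       # capture the response
        return output
```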

8 More Replies
saipujari_spark
by Valued Contributor
  • 1024 Views
  • 1 replies
  • 3 kudos

Delta Optimized Write vs. Repartitioning: which is recommended?

When streaming to a Delta table, both repartitioning on the partition column and optimized write can help to avoid small files. Which is recommended: Delta Optimized Write or repartitioning?

Latest Reply
saipujari_spark
Valued Contributor
  • 3 kudos

Optimized write is recommended over repartitioning for the reasons below.
  • The key part of Optimized Writes is that it is an adaptive shuffle. If you have a streaming ingest use case and input data rates change over time, the adaptive shuffle will a...
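
A minimal sketch of enabling optimized writes on an existing Delta table (the table name events is hypothetical):

```python
# Enable optimized (adaptive-shuffle) writes via a Delta table property
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
```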

Artem_Yevtushen
by New Contributor III
  • 1090 Views
  • 0 replies
  • 2 kudos

Accelerating row-wise Python UDF functions without using Pandas UDF

Problem: Spark will not automatically parallelize UDF operations on smaller/medium dataframes. As a result, Spark will process the UDF as a single, non-parallelized task. For row-wise op...
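
A minimal sketch of the idea being described: repartition a small DataFrame before applying a row-wise Python UDF so the rows are spread across many tasks. All names here are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def slow_transform(value):
    return value.upper()  # stand-in for an expensive row-wise computation

# Without the repartition, a small DataFrame may run as a single task
result = df.repartition(64).withColumn("out", slow_transform("col_a"))
```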

Jack
by New Contributor II
  • 1677 Views
  • 1 replies
  • 0 kudos

Resolved! Creating Pandas Data Frame of Features After Applying Variance Reduction

I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown). Using this data, I have built the following model: from sklearn.model_selection import train_test_split from sklearn.feature_extraction....

  • 1677 Views
  • 1 replies
  • 0 kudos
Latest Reply
Dan_Z
Honored Contributor
  • 0 kudos

This is more of a scikit-learn question than a Databricks question. But poking around, I think VT_reduced.get_support() is probably what you are looking for: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold....
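
A minimal sketch of that suggestion, assuming X is the original pandas DataFrame of features (names and threshold are hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.01)
X_reduced = vt.fit_transform(X)

# get_support() returns a boolean mask over the original columns,
# recovering the names of the features that survived the filter
kept = X.columns[vt.get_support()]
X_reduced_df = pd.DataFrame(X_reduced, columns=kept, index=X.index)
```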

Celia
by New Contributor II
  • 1406 Views
  • 2 replies
  • 1 kudos

How to include a third-party Maven package in an MLflow Model Serving job cluster in Azure Databricks

We are trying to use MLflow Model Serving; this service enables real-time model serving behind a REST API interface and launches a single-node cluster that hosts our model. The issue happens when the single-node cluster tries to get the environment...

Latest Reply
BeardyMan
New Contributor III
  • 1 kudos

Unfortunately, we came across this same issue. We were trying to use MLflow Serving to produce an API that could take text input and pass it through some NLP. In this instance we had installed a Maven package on the cluster, so the experiment would run ...

1 More Replies
Anonymous
by Not applicable
  • 1429 Views
  • 3 replies
  • 19 kudos

Resolved! Welcome back! Please introduce yourself to the community. :)

Hello everyone! My name is Piper and I'm one of the community moderators for Databricks. I'd like to take this opportunity to welcome you to the new Databricks community! I'd also like to ask you to introduce yourself in this thread. We are here to h...

Latest Reply
cconnell
Contributor II
  • 19 kudos

I work mostly with health and medical data, on a contract or project basis. I am located in Bedford, MA and Ogunquit, Maine. I formerly worked at Blue Metal / Insight, which is where I got my start on Databricks. Languages: Python, PySpark, Koalas. http...

2 More Replies
manugarri
by New Contributor II
  • 8462 Views
  • 10 replies
  • 1 kudos

Fuzzy text matching in Spark

I have a list of client-provided data: a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements), but the internal dataset is on HDFS and we use Spark ...

Latest Reply
Sonal
New Contributor II
  • 1 kudos

You can use Zingg, a Spark-based open source tool, for this: https://github.com/zinggAI/zingg
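
As a quick built-in alternative to a dedicated tool, a minimal sketch using Spark's levenshtein() with the small client list broadcast; all DataFrame and column names, and the distance threshold, are hypothetical:

```python
from pyspark.sql import functions as F

# Score every internal name against every client name; the client list is
# small (~10k rows), so broadcasting keeps the cross join cheap
matches = (
    internal_df.crossJoin(F.broadcast(client_df))
    .withColumn("dist", F.levenshtein("company_name", "client_name"))
    .filter(F.col("dist") <= 3)
)
```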

9 More Replies
saniafatimi
by New Contributor II
  • 1004 Views
  • 2 replies
  • 1 kudos

Need guidance on migrating Power BI reports to Databricks

Hi all, I want to import an existing database/tables (say AdventureWorks) to Databricks, and after importing the tables, I want to develop reports on top. I need guidance on this. Can someone give me resources that could help me in doing things end to en...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 1 kudos

@sania fatimi There are several different ways to do this, and it's really going to depend on what your current need is. You could, for example, load the data into the Databricks delta lake and use the Databricks Power BI connector to query the data fr...

1 More Replies
User16830818524
by New Contributor II
  • 1327 Views
  • 3 replies
  • 0 kudos

Resolved! Libraries in Databricks Runtimes

Is it possible to easily determine which libraries and which versions are included in a specific DBR version?

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hello. My name is Piper and I'm one of the community moderators. One of the team members sent this information to me. This should be the correct path to check libraries installed with DBRs: https://docs.databricks.com/release-notes/runtime/8.3ml.html?_...
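
As a complement to the release notes, a minimal sketch for inspecting the Python packages actually installed on a running cluster from a notebook cell:

```python
import importlib.metadata as md

# Print every installed distribution with its version, alphabetically
for dist in sorted(md.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(dist.metadata["Name"], dist.version)
```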

2 More Replies
Rodrigo_Brandet
by New Contributor
  • 2813 Views
  • 3 replies
  • 4 kudos

Resolved! Upload CSV files to Databricks by code (not UI)

Hello everyone. I have a process on Databricks where I need to upload a CSV file manually every day. I would like to know if there is a way to import this data (as pandas in Python, for example) with no need to upload this file manually every day, util...

Latest Reply
-werners-
Esteemed Contributor III
  • 4 kudos

Auto Loader is indeed a valid option, or you can use some kind of ETL tool that fetches the file and puts it somewhere on your cloud provider, like Azure Data Factory or AWS Glue, etc.
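
A minimal sketch of the Auto Loader option, assuming the CSVs land in a cloud storage path; every path and table name below is hypothetical:

```python
# Incrementally pick up any new CSV files that appear under the landing path
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "dbfs:/tmp/csv_schema")
    .load("dbfs:/landing/csv/")
)

# availableNow processes everything new, then stops -- handy for a daily job
(stream.writeStream
    .option("checkpointLocation", "dbfs:/tmp/csv_checkpoint")
    .trigger(availableNow=True)
    .toTable("daily_csv_ingest"))
```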

2 More Replies