Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Forum Posts

Supreme_Auto_Ci
by New Contributor II
  • 2837 Views
  • 4 replies
  • 4 kudos
Latest Reply
rahulroy
New Contributor II
  • 4 kudos

Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. It encompasses the entire data lifecycle, from data acquisition to data exploration, modeling, and...

3 More Replies
Saeid_H
by Contributor
  • 12173 Views
  • 7 replies
  • 8 kudos

What are the practical advantages of Feature Store compared to Delta Lake?

Could someone explain the practical advantages of using a feature store vs. Delta Lake? Apparently they both work in the same manner and the feature store does not provide additional value. However, based on the documentation on the Databricks page, ...

Latest Reply
Anonymous
Not applicable
  • 8 kudos

Hi @Saeid Hedayati, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...

6 More Replies
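
For readers weighing the two, here is a minimal sketch of the Feature Store client (assuming the databricks.feature_store package that ships with the Databricks ML runtime; the table, column, and label names are hypothetical). It shows what the Feature Store adds on top of a plain Delta table: declared primary keys, lineage tracking, and key-based lookups when assembling a training set.

from databricks.feature_store import FeatureStoreClient, FeatureLookup

# Example feature and label DataFrames (hypothetical data).
customer_features_df = spark.createDataFrame(
    [(1, 34, 120.0), (2, 52, 80.5)],
    "customer_id int, age int, avg_spend double",
)
labels_df = spark.createDataFrame([(1, 0), (2, 1)], "customer_id int, churned int")

fs = FeatureStoreClient()

# The feature table is still backed by Delta, but the client records
# primary keys, a description, and lineage for reuse across models.
fs.create_table(
    name="ml.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Aggregated customer features",
)

# Later, features are joined to labels by key lookup instead of hand-written joins.
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=[FeatureLookup(table_name="ml.customer_features", lookup_key="customer_id")],
    label="churned",
)
train_df = training_set.load_df()
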
rendorHaevyn
by New Contributor III
  • 2321 Views
  • 4 replies
  • 0 kudos

Resolved! History of code executed on Data Science & Engineering service clusters

I want to be able to view a listing of any or all of the following: when Notebooks were attached/detached to and from a DS&E cluster; when Notebook code was executed on a DS&E cluster; what Notebook-specific cell code was executed on a DS&E cluster. Is th...

Latest Reply
Atanu
Databricks Employee
  • 0 kudos

From the UI, the best way to check is version control: https://docs.databricks.com/notebooks/notebooks-code.html#version-control. BTW, does this help: https://www.databricks.com/blog/2022/11/02/monitoring-notebook-command-logs-static-analysis-tools.ht...

3 More Replies
anvil
by New Contributor II
  • 889 Views
  • 1 replies
  • 0 kudos

How far do model size and lag impact distributed inference?

Hello! I was wondering how impactful a model's size or inference lag is in a distributed setting. With tools like pandas iterator UDFs or mlflow.pyfunc.spark_udf() we can make it so models are loaded only once per worker, so I would tend to say that m...

Latest Reply
youssefmrini
Databricks Employee
  • 0 kudos

Your assumption that minimizing inference lag is more important than minimizing the size of the model in a distributed setting is generally correct. In a distributed environment, models are typically loaded once per worker, as you mentioned, which mea...

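
As a minimal sketch of the pattern described above (the model URI, DataFrame, and column names are hypothetical), mlflow.pyfunc.spark_udf loads the model once per worker and applies it to whole batches, which is why per-row inference lag usually matters more than model size.

import mlflow.pyfunc
from pyspark.sql.functions import struct

features_df = spark.createDataFrame(
    [(1.0, 2.0, 3.0)], "feature_a double, feature_b double, feature_c double"
)

# The model is fetched and deserialized once per worker, then reused for every batch.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/1", result_type="double")

scored_df = features_df.withColumn(
    "prediction", predict_udf(struct("feature_a", "feature_b", "feature_c"))
)
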
anvil
by New Contributor II
  • 2496 Views
  • 3 replies
  • 4 kudos

Are UDFs necessary for applying models from ML libraries at scale?

Hello, I recently finished the "Scalable Machine Learning with Apache Spark" course and saw that scikit-learn models could be applied faster in a distributed manner when used in pandas UDFs or with the mapInPandas() method. Spark MLlib models don't need this k...

Latest Reply
Manoj12421
Valued Contributor II
  • 4 kudos

MLlib is in maintenance mode, and in most cases a UDF is not needed when the model is created with MLlib.

2 More Replies
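
A short sketch of the pattern taught in that course (hypothetical model path and feature names; assumes a scikit-learn model saved with joblib): mapInPandas loads the model once per task and scores whole pandas batches, which Spark MLlib models do not need because they are distributed natively.

from typing import Iterator
import pandas as pd
import joblib

features_df = spark.createDataFrame([(0.1, 0.2), (0.3, 0.4)], "feature_a double, feature_b double")

def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Load the scikit-learn model once per task rather than once per row.
    model = joblib.load("/dbfs/models/sklearn_model.pkl")  # hypothetical path
    for batch in batches:
        batch["prediction"] = model.predict(batch[["feature_a", "feature_b"]])
        yield batch

scored_df = features_df.mapInPandas(
    predict_batches, schema="feature_a double, feature_b double, prediction double"
)
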
jonathan-dufaul
by Valued Contributor
  • 1556 Views
  • 1 replies
  • 0 kudos

How does the data science workflow change in Databricks if you start with a NoSQL database (specifically a document store) instead of a more traditional RDBMS-type source?

I'm sorry if this is a bad question. The tl;dr is: are there any concrete examples of NoSQL data science workflows specifically in Databricks, and if so, what are they? Is it always the case that our end goal is a dataframe? For us we start as a bunch o...

Latest Reply
Nhan_Nguyen
Valued Contributor
  • 0 kudos

Nice sharing, thanks!

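
One common shape such a workflow can take (a sketch only; the path and field names are hypothetical) is to land the document-store records as JSON and flatten the nested structure into a Spark DataFrame, which is usually the tabular end point the question asks about.

from pyspark.sql import functions as F

# Documents exported from the store as JSON lines (hypothetical location).
docs_df = spark.read.json("/mnt/raw/customer_documents/")

# Flatten nested fields and arrays into columns for analysis or modeling.
flat_df = (
    docs_df.select(
        F.col("customer.id").alias("customer_id"),
        F.col("customer.profile.age").alias("age"),
        F.explode_outer("orders").alias("order"),
    )
    .select("customer_id", "age", F.col("order.total").alias("order_total"))
)
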
jdigiovanni
by New Contributor
  • 1374 Views
  • 3 replies
  • 0 kudos

EOFError trying to assign a model using a custom module

I'm in a Data Science Bootcamp, and the final case study includes data preprocessing (done), using a linear regression model on the data, then porting to SQL for visualization. The model build uses custom python code provided as part of the exercise....

Latest Reply
Vidula
Honored Contributor
  • 0 kudos

Hi @Joe DiGiovanni, just wanted to check in if you were able to resolve your issue or do you need more help? We'd love to hear from you. Thanks!

2 More Replies
Dhara
by New Contributor III
  • 17734 Views
  • 9 replies
  • 5 kudos

Access multiple .mdb files using Python

Hi, I want to access multiple .mdb Access files, which are stored in Azure Data Lake Storage (ADLS) or on the Databricks File System, using Python. Could you guide me on how I can achieve this? It would be great if you can share some code snippets ...

Latest Reply
User16764241763
Honored Contributor
  • 5 kudos

@Dhara Mandal Can you please try the below?
# cmd 1
%pip install pandas_access

# cmd 2
import pandas_access as mdb

db_filename = '/dbfs/FileStore/Campaign_Template.mdb'

# Listing the tables.
for tbl in mdb.list_tables(db_filename):
    print(tbl)
...

8 More Replies
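
Extending that snippet, a sketch for looping over several .mdb files (hypothetical paths; assumes the mdbtools system package is installed on the cluster, since pandas_access shells out to it) could look like this:

import glob
import pandas as pd
import pandas_access as mdb

# Files uploaded to DBFS are visible to the driver under /dbfs/...
mdb_files = glob.glob("/dbfs/FileStore/mdb/*.mdb")

frames = []
for db_filename in mdb_files:
    for tbl in mdb.list_tables(db_filename):
        # Read each table into pandas; convert to Spark afterwards if needed.
        df = mdb.read_table(db_filename, tbl)
        df["source_file"] = db_filename
        frames.append(df)

all_data = pd.concat(frames, ignore_index=True)
spark_df = spark.createDataFrame(all_data)
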
User16752240150
by New Contributor II
  • 3139 Views
  • 1 replies
  • 0 kudos

What's the best way to implement long-term data versioning?

I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version. I've seen that you can version datasets by using delta, but the default retention period is around 30 ...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta, as you mentioned, has a feature to do time travel, and by default Delta tables retain the commit history for 30 days. Operations on the history of the table are parallel but will become more expensive as the log size increases. Now, in this case - s...

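
To make the trade-off concrete, here is a sketch of two options (table names are hypothetical; the durations depend on your compliance window): raise the Delta retention properties so time travel reaches far enough back, or freeze the exact snapshot used for a model version with a deep clone.

# Keep the commit log and deleted files long enough for time travel.
spark.sql("""
  ALTER TABLE training_data SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 365 days',
    'delta.deletedFileRetentionDuration' = 'interval 365 days'
  )
""")

# Or archive the snapshot a specific model was trained on as its own table.
spark.sql("""
  CREATE TABLE training_data_model_v3
  DEEP CLONE training_data VERSION AS OF 42
""")
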
User16752239203
by Databricks Employee
  • 993 Views
  • 1 replies
  • 0 kudos

How can I use non-Spark libraries like spaCy with Databricks and Spark?

I have an NLP application that I built on my local machine using spaCy and pandas, but now I would like to scale my application to a large production dataset and utilize the benefits of Spark's distributed compute. How do I import and utilize a librar...

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

It depends on what you mean, but if you're just trying to (say) tokenize and process data with spacy in parallel, then that's trivial. Write a 'pandas UDF' function that expresses how you want to transform data using spacy, in terms of a pandas DataF...

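
A minimal sketch of that pandas UDF approach (assumes the en_core_web_sm spaCy model is installed on the cluster; the DataFrame and column names are hypothetical): the pipeline is loaded once per task and reused across batches.

from typing import Iterator
import pandas as pd
import spacy
from pyspark.sql.functions import pandas_udf

reviews_df = spark.createDataFrame([("The product works well",)], "review_text string")

@pandas_udf("array<string>")
def tokenize(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the spaCy pipeline once per task, then reuse it for every batch of rows.
    nlp = spacy.load("en_core_web_sm")
    for batch in batches:
        yield batch.apply(lambda text: [tok.text for tok in nlp(text)])

tokens_df = reviews_df.withColumn("tokens", tokenize("review_text"))
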
User16826994223
by Honored Contributor III
  • 552 Views
  • 0 replies
  • 0 kudos

Databricks Certified Professional Data Scientist

Does this exam require Databricks-specific or Spark-specific knowledge? No. Test-takers will be assessed on their understanding of the basics of machine learning and data science, how to complete each ...
