- 13561 Views
- 7 replies
- 8 kudos
Could someone explain the practical advantages of using a feature store vs. Delta Lake? Apparently they both work in the same manner and the feature store does not provide additional value. However, based on the documentation on the Databricks page, ...
Latest Reply
Hi @Saeid Hedayati, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...
6 More Replies
- 2895 Views
- 4 replies
- 0 kudos
I want to be able to view a listing of any or all of the following:
- When Notebooks were attached / detached to and from a DS&E cluster
- When Notebook code was executed on a DS&E cluster
- What Notebook specific cell code was executed on a DS&E cluster
Is th...
Latest Reply
Atanu
Databricks Employee
From the UI (https://docs.databricks.com/notebooks/notebooks-code.html#version-control), the best way to check is version control. BTW, do you see if this helps: https://www.databricks.com/blog/2022/11/02/monitoring-notebook-command-logs-static-analysis-tools.ht...
3 More Replies
by
anvil
• New Contributor II
- 1049 Views
- 1 replies
- 0 kudos
Hello! I was wondering how impactful a model's size or inference lag was in a distributed manner. With tools like Pandas Iterator UDFs or mlflow.pyfunc.spark_udf() we can make it so models are loaded only once per worker, so I would tend to say that m...
Latest Reply
Your assumption that minimizing inference lag is more important than minimizing the size of the model in a distributed setting is generally correct. In a distributed environment, models are typically loaded once per worker, as you mentioned, which mea...
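As a rough illustration of the "load once, score many batches" pattern discussed in this thread, here is a minimal plain-Python sketch. It is not the poster's code: `load_model`, `predict_batches`, and the toy doubling "model" are hypothetical stand-ins, and lists of floats stand in for the pandas batches Spark would pass. In Spark, a function of this shape would be typed as `Iterator[pd.Series] -> Iterator[pd.Series]` and wrapped with `pandas_udf`, in which case the load happens once per invocation of the function rather than once per batch.

```python
from typing import Callable, Iterable, Iterator, List

LOAD_COUNT = 0  # counts how many times the (hypothetical) model is loaded


def load_model() -> Callable[[float], float]:
    """Stand-in for an expensive load, e.g. a call like mlflow.pyfunc.load_model(...)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda x: 2.0 * x  # toy "model": doubles its input


def predict_batches(batches: Iterable[List[float]]) -> Iterator[List[float]]:
    """Load the model once, then score every batch handed to this iterator."""
    model = load_model()  # cost paid once per iterator, not once per batch
    for batch in batches:
        yield [model(x) for x in batch]


# Three batches are scored, but the model is loaded a single time.
results = list(predict_batches([[1.0, 2.0], [3.0], [4.0, 5.0]]))
print(results, LOAD_COUNT)
```

Because the load cost is amortized over all batches, per-row inference time tends to dominate as the data grows, which is consistent with the reply above.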
by
anvil
• New Contributor II
- 3185 Views
- 3 replies
- 4 kudos
Hello, I recently finished the "scalable machine learning with apache spark" course and saw that SKLearn models could be applied faster in a distributed manner when used in pandas UDFs or with mapInPandas() method. Spark MLlib models don't need this k...
Latest Reply
MLlib is in maintenance mode, and in most cases a UDF is not used when creating a model.
2 More Replies
- 1994 Views
- 1 replies
- 0 kudos
I'm sorry if this is a bad question. The tl;dr is: are there any concrete examples of NoSQL data science workflows specifically in Databricks, and if so, what are they? Is it always the case that our end goal is a dataframe? For us we start as a bunch o...
- 1646 Views
- 3 replies
- 0 kudos
I'm in a Data Science Bootcamp, and the final case study includes data preprocessing (done), using a linear regression model on the data, then porting to SQL for visualization. The model build uses custom python code provided as part of the exercise....
Latest Reply
Hi @Joe DiGiovanni, just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!
2 More Replies
by
Dhara
• New Contributor III
- 21172 Views
- 9 replies
- 5 kudos
Hi, I wanted to access multiple .mdb Access files which are stored in Azure Data Lake Storage (ADLS) or on the Databricks File System using Python. Could you guide me on how I can achieve this? It would be great if you can share some code snippets ...
Latest Reply
@Dhara Mandal Can you please try the code below?
# cmd 1: install the library in the notebook environment
%pip install pandas_access
# cmd 2: open the .mdb file through the /dbfs mount
import pandas_access as mdb
db_filename = '/dbfs/FileStore/Campaign_Template.mdb'
# Listing the tables.
for tbl in mdb.list_tables(db_filename):
    print(tbl)
...
8 More Replies
- 3836 Views
- 1 replies
- 0 kudos
I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version. I've seen that you can version datasets by using Delta, but the default retention period is around 30 ...
Latest Reply
Delta, as you mentioned, has a time travel feature, and by default Delta tables retain the commit history for 30 days. Operations on the history of the table run in parallel but become more expensive as the log size increases. Now, in this case - s...
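The reply above is cut off, so the following is only one possible approach and not necessarily what the responder recommended: a minimal sketch that builds the SQL for extending a table's retention window through the Delta table properties `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration`. The helper `retention_statement`, the table name `ml.training_data`, and the 365-day window are hypothetical choices for illustration.

```python
def retention_statement(table: str, days: int) -> str:
    """Build an ALTER TABLE statement that extends Delta retention to `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    interval = f"interval {days} days"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{interval}', "
        f"'delta.deletedFileRetentionDuration' = '{interval}')"
    )


# Hypothetical table name and retention window, for illustration only.
stmt = retention_statement("ml.training_data", 365)
print(stmt)

# On a cluster with an active SparkSession named `spark`, the statement would
# be run with spark.sql(stmt). An older snapshot could then be queried with
# SQL such as: SELECT * FROM ml.training_data VERSION AS OF 3
```

Both properties matter for time travel: the first keeps the transaction log entries and the second keeps the underlying data files from being removed by VACUUM. Longer retention increases storage cost, so copying the exact training snapshot to a separate table per model version may be worth considering as an alternative.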
- 1250 Views
- 1 replies
- 0 kudos
I have an NLP application that I built on my local machine using spaCy and pandas, but now I would like to scale my application to a large production dataset and utilize the benefits of Spark's distributed compute. How do I import and utilize a librar...
Latest Reply
It depends on what you mean, but if you're just trying to (say) tokenize and process data with spacy in parallel, then that's trivial. Write a 'pandas UDF' function that expresses how you want to transform data using spacy, in terms of a pandas DataF...
- 739 Views
- 0 replies
- 0 kudos
Databricks Certified Professional Data Scientist: Does this exam require Databricks-specific or Spark-specific knowledge? No. Test-takers will be assessed on their understanding of the basics of machine learning and data science, how to complete each ...