cancel
Showing results for 
Search instead for 
Did you mean: 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

acsmaggart
by New Contributor III
  • 6160 Views
  • 6 replies
  • 3 kudos

`collect()`ing Large Datasets in R

Background: I'm working on a pilot project to assess the pros and cons of using DataBricks to train models using R. I am using a dataset that occupies about 5.7GB of memory when loaded into a pandas dataframe. The data are stored in a delta table in ...

collecting the data using pyspark collecting the data using R
  • 6160 Views
  • 6 replies
  • 3 kudos
Latest Reply
Annapurna_Hiriy
Databricks Employee
  • 3 kudos

@acsmaggart Please try using collect_larger() to collect the larger dataset. This should work. Please refer to the following document for more info on the library.https://medium.com/@NotZacDavies/collecting-large-results-with-sparklyr-8256a0370ec6

  • 3 kudos
5 More Replies
Supreme_Auto_Ci
by New Contributor II
  • 3217 Views
  • 4 replies
  • 4 kudos
  • 3217 Views
  • 4 replies
  • 4 kudos
Latest Reply
rahulroy
New Contributor II
  • 4 kudos

Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. It encompasses the entire data lifecycle, from data acquisition to data exploration, modeling, and...

  • 4 kudos
3 More Replies
Priyag1
by Honored Contributor II
  • 1392 Views
  • 1 replies
  • 9 kudos

***Understanding Databricks Machine Learning Workspace - 1***Databricks Machine Learning helps you simplify and standardize your ML development proce...

***Understanding Databricks Machine Learning Workspace - 1***Databricks Machine Learning helps you simplify and standardize your ML development processes. It is helpful to :Train models either manually or with AutoML.Track training parameters and mo...

  • 1392 Views
  • 1 replies
  • 9 kudos
Latest Reply
samhita
New Contributor III
  • 9 kudos

good initiative

  • 9 kudos
Howard_w
by New Contributor
  • 3924 Views
  • 2 replies
  • 1 kudos

Resolved! Study material ML associate certification

Hi, is there an officially recommended book for the machine learning associate/professional certification? Or any sort of study guide or even third party course? I really struggle to find some study material for this activity.

  • 3924 Views
  • 2 replies
  • 1 kudos
Latest Reply
Priyag1
Honored Contributor II
  • 1 kudos

hello, to get an overview you may find out ML certification course from data bricks academy and refer the related concepts

  • 1 kudos
1 More Replies
Koliya
by New Contributor II
  • 23035 Views
  • 4 replies
  • 7 kudos

The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.

I am running a hugging face model on a GPU cluster (g4dn.xlarge, 16GB Memory, 4 cores). I run the same model in four different notebooks with different data sources. I created a workflow to run one model after the other. These notebooks run fine indi...

  • 23035 Views
  • 4 replies
  • 7 kudos
Latest Reply
fkemeth
New Contributor II
  • 7 kudos

You might accumulate gradients when running your Huggingface model, which typically leads to out-of-memory errors after some iterations. If you use it for inference only, dowith torch.no_grad(): # The code where you apply the model

  • 7 kudos
3 More Replies
anvil
by New Contributor II
  • 1102 Views
  • 1 replies
  • 0 kudos

How far does model size and lag impact distributed inference ?

Hello !I was wondering how impactful a model's size of inference lag was in a distributed manner.With tools like Pandas Iterator UDFs or mlflow.pyfunc.spark_udf() we can make it so models are loaded only once per worker, so I would tend to say that m...

  • 1102 Views
  • 1 replies
  • 0 kudos
Latest Reply
youssefmrini
Databricks Employee
  • 0 kudos

Your assumption that minimizing inference lag is more important than minimizing the size of the model in a distributed setting is generally correct.In a distributed environment, models are typically loaded once per worker, as you mentioned, which mea...

  • 0 kudos
anvil
by New Contributor II
  • 3262 Views
  • 3 replies
  • 4 kudos

Are UDFs necessary for applying models from ML libraries at scale ?

Hello,I recently finished the "scalable machine learning with apache spark" course and saw that SKLearn models could be applied faster in a distributed manner when used in pandas UDFs or with mapInPandas() method. Spark MLlib models don't need this k...

  • 3262 Views
  • 3 replies
  • 4 kudos
Latest Reply
Manoj12421
Valued Contributor II
  • 4 kudos

MlLib is in the maintenance model and udf is not used by creating model in most cases

  • 4 kudos
2 More Replies
isaac_gritz
by Databricks Employee
  • 2460 Views
  • 1 replies
  • 4 kudos

Databricks MLOps Best Practices

Where to find the best practices on MLOps on DatabricksWe recommend checking out the Big Book of MLOps for detailed guidance on MLOps best practices on Databricks including reference architectures.For a deep dive on the Databricks Feature store, we r...

  • 2460 Views
  • 1 replies
  • 4 kudos
Latest Reply
sher
Valued Contributor II
  • 4 kudos

you can check here https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html

  • 4 kudos
boyelana
by Contributor III
  • 4622 Views
  • 7 replies
  • 13 kudos

Resolved! Getting started with Databricks Machine Learning

hello all,I am fairly new to Databricks technologies and I have taken the Lakehouse Fundamentals course but I am interested in Machine Learning technologies. I will appreciate any help with materials and curated free study paths and packs that can he...

  • 4622 Views
  • 7 replies
  • 13 kudos
Latest Reply
Anonymous
Not applicable
  • 13 kudos

https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf is a free book and has some machine learning examples. The way I learned was mostly from the docs, which are good and have good coding examples.

  • 13 kudos
6 More Replies
User16752245767
by Contributor
  • 2211 Views
  • 3 replies
  • 10 kudos

I am Avi, a Solutions Architect at Databricks. We have built an application to demonstrate how AI-capabilities could be easily integrated to deliver n...

I am Avi, a Solutions Architect at Databricks. We have built an application to demonstrate how AI-capabilities could be easily integrated to deliver novel user experiences. The application allows users to submit images and text, and uses these inputs...

  • 2211 Views
  • 3 replies
  • 10 kudos
Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 10 kudos

Hi @Avinash Sooriyarachchi​ Thanks for sharing it.

  • 10 kudos
2 More Replies
User16752245767
by Contributor
  • 1231 Views
  • 0 replies
  • 5 kudos

youtu.be

I'm Avi, a Solutions Architect at Databricks working at the intersection of Data Engineering and Machine Learning.Streaming data processing has moved from niche to mainstream, and deploying machine learning models in such data streams opens up a mult...

  • 1231 Views
  • 0 replies
  • 5 kudos
THIAM_HUATTAN
by Valued Contributor
  • 3053 Views
  • 7 replies
  • 6 kudos

Why this Databricks ML code gets stuck?

I could not paste the code here because of the some word not allowed, so I have to paste it elsewhere.Below is OK:https://justpaste.it/8xcr9But below gets stuck:https://justpaste.it/8nydtand it keeps looping and running...

  • 3053 Views
  • 7 replies
  • 6 kudos
Latest Reply
Vidula
Honored Contributor
  • 6 kudos

Hey @THIAM HUAT TAN​ Hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you....

  • 6 kudos
6 More Replies
Slalom_Tobias
by New Contributor III
  • 3351 Views
  • 4 replies
  • 1 kudos

Resolved! ML Practioner | ml 09 - automl notebook | error on importing databricks.automl

executing the following code...from databricks import automlsummary = automl.regress(train_df, target_col="price", primary_metric="rmse", timeout_minutes=5, max_trials=10)generates the error...ImportError: cannot import name 'automl' from 'databricks...

  • 3351 Views
  • 4 replies
  • 1 kudos
Latest Reply
Krueger156
New Contributor II
  • 1 kudos

I'm happy to see a particularly subject.

  • 1 kudos
3 More Replies
vivoedoardo
by New Contributor II
  • 2804 Views
  • 3 replies
  • 1 kudos

How to track features used and filters in MLFlow?

Hello everyone,We are experimenting with several approaches in a Machine Learning project ( binary classification), and we would like to keep track of those using MLFlow. We are using the feature store to build, store, and retrieve the features, and ...

  • 2804 Views
  • 3 replies
  • 1 kudos
Latest Reply
NathanielN
New Contributor II
  • 1 kudos

 Thanks for the information, I will try to figure it out for more. Keep sharing such informative post keep suggesting such post.

  • 1 kudos
2 More Replies
Vik1
by New Contributor II
  • 4332 Views
  • 4 replies
  • 2 kudos

Resolved! Cluster setup for ML work for Pandas in Spark, and vanilla Python.

My setup:Worker type: Standard_D32d_v4, 128 GB Memory, 32 Cores, Min Workers: 2, Max Workers: 8Driver type: Standard_D32ds_v4, 128 GB Memory, 32 CoresDatabricks Runtime Version: 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12)I ran a snowflake quer...

  • 4332 Views
  • 4 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hey there @Vivek Ranjan​ Checking in. If Joseph's answer helped, would you let us know and mark the answer as best?  It would be really helpful for the other members to find the solution more quickly.Thanks!

  • 2 kudos
3 More Replies
self-employed
by Contributor
  • 2171 Views
  • 1 replies
  • 3 kudos

Resolved! Is the machine learning part of "Apache Sparkâ„¢ Tutorial: Getting Started with Apache Spark on Databricks" missing or no longer available?

I am following the Apache Sparkâ„¢ Tutorial. When I finish the data set part and want to continue the machine learning part. I found the page is empty. The next section after machine learning is fine. So I guess there must be a url mismatching.The url ...

  • 2171 Views
  • 1 replies
  • 3 kudos
Latest Reply
self-employed
Contributor
  • 3 kudos

I clean the cookie and then the link recovers. So it is an issue about cookie.

  • 3 kudos
Joseph_B
by Databricks Employee
  • 1372 Views
  • 0 replies
  • 1 kudos

mlflow.org

2021-09 webinar: Automating the ML Lifecycle With Databricks Machine Learning (Post 2 of 2)Thank you to everyone who joined! You can access the on-demand recording here and the code in this Github repo.We're sharing a subset of the questions asked an...

  • 1372 Views
  • 0 replies
  • 1 kudos
Joseph_B
by Databricks Employee
  • 989 Views
  • 0 replies
  • 1 kudos

docs.databricks.com

2021-09 webinar: Automating the ML Lifecycle With Databricks Machine Learning (post 1 of 2)Thank you to everyone who joined the Automating the ML Lifecycle With Databricks Machine Learning webinar! You can access the on-demand recording here and the ...

  • 989 Views
  • 0 replies
  • 1 kudos
User16752240150
by New Contributor II
  • 3943 Views
  • 1 replies
  • 0 kudos

What's the best way to implement long term data versioning?

I'm a data scientist creating versioned ML models. For compliance reasons, I need to be able to replicate the training data for each model version. I've seen that you can version datasets by using delta, but the default retention period is around 30 ...

  • 3943 Views
  • 1 replies
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Delta, as you mentioned has a feature to do time travel and by default, delta tables retain the commit history for 30 days. Operations on history of the table are parallel but will become more expensive as the log size increasesNow, in this case - s...

  • 0 kudos
User16752239203
by Databricks Employee
  • 1284 Views
  • 1 replies
  • 0 kudos

How can I use Non- Spark related libraries like spacy with Databricks and Spark

I have an NLP application that I build on my local machine using spacy and pandas, but now I would like to scale my application to a large production dataset and utilize the benefits of sparks distributed compute. How do I import and utilize a librar...

  • 1284 Views
  • 1 replies
  • 0 kudos
Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

It depends on what you mean, but if you're just trying to (say) tokenize and process data with spacy in parallel, then that's trivial. Write a 'pandas UDF' function that expresses how you want to transform data using spacy, in terms of a pandas DataF...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 757 Views
  • 0 replies
  • 0 kudos

Databricks Certified Professional Data Scientist  Does this exam require Databricks-specific or Spark-specific knowledge?No. Test-takers will be asse...

Databricks Certified Professional Data Scientist Does this exam require Databricks-specific or Spark-specific knowledge?No. Test-takers will be assessed on their understanding of the basics of machine learning and data science, how to complete each ...

  • 757 Views
  • 0 replies
  • 0 kudos
Joseph_B
by Databricks Employee
  • 2700 Views
  • 1 replies
  • 1 kudos
  • 2700 Views
  • 1 replies
  • 1 kudos
Latest Reply
Joseph_B
Databricks Employee
  • 1 kudos

You can find a lot more info on this at this MLflow product page, including a comparison table at the bottom. I'd summarize that comparison as: Databricks provides three key things in its managed MLflow service.Security: MLflow experiments, models, ...

  • 1 kudos
Joseph_B
by Databricks Employee
  • 3346 Views
  • 1 replies
  • 0 kudos
  • 3346 Views
  • 1 replies
  • 0 kudos
Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

You can find the MLflow version in the runtime release notes, along with a list of every other library provided. E.g., for DBR 8.3 ML, you can look at the release notes for AWS, Azure, or GCP.The MLflow client API (i.e., the API provided by installi...

  • 0 kudos
User16788317466
by Databricks Employee
  • 1915 Views
  • 2 replies
  • 0 kudos

How do I efficiently read image data for a deep learning model?

How do I efficiently read image data for a deep learning model?

  • 1915 Views
  • 2 replies
  • 0 kudos
Latest Reply
Joseph_B
Databricks Employee
  • 0 kudos

Our documentation provides nice examples of preparing image data for training and inference.Training: See docs for AWS, Azure, GCPInference: See reference solution for AWS, Azure, GCP

  • 0 kudos
1 More Replies
Labels