Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

reading multiple csv files using pathos.multiprocessing

Prototype998
New Contributor III

I'm using PySpark with Pathos to read many CSV files in parallel and create one DataFrame per file, but I keep getting an error.

Code for the same:

from pathos.multiprocessing import ProcessingPool

def readCsv(path):
    return spark.read.csv(path, header=True)

csv_file_list = [file[0][5:] for file in dbutils.fs.ls("/databricks-datasets/COVID/coronavirusdataset/") if file[1].endswith(".csv")]

pool = ProcessingPool(2)
results = pool.map(readCsv, csv_file_list)

1 ACCEPTED SOLUTION

Accepted Solutions

Hey @Punit Chauhan, refer to this code:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)  # threads share the driver's Spark session
notebooks = ['dim_1', 'dim_2']
# Run each notebook concurrently, passing its name as "input-data".
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
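
Applied to the CSV question, the same thread-based idea might look like the sketch below. It assumes the error comes from Pathos starting separate worker processes, which cannot use the driver's `spark` session; that is the likely cause but is not confirmed in this thread. The helper name `read_all` and its parameters are hypothetical, not part of any library.

```python
from multiprocessing.pool import ThreadPool


def read_all(paths, reader, workers=2):
    """Apply `reader` to every path using a pool of threads.

    Threads run inside the driver process, so `reader` can use the
    existing `spark` session; nothing has to be pickled and sent to
    another process.
    """
    with ThreadPool(workers) as pool:
        # pool.map blocks until all reads finish and preserves the
        # order of `paths` in the returned list.
        return pool.map(reader, paths)


# Hypothetical usage in a Databricks notebook, where `spark` and
# `csv_file_list` already exist as in the question:
#
#   dfs = read_all(csv_file_list, lambda p: spark.read.csv(p, header=True))
```

Passing the reader in as an argument keeps the helper independent of Spark, so it can be tried with any function before pointing it at real files.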

Rishabh Pandey


5 REPLIES

AmanSehgal
Honored Contributor III

You actually don't need to filter `.csv` files like that.

You can use the `pathGlobFilter` option to read only the files whose names match a glob pattern (note that this is a glob, not a regular expression).

df = spark.read.option("pathGlobFilter","*.csv").csv(upload_path)
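
As a rough sketch of how that option could replace the manual filtering in the question, the helper below wraps the read in a function. The name `read_csv_dir` and its parameters are made up for illustration; only `option("pathGlobFilter", ...)`, `option("header", ...)`, and `csv(...)` are Spark reader calls.

```python
def read_csv_dir(spark, directory, pattern="*.csv"):
    """Read every file in `directory` whose name matches the glob
    `pattern` into a single DataFrame."""
    return (
        spark.read
        .option("pathGlobFilter", pattern)  # glob syntax, e.g. "*.csv"
        .option("header", True)             # first row holds column names
        .csv(directory)
    )


# Hypothetical usage in a notebook where `spark` already exists:
#
#   df = read_csv_dir(spark, "/databricks-datasets/COVID/coronavirusdataset/")
```

This produces one DataFrame for the whole directory, so it only fits when the files share a schema. To keep one DataFrame per file, as in the original question, each path still has to be read separately.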

Vidula
Honored Contributor

Hi @Punit Chauhan,

I hope all is well. Were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Prototype998
New Contributor III

@Ajay Pandey​ @Rishabh Pandey​ 


Thanks, @Rishabh Pandey.
