- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-13-2022 11:51 PM
I'm using PySpark and Pathos to read numerous CSV files and create many DF, but I keep getting this problem.
code for the same:-
from pathos.multiprocessing import ProcessingPool
def readCsv(path):
return spark.read.csv(path,header=True)
csv_file_list = [file[0][5:] for file in dbutils.fs.ls("/databricks-datasets/COVID/coronavirusdataset/") if file[1].endswith(".csv")]
pool = ProcessingPool(2)
results = pool.map(readCsv, csv_file_list)
- Labels:
-
Multiple
-
Multiprocessing
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2022 01:34 AM
hey @Punit Chauhan refer this code
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-14-2022 02:41 AM
You actually don't need to filter `.csv` files like that.
You can use `pathGlobFilter` to do a regex match for selecting files that matches provided regular expression.
df = spark.read.option("pathGlobFilter","*.csv").csv(upload_path)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
09-04-2022 12:00 AM
Hi @Punit Chauhan
Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.
We'd love to hear from you.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2022 01:30 AM
@Ajay Pandey @Rishabh Pandey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2022 01:34 AM
hey @Punit Chauhan refer this code
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2022 01:35 AM
thanks @Rishabh Pandey

