Hello friends,
I have a DataFrame with specific values. I am trying to find specific values out of it.
*I/P -
|ID | text |
|:--|:------|
|1 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))|
|2 | select distinct Col1 as OrderID from Table1 WHERE (Col2 = 1234) AND( (Col3 Like '%XYZ%') OR (Col3 Like '%PQR%'))|
|3 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%'))|
|4 | select distinct Col1 as OrderID, 1 as Duration, 1 as Load from Table1 WHERE (Col3 Like '%PQR%')|
|5 | select distinct Col1 as OrderID from Table1 WHERE ( (Col4 = 'AA') OR (Col4 = 'BB') OR (Col4 = 'CC') OR (Col4 = 'DD')) AND (Col3 Like '%XYZ%')|
|6 | select distinct Col1 as OrderID from Table1 WHERE (Col1 = 1234)|
**O/P -
|ID | text_codes |
|:--|:------------|
|1 | ['ABC', 'DEF', 'EFG'] |
|2 | ['XYZ', 'PQR'] |
|3 | ['MNO', 'PQR'] |
|4 | ['PQR'] |
|5 | ['XYZ'] |
|6 | [] |
steps I have tried -
I got indexes of each code based on "Col3" keyword plus specific number of words, so intermediate output is
* O/P -
|ID | codes_idx |
|:--|:----------|
|1 | 71;94;117 |
|2 | 88;111 |
|3 | 71;94 |
|4 | 95 |
|5 | 141 |
|6 | |
Code for it -
def toString(x):
l = list(x)
ll = [str(i) for i in l]
return ';'.join(ll)
def getCodes(str1):
findStr = '(Col3 Like '
res = [i+14 for i in range(len(str1)) if str1.startswith(findStr, i) ]
return toString(res)
result = df2_query1.rdd.map(lambda x: (x[0], getCodes(x[0]) ) )
result.collect()
df_result = spark.createDataFrame(result)
But when I tried to get substring function applied here.
res_list = []
for i in range(len(str1)):
idxi = str1.startswith(findStr, i)
res_str = substring(str1,idxi+14,3)
res_list.append(res_str)
I get Attribute Error -
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 64.0 failed 4 times, most recent failure: Lost task 0.3 in stage 64.0 (TID 118) (10.139.64.5 executor 1): org.apache.spark.api.python.PythonException: 'AttributeError: 'NoneType' object has no attribute '_jvm'', from <command-1021210642233982>, line 9. Full traceback below:
I am not sure what I am doing wrong here. Any help is appreciated.
Thanks,
Tejas