
Parallelizing a record linkage process in Python

rupesh asked 5 months ago

I have been working on linking two large census datasets and need help. Because the datasets are large, I need to parallelize my record linkage code. I am using the Python Record Linkage Toolkit to compute the feature vectors of record pairs: first blocking, then comparison functions, including a custom comparison function for the marital status column. I have tried several approaches without success, including passing the number of jobs argument (n_jobs) to the comparison object. Here is my code (only a few of the columns are included):

import pandas as pd
import recordlinkage


def marital_compare(s1, s2):
    # Custom comparison for marital status: 1 if the two codes are treated
    # as compatible, 0 otherwise. Defined at module level so that it can be
    # pickled if the comparison is run in worker processes.
    concat = pd.concat([s1, s2], axis=1, ignore_index=True)

    def inner_apply(x):
        val1 = x[0]
        val2 = x[1]
        if val1 == 6 and val2 == 1:
            return 1
        elif val1 == 1 and (val2 == 4 or val2 == 5):
            return 1
        elif val1 == val2:
            return 1
        else:
            return 0

    return concat.apply(inner_apply, axis=1)


def featureVectors():
    df1 = pd.read_csv('deduplicatedfinal71.csv', encoding='utf8')
    df2 = pd.read_csv('deduplicatednew81.csv', encoding='utf8', delimiter=',')

    # Block on surname and given name to generate candidate record pairs.
    pcl = recordlinkage.index.Block(left_on=['PR_NAME_SURN', 'PR_NAME_GN'],
                                    right_on=['SNAMLAST', 'SNAMFRST'])
    pairs = pcl.index(df1, df2)

    compare_cl = recordlinkage.Compare(n_jobs=16)

    # String similarity features on given name and surname.
    compare_cl.string('PR_NAME_GN', 'SNAMFRST',
                      method='jarowinkler', threshold=0.80, label='FirstJW')
    compare_cl.string('PR_NAME_SURN', 'SNAMLAST',
                      method='jarowinkler', threshold=0.80, label='LastJW')
    compare_cl.string('PR_NAME_SURN', 'SNAMLAST',
                      method='levenshtein', threshold=0.80, label='LastNEditD')
    compare_cl.string('PR_NAME_GN', 'SNAMFRST',
                      method='levenshtein', threshold=0.80, label='FirstNEditD')

    compare_cl.compare_vectorized(marital_compare, 'MARITAL_STATUS', 'MARST',
                                  label='MARITAL_STATUS')

    # compute() returns a DataFrame with one row per candidate pair.
    features = compare_cl.compute(pairs, df1, df2)
    features.to_csv('featureVector.csv', index=False)
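
For illustration only, the general idea behind a number-of-jobs argument can be sketched without the toolkit: split the candidate pairs into chunks, compare each chunk in a separate worker process, and concatenate the results in order. This is a minimal sketch using only the standard library, not the toolkit's own implementation. The names `marital_agree`, `split_into_chunks`, `compare_chunk`, and `compare_parallel` are hypothetical, and the toy code pairs are made up rather than taken from the census files.

```python
from concurrent.futures import ProcessPoolExecutor


def marital_agree(val1, val2):
    """Same rule as marital_compare in the question, for one pair of codes."""
    if val1 == 6 and val2 == 1:
        return 1
    if val1 == 1 and val2 in (4, 5):
        return 1
    return 1 if val1 == val2 else 0


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def compare_chunk(chunk):
    """Worker: compute one feature value per candidate pair in the chunk."""
    return [marital_agree(a, b) for a, b in chunk]


def compare_parallel(pairs, n_jobs=4):
    """Compare all pairs with n_jobs worker processes, preserving order."""
    chunks = split_into_chunks(pairs, n_jobs)
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(compare_chunk, chunks))
    return [feature for part in results for feature in part]


if __name__ == "__main__":
    # Hypothetical (MARITAL_STATUS, MARST) code pairs, not real census data.
    toy_pairs = [(6, 1), (1, 4), (1, 5), (2, 2), (3, 1), (1, 6)]
    print(compare_parallel(toy_pairs, n_jobs=2))
```

The worker function is defined at module level because functions sent to worker processes have to be picklable, which nested functions are not. With real data, each chunk would presumably be a slice of the candidate pair index, and the worker would run the comparison on that slice.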

1 Answer
Best Answer
Mannu answered 5 months ago