
Do Spark/Parquet partitions maintain ordering?

john asked 4 months ago

If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:

# imports needed for the UDF below
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# read a csv
df = sql_context.read.csv(input_filename)

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])

# read back the file
df2 = sql_context.read.parquet(output_path)

I am partitioning on a customer_id bucket. When I read back the whole data set, are the partitions guaranteed to be merged back together in the original insertion order?
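
To make the concern concrete, here is a minimal check I could run on a small test file (nothing beyond the df and df2 above is assumed); if the two lists differ, the read-back order was not preserved:

# collect customer_id in order, before the write and after the read-back
original_order = [row['customer_id'] for row in df.collect()]
readback_order = [row['customer_id'] for row in df2.collect()]

# True only if reading the partitioned Parquet preserved the original row order
print(original_order == readback_order)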

Right now, I’m not so sure, so I’m adding a sequence column:

from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn('seq', monotonically_increasing_id())

However, I don’t know if this is redundant.
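
If the ordering is in fact not guaranteed, here is a minimal sketch of how I would restore it on read-back, assuming the seq column above gets written out along with the data:

# write the data including the seq column, partitioned by the hash bucket
df.write.parquet(output_path, partitionBy=['hash'])

# read it back and sort explicitly; orderBy gives a deterministic row order
df2 = sql_context.read.parquet(output_path)
df2 = df2.orderBy('seq')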

1 Answer
Best Answer
Jyoti answered 4 months ago