Intro
With the continuous improvement of Apache Spark, especially its SQL engine, and the emergence of related projects such as Zeppelin notebooks, we are starting to get the data-analysis functionality we used to have on single-machine setups with an RDBMS and libraries like Pandas.
- Complex where clauses
- Sorting a dataset
- Joining a dataset
- Self joins
- Grouping the data
Test 1: Complex where clauses
Expectations:
This is a case where Spark should shine as it can concurrently read a dataset from multiple machines.
Query:
select * from df where Voertuigsoort like '%Personenauto%' and (Merk like 'MERCEDES-BENZ' or Merk like 'BMW') and Inrichting like '%hatchback%'
+-------+--------+-------------+------------+
|       | pandas | spark local | spark yarn |
+-------+--------+-------------+------------+
| 100k  | 0.002  | 1.71        | 10.278     |
| 1m    | 0.006  | 4.91        | 16.219     |
| 5m    | 0.191  | 21.94       | 69.947     |
| 10m   | 0.12   | 43.845      | 137.986    |
| 13.5m | fail   | 64.819      | 188.339    |
+-------+--------+-------------+------------+
Comment:
Pandas was obviously blazing fast for the small datasets, where Spark struggled because of the overhead of its distributed execution machinery.
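For reference, the same filter can be expressed as a pandas boolean mask. This is only a sketch on a tiny, invented stand-in for the RDW vehicle dataset; the column names come from the query above, but the rows are made up:

```python
import pandas as pd

# Tiny synthetic stand-in for the vehicle dataset (rows are invented).
df = pd.DataFrame({
    "Voertuigsoort": ["Personenauto", "Personenauto", "Bedrijfsauto"],
    "Merk": ["BMW", "VOLKSWAGEN", "MERCEDES-BENZ"],
    "Inrichting": ["hatchback", "hatchback", "gesloten opbouw"],
})

# Equivalent of the SQL where clause: substring match on Voertuigsoort
# and Inrichting, exact match on Merk.
result = df[
    df["Voertuigsoort"].str.contains("Personenauto")
    & df["Merk"].isin(["MERCEDES-BENZ", "BMW"])
    & df["Inrichting"].str.contains("hatchback")
]
print(result)
```

On this toy input only the first row (the BMW hatchback) survives the filter.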
Test 2: Sorting the dataset
Expectations:
This is usually a very demanding task, and there is a reason why ETL developers recommend not sorting in memory (e.g. in SSIS) but only in the database engine.
Query:
select * from df order by `Datum tenaamstelling`
+-------+---------+-------------+------------+
|       | pandas  | spark local | spark yarn |
+-------+---------+-------------+------------+
| 100k  | 0.278   | 12.66       | 13.048     |
| 1m    | 4.502   | fail        | fail       |
| 5m    | 49.631  | fail        | fail       |
| 10m   | 115.372 | fail        | fail       |
| 13.5m | fail    | fail        | fail       |
+-------+---------+-------------+------------+
Comment:
I was quite disappointed with Spark in both modes.
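The pandas side of this test boils down to a single `sort_values` call. A minimal sketch, again on invented data (the date strings are hypothetical but use the dataset's YYYYMMDD style):

```python
import pandas as pd

# Invented rows; "Datum tenaamstelling" sorts lexicographically
# because the dates are zero-padded YYYYMMDD strings.
df = pd.DataFrame({
    "Kenteken": ["AA-01-BB", "CC-02-DD", "EE-03-FF"],
    "Datum tenaamstelling": ["20150312", "20120101", "20180620"],
})

# Equivalent of ordering by the registration-date column.
sorted_df = df.sort_values("Datum tenaamstelling")
print(sorted_df["Kenteken"].tolist())
```

Note that in Spark SQL the column name must be backtick-quoted because of the embedded space; single quotes would make it a string literal and the sort would silently do nothing.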
Test 3: Joining the dataset
Expectations:
This is another taxing task, not as brute-force as sorting, but one that can be done efficiently with a good query optimizer. The fact that the join is on a string column will also degrade performance.
Query:
select * from df join join_df on df.Inrichting = join_df.type
+-------+--------+-------------+------------+
|       | pandas | spark local | spark yarn |
+-------+--------+-------------+------------+
| 100k  | 0.148  | 8.04        | 20.797     |
| 1m    | 1.507  | fail        | fail       |
| 5m    | 7.973  | fail        | fail       |
| 10m   | 14.72  | fail        | fail       |
| 13.5m | fail   | fail        | fail       |
+-------+--------+-------------+------------+
Comment:
I saw other blog posts describing a similar problem. The performance is disappointing, but I hope Spark manages to fix this in the future; Spark >= 2.0 supposedly shows great improvements in this area.
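The pandas equivalent of the join is a `merge` on differently named key columns. A sketch with an invented lookup table standing in for `join_df` (its `type` column mirrors the `Inrichting` values; the `category` column is hypothetical):

```python
import pandas as pd

# Invented main table.
df = pd.DataFrame({
    "Kenteken": ["AA-01-BB", "CC-02-DD"],
    "Inrichting": ["hatchback", "sedan"],
})

# Hypothetical small lookup table playing the role of join_df.
join_df = pd.DataFrame({
    "type": ["hatchback", "sedan"],
    "category": ["compact", "family"],
})

# Equivalent of: select * from df join join_df on df.Inrichting = join_df.type
result = df.merge(join_df, left_on="Inrichting", right_on="type")
print(result)
```

Because the string keys match one-to-one here, every row of `df` picks up exactly one `category`.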
Test 4: Self joins
Expectations:
This will be even more demanding, as we are no longer joining against a table with only 100 rows.
Query:
select * from df a join df b on a.Kenteken = b.Kenteken
+-------+--------+-------------+------------+
|       | pandas | spark local | spark yarn |
+-------+--------+-------------+------------+
| 100k  | 0.32   | fail        | fail       |
| 1m    | 3.652  | fail        | fail       |
| 5m    | 19.928 | fail        | fail       |
| 10m   | 38.207 | fail        | fail       |
| 13.5m | fail   | fail        | fail       |
+-------+--------+-------------+------------+
Comment:
Same as for Test 3.
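In pandas the self join is again a `merge`, with suffixes disambiguating the duplicated column names (the role the table aliases play in the SQL above). A sketch on invented rows:

```python
import pandas as pd

# Invented rows; Kenteken (licence plate) is the join key.
df = pd.DataFrame({
    "Kenteken": ["AA-01-BB", "CC-02-DD"],
    "Merk": ["BMW", "MERCEDES-BENZ"],
})

# Self join on Kenteken; suffixes distinguish the two sides,
# like the a/b aliases in the SQL query.
result = df.merge(df, on="Kenteken", suffixes=("_a", "_b"))
print(result)
```

With unique plates each row matches only itself, so the output has as many rows as the input; duplicate keys would multiply the row count.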
Test 5: Grouping the data
Expectations:
This is a task Apache Spark should perform well on, as it can be run efficiently as a Map-Reduce job.
Query:
select count(*) from df group by Merk
+-------+--------+-------------+------------+
|       | pandas | spark local | spark yarn |
+-------+--------+-------------+------------+
| 100k  | 0.014  | 4.767       | 6.925      |
| 1m    | 0.118  | 5.35        | 15.998     |
| 5m    | 0.651  | 26.212      | 64.879     |
| 10m   | 1.243  | 47.805      | 130.908    |
| 13.5m | fail   | 73.143      | 171.963    |
+-------+--------+-------------+------------+
Comment:
Everything went as expected.
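For completeness, the pandas side of the group-by test is a one-liner. A sketch on invented rows:

```python
import pandas as pd

# Invented rows; the real dataset has millions of these.
df = pd.DataFrame({"Merk": ["BMW", "BMW", "MERCEDES-BENZ"]})

# Equivalent of: select count(*) from df group by Merk
counts = df.groupby("Merk").size()
print(counts)
```

`size()` counts rows per group, matching SQL's `count(*)` (it counts NaNs too, unlike `count()` on a column).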