Category Archives: Data science

Advanced data analysis for cBioPortal

As part of my application for the cBioPortal Google Summer of Code position, I made this Jupyter notebook, in which I demonstrated how different studies (the RDBMS equivalent of a table) could be clustered together into types of studies based on the attributes …

Posted in Data Engineering, Data science, Python

Get Spark Classifier metrics using the Confusion Matrix

When using Apache Spark specifically for “binary” classification (i.e. the labels are either 0 or 1), it is possible to use the confusion matrix to get metrics such as the number of true positives, false negatives or true negatives. …
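To make the idea concrete, here is a minimal pure-Python sketch of how such metrics fall out of the four confusion-matrix cells. It does not use Spark's own API; the function names `confusion_counts` and `binary_metrics` and the sample data are made up for illustration, and the input is assumed to be a list of (prediction, label) pairs with values 0 or 1.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count the four confusion-matrix cells from (prediction, label) pairs.

    Hypothetical helper: assumes both values are 0 or 1.
    """
    counts = Counter((int(pred), int(label)) for pred, label in pairs)
    return {
        "tp": counts[(1, 1)],  # predicted 1, label 1
        "fp": counts[(1, 0)],  # predicted 1, label 0
        "tn": counts[(0, 0)],  # predicted 0, label 0
        "fn": counts[(0, 1)],  # predicted 0, label 1
    }


def binary_metrics(tp, fp, tn, fn):
    """Derive common metrics from the four cells, guarding against division by zero."""
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Made-up sample data, for illustration only.
sample = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
cells = confusion_counts(sample)
print(cells)
print(binary_metrics(**cells))
```

In a real Spark job the pairs would come from the model's predictions rather than a local list, but the arithmetic on the four cells is the same.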

Posted in Data Engineering, Data science

How to fix ‘Task not serializable’ issues in Apache Spark

When using the RDD API, you can write Map functions that serve as complex closures. Because each Map function is executed in parallel on one of the executors, the functionality inside the Map phase (i.e. the code) is sent …
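The underlying constraint is that whatever a closure captures has to be serializable before it can be shipped to another process. The sketch below illustrates that idea with Python's `pickle` as a stand-in; it is only an analogy, since Spark on the JVM uses its own serialization, and the helper `can_serialize` and the two example payloads are hypothetical.

```python
import pickle
import threading


def can_serialize(obj):
    """Return True if obj survives pickling (hypothetical helper for illustration)."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Plain data that a map function might capture: serializes fine.
plain_capture = {"threshold": 0.5, "columns": ["a", "b"]}

# A live resource (here a lock) captured alongside the data: cannot be pickled.
resource_capture = {"threshold": 0.5, "lock": threading.Lock()}

print("plain data serializable:", can_serialize(plain_capture))
print("data + lock serializable:", can_serialize(resource_capture))
```

The usual way out, under this analogy, is to capture only plain values and to create connections, locks and similar resources inside the function that runs on the worker.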

Posted in Data Engineering, Data science

How to get into the top 30% of the House Prices: Advanced Regression Kaggle competition with 50 lines of code

This is a quick guide to a very easy way of approaching machine learning competitions. The approach is very general and easily applicable to other competitions. I just want to make it clear that using this approach you will …

Posted in Data science

Quick Keras (with TensorFlow backend) installation on MacBook Pro using conda

1. Prerequisites
You should have the Anaconda package manager (in this case I used Python 3.4). To install TensorFlow:
    conda install -c conda-forge tensorflow
2. Check that TensorFlow was installed
Start a Python session with python and check that TensorFlow is …
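One possible way to do that check without actually importing the package is sketched below. It only uses the standard library, and the helper name `is_installed` is made up for this example; it is not necessarily the check used in the full post.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the packages from this guide are visible to the interpreter.
for name in ("tensorflow", "keras"):
    print(name, "installed:", is_installed(name))
```

Run it with the Python interpreter from the conda environment you installed into; a different interpreter may report the packages as missing.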

Posted in Data science