Advanced data analysis for cBioPortal

As part of my application for the cBioPortal Google Summer of Code position, I created a Jupyter notebook demonstrating how different studies (the RDBMS equivalent of tables) can be clustered into study types based on the attributes they measure (the column names and the contents of those columns).

The goal of the project was to develop a clustering algorithm that recognizes the type of each study and groups all of the studies accordingly. Put another way, the goal was to develop a fuzzy matching algorithm that could be used to further normalize the database.

In the Jupyter notebook I did the following:

  1. First, I parsed data from a REST API that returns JSON into pandas DataFrames.
  2. Then I showed graphically the results of clustering the studies based on the names of their measured attributes (the column names).
  3. Lastly, I implemented similarity scoring functions (fuzzy matching) to score how similar studies are based on the values of their attributes (the contents of the columns), and plotted those scores as a heatmap using the Seaborn package.
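
To make these steps more concrete, here is a minimal sketch of the idea. It is not the code from the notebook: the JSON payload, the study and attribute names, and the helper functions `name_similarity`, `value_similarity` and `similarity_matrix` are all made up for illustration. I am assuming Jaccard similarity for comparing column names and the standard library's `difflib.SequenceMatcher` for fuzzy matching of values, and the sketch uses plain dictionaries so that it runs without any third-party packages.

```python
import json
from difflib import SequenceMatcher

# Hypothetical payload standing in for the JSON returned by a REST API.
# Each study maps attribute names (columns) to a few observed values.
RAW_JSON = """
{
  "study_a": {"AGE": ["61", "47"], "SEX": ["Male", "Female"], "OS_STATUS": ["LIVING"]},
  "study_b": {"AGE": ["58"], "SEX": ["male", "female"], "OS_STATUS": ["DECEASED"], "TUMOR_GRADE": ["G2"]},
  "study_c": {"SAMPLE_TYPE": ["Primary"], "MUTATION_COUNT": ["12"]}
}
"""


def name_similarity(names_a, names_b):
    """Jaccard similarity between two collections of attribute names."""
    set_a, set_b = set(names_a), set(names_b)
    union = set_a | set_b
    # Two empty collections are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


def value_similarity(values_a, values_b):
    """Mean of the best case-insensitive fuzzy match for each value in values_a.

    Note that this score is not symmetric: it asks how well values_a
    is covered by values_b, not the other way around.
    """
    if not values_a or not values_b:
        return 0.0
    best = [
        max(SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in values_b)
        for a in values_a
    ]
    return sum(best) / len(best)


def similarity_matrix(studies):
    """Pairwise name-based similarity for every pair of studies."""
    return {
        row: {col: name_similarity(studies[row], studies[col]) for col in studies}
        for row in studies
    }


studies = json.loads(RAW_JSON)
matrix = similarity_matrix(studies)
```

The nested dictionary in `matrix` can be wrapped in a `pandas.DataFrame` and passed to `seaborn.heatmap` to get a plot similar in spirit to the one in the notebook. A clustering step would then work on this matrix, for example by grouping studies whose pairwise score is above a chosen threshold.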


You can take a look at the result of the process here:





About dorianbg

A Data Engineer based in London, United Kingdom
