Dr. Franziska Horn

Data_ Product_ Strategy_

personal projects & open source

I'm passionate about writing clean and efficient code and like to give back to the community via open-source libraries.

PubVis

PubVis is a web app meant to help scientists with their literature research. Instead of having to search for a specific topic, the landscape of published research can be explored visually and papers similar in content to an article of interest are just a click away. A demo of the app is running here (with PubMed articles about different cancer types) and here (with arXiv articles about machine learning). Further details on the implementation can be found in the corresponding paper.

Classify Me! Why?

To make the decisions of machine learning algorithms more transparent, we can use Layer-wise Relevance Propagation (LRP) to visualize the features that influenced a classification decision. The Classify Me! Why? web app gives an interactive example of what this can look like for a text classification task using scikit-learn [code].
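As a rough illustration of the idea, assume a linear text classifier: in that case a word's contribution to the decision score can be taken as its weight times its feature value. The sketch below uses made-up weights and raw word counts; the names `WEIGHTS`, `BIAS`, and `word_relevances` are hypothetical and this is not the code behind the web app.

```python
from collections import Counter

# Hypothetical weights of a linear classifier (assumption for illustration):
# positive values speak for the class "sports", negative ones for "politics".
WEIGHTS = {"goal": 1.5, "match": 1.0, "vote": -2.0, "election": -1.5}
BIAS = 0.0


def word_relevances(text):
    """Return the decision score and each word's contribution w_i * x_i."""
    counts = Counter(text.lower().split())
    # Words without a learned weight contribute nothing to the score.
    relevances = {word: WEIGHTS.get(word, 0.0) * n for word, n in counts.items()}
    score = BIAS + sum(relevances.values())
    return score, relevances


score, relevances = word_relevances("The match ended after a late goal and a second goal")

# List the words that pushed the decision the most, in either direction.
for word, rel in sorted(relevances.items(), key=lambda item: -abs(item[1])):
    if rel != 0.0:
        print(f"{word}: {rel:+.1f}")
```

Words with large positive or negative contributions are the ones a visualization would highlight; with a real model, the weights would come from the trained classifier and the counts would typically be replaced by TF-IDF values.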

autofeat

autofeat is a Python library with a linear regression and classification model that automatically engineers and then selects non-linear features that can significantly improve the prediction performance of the model. This is especially helpful if you have small datasets and/or want to be able to interpret your model to see how each input feature influences the prediction of the target. Further information can be found in the paper or my talk at the PyCon & PyData 2019 conference in Berlin.
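To give an intuition for what "engineering and selecting non-linear features" means, here is a deliberately tiny sketch in plain Python. It generates a few candidate transformations of a single input and keeps the one that correlates best with the target. This is not autofeat's implementation or API; the names `TRANSFORMS`, `pearson`, and `best_feature` as well as the toy data are assumptions made for this example.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Candidate non-linear transformations (hypothetical selection, inputs must be > 0).
TRANSFORMS = {
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
    "log(x)": math.log,
    "1/x": lambda x: 1.0 / x,
}


def best_feature(xs, ys):
    """Score every candidate feature by |correlation| with the target."""
    scores = {
        name: abs(pearson([f(x) for x in xs], ys))
        for name, f in TRANSFORMS.items()
    }
    return max(scores, key=scores.get), scores


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x ** 2 + 1.0 for x in xs]  # toy target that depends on x^2

best, scores = best_feature(xs, ys)
print(best)
```

Because the selected feature has an explicit formula, a linear model trained on it stays interpretable, which is the motivation described above.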

evolvemb

evolvemb is a small Python library for creating continuously evolving word embeddings to examine word usage changes over time. Check out the paper for more details!

nlputils

nlputils is a Python library for analyzing text documents: it transforms texts into TF-IDF features, compares documents using various similarity measures, classifies them with a k-nearest-neighbors classifier, and visualizes them with t-SNE. Check out the Jupyter notebook with examples!
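The pipeline of TF-IDF features, a similarity measure, and a nearest-neighbor classifier can be sketched in a few lines of plain Python. This is only a minimal illustration under simplifying assumptions (whitespace tokenization, cosine similarity, k = 1, IDF computed over all documents including the query); the function names are hypothetical and do not reflect the nlputils API.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn documents into length-normalized TF-IDF dictionaries."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two length-normalized sparse vectors."""
    return sum(val * v.get(w, 0.0) for w, val in u.items())


def nearest_label(query_vec, train_vecs, labels):
    """1-nearest-neighbor classification with cosine similarity."""
    best = max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))
    return labels[best]


docs = [
    "tumor cells grow in the lung",           # labeled "cancer"
    "neural networks learn word embeddings",  # labeled "ml"
    "lung tumor treatment with new drug",     # unlabeled query
]
labels = ["cancer", "ml"]

vecs = tfidf_vectors(docs)
predicted = nearest_label(vecs[2], vecs[:2], labels)
print(predicted)
```

The query shares the words "lung" and "tumor" with the first document and no words with the second, so its nearest neighbor is the first document.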

textcatvis

textcatvis is a Python library with some tools for the exploratory analysis of text datasets. It can help you better understand a collection of texts by identifying the relevant words of the documents in some classes or clusters and visualizing them in word clouds. Some examples can be found in the corresponding paper (short or long version).

Similarity Encoder (SimEc) and Context Encoder (ConEc)

SimEc is a neural network architecture for learning low-dimensional representations of data points by projecting high-dimensional input data into an embedding space where some given pairwise similarities between the data points are approximated linearly. For further details, have a look at the corresponding paper, my PhD thesis, or this Jupyter notebook with some examples.

ConEc is a variant of SimEc for learning word embeddings. It is a simple but powerful extension of the continuous bag-of-words (CBOW) word2vec model trained with negative sampling and can be used to easily generate embeddings for out-of-vocabulary words and better representations for words with multiple meanings. Further details are described in the corresponding paper.