Projects and Research
Below are some of the projects and research I have worked on that are in the public domain.
Cranial was an in-house project that we developed to run our machine learning models at the Tribune Publishing Company. It provides a framework and toolkit for building distributed machine learning applications out of microservices. The goal was not to provide algorithms and models, but rather to create a consistent pipeline for models built in common toolsets (scikit-learn or TensorFlow, for example) and to streamline the creation of new models and their deployment to production.
The Cranial platform focused on online learning models and grew out of our first machine learning application, a personalized content recommendation service integrated across our websites, mobile apps, and newsletters. Because the platform was generalized, we were able to build our propensity-to-subscribe model on the same code base.
In 2018 we open-sourced Cranial and it is now available under the GPL-3.0 license.
The Cranial project is split into four repositories because the parts of the framework are largely independent:
- cranial-modeling: Building machine learning models
- cranial-datastore: An abstraction layer that keeps ML development independent of any particular underlying database or datastore
- cranial-common: Common utilities used in our machine learning applications
- cranial-messaging: Utilities for moving data between different components and services in our machine learning applications
Identifying Eyewitnesses on Twitter
Contributors: Erika Doggett, Alejandro Cantarero
At Timeline Labs, we were building products to surface breaking news events as soon as they happened. A key part of breaking news quickly is finding Twitter users who witnessed an event, rather than those resharing it or reporters covering the story.
We developed an approach focused around shootings, unusual police activity, and protests. The approach combined three key concepts:
- Filtering Twitter for socio-linguistic markers that indicate the user is an eyewitness
- Combining that eyewitness identification with context-setting words that identify the event as a shooting, police activity, or protest
- Applying a spatio-temporal clustering algorithm to find tweets that occur close to each other in both time and location
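The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method from the paper: the marker and event word lists, the `Tweet` type, the distance and time thresholds, and the simple linkage-based grouping are all hypothetical stand-ins.

```python
import math
from dataclasses import dataclass

# Hypothetical word lists for illustration; not the lexicon used in the paper.
EYEWITNESS_MARKERS = ("i just saw", "i just heard", "outside my", "in front of me")
EVENT_WORDS = ("gunshots", "shooting", "police", "protest")


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


def is_candidate(tweet):
    """Steps 1 and 2: require both an eyewitness marker and an event word."""
    text = tweet.text.lower()
    return (any(m in text for m in EYEWITNESS_MARKERS)
            and any(w in text for w in EVENT_WORDS))


def haversine_km(a, b):
    """Great-circle distance between two tweets in kilometers."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def is_close(a, b, max_km, max_seconds):
    """True when two tweets are near each other in both time and space."""
    return (abs(a.timestamp - b.timestamp) <= max_seconds
            and haversine_km(a, b) <= max_km)


def find_events(tweets, max_km=2.0, max_seconds=1800.0, min_size=2):
    """Step 3: group candidates linked by proximity in space and time."""
    candidates = [t for t in tweets if is_candidate(t)]
    remaining = set(range(len(candidates)))
    events = []
    while remaining:
        seed = remaining.pop()
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in remaining
                    if is_close(candidates[i], candidates[j], max_km, max_seconds)]
            remaining.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        if len(group) >= min_size:
            events.append([candidates[k] for k in sorted(group)])
    return events
```

Calling `find_events` on a list of `Tweet` objects returns groups of candidate eyewitness tweets; a lone candidate with no neighbors in space and time is dropped by the `min_size` threshold.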
A detailed analysis of the process is available here:
E. Doggett and A. Cantarero. Identifying Eyewitness News-worthy Events on Twitter. Proceedings of the 4th International Workshop on Natural Language Processing for Social Media, 2016. Link. ResearchGate.
Because this approach depends on accurate location data for tweets, we also worked on location prediction on social networks, described below. Combining the two approaches allowed us to detect potentially interesting breaking news stories on a national scale.
One piece of future work that was never completed was to integrate this research as a feature generation step in a machine learning model.
Location Prediction on Social Networks
Contributors: Sofia Apreleva, Alejandro Cantarero
At the time we did this research, fewer than 5% of users on Twitter had GPS tags on their tweets. As part of our goal at Timeline Labs to surface breaking and trending news to local broadcast TV stations, increasing the amount of content we had from users in a given city could vastly improve the quality of our product offering.
On social networks, our daily and weekly communication patterns still tend to be very local.
It turns out that for users tweeting in English, one can easily predict the location of a majority of users on the network (over 85%) with a median error of under 10 km, more than accurate enough to place a user's location within a city!
We start by identifying users on the network and all communication patterns between those users (retweets, replies, @mentions). Each user's location can be represented by a probability distribution based on their connections, and we can then seed this graph with users whose locations are known from their GPS coordinates. Performing a label propagation process on the graph then allows us to add locations to other users based on their connections. Full details are available in the conference proceedings.
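As a rough illustration of the label propagation idea, the sketch below spreads known locations across an interaction graph. It is a simplification and not the method from the proceedings: instead of a probability distribution per user, each unlabeled user takes the coordinate-wise median of their neighbors' current locations. The function name, the edge list format, and the iteration limit are assumptions made for this example.

```python
from statistics import median


def propagate_locations(edges, seeds, iterations=5):
    """Spread (lat, lon) labels from seed users across an undirected graph.

    edges: iterable of (user_a, user_b) interactions (retweets, replies, mentions)
    seeds: dict mapping user -> (lat, lon) for users with known GPS locations
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    locations = dict(seeds)
    for _ in range(iterations):
        updates = {}
        for user, friends in neighbors.items():
            if user in seeds:
                continue  # seed locations stay fixed
            known = [locations[f] for f in friends if f in locations]
            if known:
                # Simplified estimate: coordinate-wise median of neighbor locations.
                updates[user] = (median(lat for lat, _ in known),
                                 median(lon for _, lon in known))
        if all(locations.get(u) == loc for u, loc in updates.items()):
            break  # nothing changed, so the labels have converged
        locations.update(updates)
    return locations
```

Each pass uses only the labels from the previous pass, so a location spreads one hop per iteration; users with no path to a seed remain unlabeled.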
In graduate school, one area of research I worked on was Multigrid Methods. These are a class of numerical methods for effectively solving differential equations.
My research group was particularly interested in improving the computational performance and accuracy of solvers for the linear elasticity equations, which are used to simulate deformable objects. We used these techniques in both scientific and computer graphics simulations. Details of our research are below.
We were looking for methods for efficiently solving the equations of linear elasticity in the nearly incompressible limit. Special care must be taken to ensure that multigrid methods maintain their convergence properties in that limit. Additionally, we were interested in how to handle irregular domains with geometric multigrid methods to produce fast and efficient solvers. We handled this by embedding the domain in a regular Cartesian grid.
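For readers unfamiliar with multigrid, the basic idea can be sketched on a much simpler problem than elasticity. The example below is a textbook geometric V-cycle for the 1D Poisson equation -u'' = f with zero boundary values on a regular grid; it is not our research code and does not address the incompressible limit or embedded domains. The smoother (weighted Jacobi), the number of sweeps, and all function names are choices made for this illustration.

```python
import math


def residual(u, f, h):
    """r = f - A u for the standard second-difference operator."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi: damps the high-frequency part of the error."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u


def v_cycle(u, f, h):
    """One V-cycle; len(u) must be of the form 2**k - 1."""
    n = len(u)
    if n == 1:
        return [0.5 * h * h * f[0]]  # exact solve on the coarsest grid
    u = smooth(u, f, h, 2)
    r = residual(u, f, h)
    nc = (n - 1) // 2
    # Restrict the residual to the coarse grid by full weighting.
    rc = [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
          for i in range(nc)]
    # Solve the coarse error equation recursively.
    ec = v_cycle([0.0] * nc, rc, 2.0 * h)
    # Interpolate the coarse correction back to the fine grid.
    e = [0.0] * n
    for i in range(nc):
        e[2 * i + 1] = ec[i]
    for j in range(0, n, 2):
        left = ec[j // 2 - 1] if j > 0 else 0.0
        right = ec[j // 2] if j // 2 < nc else 0.0
        e[j] = 0.5 * (left + right)
    u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, f, h, 2)


def solve_poisson(n=63, cycles=10):
    """Solve -u'' = pi^2 sin(pi x) on (0, 1); the exact solution is sin(pi x)."""
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    f = [math.pi ** 2 * math.sin(math.pi * x) for x in xs]
    u = [0.0] * n
    norms = [max(abs(r) for r in residual(u, f, h))]
    for _ in range(cycles):
        u = v_cycle(u, f, h)
        norms.append(max(abs(r) for r in residual(u, f, h)))
    return xs, u, norms
```

The list `norms` returned by `solve_poisson` records the residual after each cycle. On this model problem the reduction per cycle does not degrade as the grid is refined, which is the property that has to be preserved with extra care in the nearly incompressible case.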
Example of embedding a complex domain in a regular grid and the resulting deformation:
The core of my dissertation work was on Elliptic Inverse Problems. The work was a first step in a long-term research project in our group to build surgical simulation software that could be tailored to an individual's biology.
The specific problem we studied was as follows: we are given data that were generated by an elliptic PDE that has piecewise constant coefficients. Given these data, we attempt to recover both the interface between the regions as well as the coefficients in each region. We looked at specific examples coming from Poisson's equation and linear elasticity.
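For concreteness, one way to write down a problem of this kind in the Poisson case is given below. The notation and the least-squares misfit are assumptions made for illustration; the exact formulation in the paper may differ.

```latex
\begin{aligned}
-\nabla \cdot \bigl( a(x)\, \nabla u(x) \bigr) &= f(x), \qquad x \in \Omega, \\
a(x) &=
\begin{cases}
a_1, & x \in \Omega_1, \\
a_2, & x \in \Omega_2,
\end{cases}
\qquad \Gamma = \partial\Omega_1 \cap \partial\Omega_2, \\
(\Gamma^\ast, a_1^\ast, a_2^\ast) &=
\operatorname*{arg\,min}_{\Gamma,\, a_1,\, a_2}
\tfrac{1}{2} \bigl\| u(\Gamma, a_1, a_2) - d \bigr\|^2 .
\end{aligned}
```

Here $d$ denotes the given data, $u(\Gamma, a_1, a_2)$ is the PDE solution for a candidate interface and pair of coefficients, and $\Omega_1$ may consist of several disconnected shapes.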
Below, we show an example of recovering the boundary between two materials, where the boundary is made up of three distinct shapes. We start on the left with an initial guess of nine circles. As the algorithm runs, the interface changes shape until, on the right, it closely approximates the different material regions.
In the process of recovering the interface, we also find the parameters that define the two different materials.
J. Hegemann, A. Cantarero, C. Richardson, and J. Teran. An Explicit Update Scheme for Inverse Parameter and Interface Estimation of Piecewise Constant, Discontinuous Coefficients in Linear Elliptic PDEs. SIAM Journal on Scientific Computing, 35(2), A1098-A1119, 2013. ResearchGate.
We also developed a second, extremely fast approach to this problem, allowing us to solve at higher resolutions and with more complex geometry. The basic idea is to build functions that are approximately equal to the unknown coefficients and then recover the regions and coefficient values using a piecewise constant segmentation method. Examples of recovered shapes in 2D and 3D are shown below. Note the complexity of the shapes compared to what was produced above.
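The segmentation step can be illustrated with a very small sketch. Assuming we already have samples of a function that approximates a two-valued coefficient, the code below recovers the two regions and one constant per region by alternating between thresholding and averaging. This is a generic two-phase piecewise constant fit with no regularization, not the segmentation method we actually used, and `demo_field` is a made-up test input.

```python
import math


def segment_two_phase(values, max_iter=100):
    """Fit two constants to the samples and label each sample by region.

    Returns (labels, c_low, c_high); labels[i] is True where the sample is
    assigned to the region with the larger constant.
    """
    c_low, c_high = min(values), max(values)
    labels = [False] * len(values)
    for _ in range(max_iter):
        threshold = 0.5 * (c_low + c_high)
        labels = [v > threshold for v in values]
        high = [v for v, is_high in zip(values, labels) if is_high]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        if not high or not low:
            break  # only one region is present
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (c_low, c_high):
            break  # the constants stopped changing
        c_low, c_high = new_low, new_high
    return labels, c_low, c_high


def demo_field(n=50):
    """Hypothetical input: coefficient 5 on [0.3, 0.7], 1 elsewhere, plus a wiggle."""
    truth, approx = [], []
    for i in range(n):
        x = i / (n - 1)
        inside = 0.3 <= x <= 0.7
        truth.append(inside)
        approx.append((5.0 if inside else 1.0) + 0.2 * math.sin(37 * i))
    return truth, approx
```

With `truth, approx = demo_field()`, calling `segment_two_phase(approx)` returns labels that match `truth` along with two constants close to 1 and 5.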