Charlottetown-based Ooka Island, a leader in literacy education technology, generates a lot of data with its learn-to-read program, Ooka Island Adventure. There is so much of it that making sense of the data has been a significant technical challenge for reading researchers.
The typical Ooka Island Adventure student will produce over 10,000 data points, one for every click of the mouse, over the course of at least 80 hours of game play.
Data points collected per month, 2012-2014
Each data point measures, at minimum, a score value, the reading skills in question, the correct answer and the submitted answer, and the timing of the answer. Making sense of this data paves the way for further personalization of the game, leading to more effective learning outcomes for the next generation of young readers. This science-based approach has been a key objective of the program since inception.
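To make the shape of a single data point concrete, here is a minimal sketch of such a record in Python. The class name, field names, and sample values are hypothetical, chosen only to mirror the fields listed above; the actual schema is not described in this article.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataPoint:
    """One recorded click. All field names are hypothetical."""
    user_id: int
    skill: str              # reading skill being exercised
    score: float            # score value awarded for the answer
    correct_answer: str
    submitted_answer: str
    answered_at: datetime   # timing of the answer (assumed to be a timestamp)

    @property
    def is_correct(self) -> bool:
        # A click counts as correct when the submitted answer matches.
        return self.submitted_answer == self.correct_answer


# Made-up sample record for illustration only.
point = DataPoint(
    user_id=1,
    skill="rhyming",
    score=0.0,
    correct_answer="cat",
    submitted_answer="cap",
    answered_at=datetime(2014, 1, 15, 9, 30),
)
print(point.is_correct)
```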
Growth of data points
Resonance Development has been providing technical leadership to Ooka Island since relaunching its Lighthouse Client Portal in Q2 2013. When it came time to get serious about leveraging Ooka Island's cache of big data, Ryan Palmer led the effort to create a scalable and performant system for analyzing it.
Early in development, it was determined that a numerical system for generating inferred student data within a repeatable analysis framework was required. Ryan conducted exploratory work on a big data solution to support the continued processing and analysis of these records on a periodic (nightly) basis. The long-term requirement was for a solution that would scale to many hundreds of millions or perhaps billions of records, as the number of active users grows and as existing users progress through the game.
Several numerical processing and indexing options were considered, each after hands-on experimentation.
Because the source data is held in a relational database (MySQL), a purely SQL approach was attempted first, to calculate sample metrics requested by reading researchers (e.g., instances where the number of attempts within a particular activity exceeded a certain threshold). A scripted approach using the Python Data Analysis Library (pandas), with SQL as input, was also benchmarked.
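As an illustration of the kind of sample metric described above, the following sketch counts attempts per user and activity and keeps only those above a threshold. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table name, column names, threshold, and rows are all hypothetical.

```python
import sqlite3

ATTEMPT_THRESHOLD = 3  # hypothetical cut-off, chosen for illustration

# In-memory SQLite database standing in for the MySQL source data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_points (user_id INTEGER, activity TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO data_points VALUES (?, ?, ?)",
    [
        (1, "rhyming", 0.0), (1, "rhyming", 0.0),
        (1, "rhyming", 0.0), (1, "rhyming", 1.0),
        (1, "blending", 1.0),
        (2, "rhyming", 1.0), (2, "rhyming", 1.0),
    ],
)

# Count attempts per (user, activity) and keep groups above the threshold.
rows = conn.execute(
    """
    SELECT user_id, activity, COUNT(*) AS attempts
    FROM data_points
    GROUP BY user_id, activity
    HAVING COUNT(*) > ?
    ORDER BY user_id, activity
    """,
    (ATTEMPT_THRESHOLD,),
).fetchall()
print(rows)
```

A query like this answers one question well, but each new metric needs its own hand-written query, which is the repeatability problem the article goes on to describe.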
Because both the SQL and scripted approaches performed poorly, indexing options for the aggregated output were investigated. Apache Solr was identified as a potential candidate, since it was already in use elsewhere on this project as a denormalized, high-performance data store capable of at least basic arithmetic calculations. Both the SQL and scripted (Python) approaches suffered from a lack of extensibility and repeatability, and were not easily leveraged across the various domains of metrics requested by reading researchers. Apache Solr offered only a small subset of the mathematical aggregation functions required.
As overarching technical requirements emerged, it was determined that a MapReduce (MR) programming model was justified. Apache Hadoop, in conjunction with supporting scripting libraries (MRJob, Python Data Analysis Library), was therefore investigated as a more fully featured, industry-standard offering to satisfy numerical processing, indexing, and aggregation needs. It is a solution that the team could be confident would scale predictably as needs grew.
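For readers unfamiliar with the MapReduce model, the sketch below imitates its three phases (map, shuffle, reduce) in a single process using only the Python standard library. It is not the MRJob or Hadoop API and not the production job; the record layout, function names, and the two metrics computed are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mapper(record):
    """Map phase: emit one (key, value) pair per data point."""
    yield (record["user_id"], record["activity"]), record["score"]


def reducer(key, scores):
    """Reduce phase: collapse all scores for one key into metrics."""
    scores = list(scores)
    return key, {"attempts": len(scores), "mean_score": mean(scores)}


def run_job(records):
    """Run map, shuffle (group values by key), and reduce in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up records for illustration only.
records = [
    {"user_id": 1, "activity": "rhyming", "score": 0.0},
    {"user_id": 1, "activity": "rhyming", "score": 1.0},
    {"user_id": 2, "activity": "rhyming", "score": 1.0},
]
metrics = run_job(records)
print(metrics[(1, "rhyming")])
```

Because the mapper and reducer only ever see one record or one key at a time, a framework such as Hadoop can distribute the same logic across many machines, which is where the predictable scaling comes from.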
An early sketch of the MapReduce process flow:
Apache Hadoop was implemented using Amazon Elastic MapReduce, a hosted version of Hadoop, and performs MR analysis on over 6 million records every night. This system goes far beyond the scope and ability of the previous systems tested, calculating 3,260 total metrics per user at present. The solution can be scaled up to analyze petabytes' worth of data, with a predictable pay-as-you-go model. The nightly run currently takes less than an hour, and can be executed faster if desired, at additional cost.
Metrics currently calculated revolve around player performance on a per-activity, per-activity-level, and per-activity-level-section basis, as well as more basic information such as time first played, time last played, median age of scores, gender, ESL/special ed status, etc. Once computed by MapReduce, all metrics are stored both as a whole and on a per-user basis in local memory using Python proxy objects, optimized for performance and scalability. They are readily used both by tools for reading researchers and by the online client reporting portal, Ooka Island Lighthouse.
Example chart previously generated with SQL, now using MR data:
To make use of such metrics, a “Metrics Dashboard” was developed for reading researchers to develop and test their hypotheses. Three database models were devised to capture the data needed to describe the conditions and circumstances under which a calculated outcome would be required: Metrics, Slices, and Slice Conditions. This interface resides within the existing database administration interface that was built to manage Ooka Island’s Account, User, and License records.
The Metrics Dash was delivered in early January, and its use will pave the way for an abstracted system of interventions. For each Metric (table column) configured, the computed user performance is presented for the segment of users defined by the Slice and its Slice Conditions (row).
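The relationship between Metrics, Slices, and Slice Conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class fields, the operator names, the averaging aggregate, and the sample users are all hypothetical, and the real models are database-backed rather than in-memory dataclasses.

```python
import operator
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable

# Hypothetical set of comparison operators a condition may use.
OPERATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


@dataclass
class SliceCondition:
    field: str
    op: str
    value: Any

    def matches(self, user: dict) -> bool:
        return OPERATORS[self.op](user[self.field], self.value)


@dataclass
class Slice:
    name: str
    conditions: list

    def select(self, users):
        # A user belongs to the slice only if every condition holds.
        return [u for u in users if all(c.matches(u) for c in self.conditions)]


@dataclass
class Metric:
    name: str
    field: str
    aggregate: Callable = mean


def dashboard(slices, metrics, users):
    """Build a table with one row per Slice and one column per Metric."""
    table = {}
    for s in slices:
        segment = s.select(users)
        table[s.name] = {
            m.name: m.aggregate([u[m.field] for u in segment]) if segment else None
            for m in metrics
        }
    return table


# Made-up per-user metrics for illustration only.
users = [
    {"user_id": 1, "esl": True, "mean_score": 0.5},
    {"user_id": 2, "esl": False, "mean_score": 0.25},
    {"user_id": 3, "esl": True, "mean_score": 1.0},
]
slices = [Slice("ESL students", [SliceCondition("esl", "eq", True)])]
metrics = [Metric("Average score", "mean_score")]
table = dashboard(slices, metrics, users)
print(table)
```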
Metrics Dash showing Slices and Metrics:
As an extension of Slices and Slice Conditions, Interventions and User Interventions models have been developed to take sets of conditions and associate them with a particular action, such as sending an email notification or holding the user back a few levels in the game. Interventions are to be conducted in a statistically appropriate manner. These five interrelated models form the basis of a robust system for managing interventions based on the metrics computed and hypotheses proven.
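A rough sketch of how an Intervention might tie a set of conditions to an action, producing one User Intervention per matching user, is shown below. Everything here is an assumption made for illustration: the field names, the action strings, the use of a plain Python function in place of a Slice and its Slice Conditions, and the sample metrics. The statistical controls mentioned above are not modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Intervention:
    name: str
    condition: Callable[[dict], bool]  # stands in for a Slice and its conditions
    action: str                        # e.g. "send_email" (hypothetical label)


@dataclass
class UserIntervention:
    user_id: int
    intervention: str
    action: str


def plan_interventions(interventions, user_metrics):
    """Return one UserIntervention for each (user, intervention) match."""
    planned = []
    for user in user_metrics:
        for intervention in interventions:
            if intervention.condition(user):
                planned.append(
                    UserIntervention(
                        user["user_id"], intervention.name, intervention.action
                    )
                )
    return planned


# Made-up rule and per-user metrics for illustration only.
interventions = [
    Intervention(
        "Struggling with rhyming",
        lambda u: u["rhyming_attempts"] > 10,
        "send_email",
    )
]
user_metrics = [
    {"user_id": 1, "rhyming_attempts": 14},
    {"user_id": 2, "rhyming_attempts": 4},
]
planned = plan_interventions(interventions, user_metrics)
print(planned)
```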
Beyond implementing interventions discussed through the hypothesis processes in place today, the next step for Ooka Island will be to implement more unsupervised learning systems that are able to generate hypotheses based on the total aggregate of data points in the system rather than from pre-aggregated data, in real time.