Guest post by Jakob Ludewig, Senior Data Scientist, Dalia Research
The transition of the market research industry away from telephone and face-to-face interviews towards online platforms has massively increased the speed and reach of data collection. Modern online survey platforms, such as Dalia Research’s, allow millions of users every day to share their thoughts on politics, social issues, or consumer behavior.
Survey Fraud is Increasing
However, inviting the whole internet to share their thoughts and opinions will inevitably attract fraudsters. In online polling, fraud usually manifests itself as unwanted patterns in the survey responses: users deliberately giving nonsensical or random answers, lying about their demographic profiles, or trying to bypass security measures to answer the same survey multiple times.
If this kind of behavior goes undetected, the data becomes unreliable, which can lead to wrong business or political decisions. Because fraud patterns can be very complex and change constantly, data quality remains one of the most difficult problems in the online research industry.
Building a Model
At Dalia, a task force was created to address these issues and ensure the quality of the data generated through the platform. One of the actions of this task force was to build a preventive mechanism that would predict fraudulent behavior even before the user entered any surveys.
Machine learning was identified as the ideal tool for that purpose. Having already built several successful machine learning solutions (including a whole market research product), the team could draw on past experience. The goal of this particular model was to predict the probability of a user committing fraud in the system. This probability could then be used to decide in real time whether or not to allow a user to enter a survey.
Getting the Data
In survey research, fraud is usually spotted during an in-depth, manual analysis of the survey data. This analysis is performed either directly in Dalia’s own platform or in the data collection platforms of external partners. Depending on the integration with each partner, the results can be communicated in various ways (sent via APIs, CSVs downloaded from websites, email text reports, etc.).
Therefore the first step in this project focused on harmonizing and automating the integration of this data into the internal production databases. This made it easy to match the target variable (whether a user interaction was flagged as fraudulent) with all the information the system holds about that interaction.
The second step was to set up a replication process of this data into Amazon S3 so it could be queried by Amazon Athena. Amazon Athena is a query service that makes it possible to analyze and extract data stored in S3 efficiently using standard SQL. This way the generation of the training data could be sped up and decoupled from the production databases.
Prototyping and Fine-Tuning the Model
Using this training data, a first prototype of the model could be built and evaluated. This process was carried out in Amazon SageMaker, a cloud-based machine learning platform that offers hosted Jupyter Notebook servers, computing resources for model training, automated hyperparameter tuning, and model deployment services.
An XGBoost classifier was chosen as the model, since it is known to show consistently high performance across a wide range of problems. Its training jobs are also very efficient even on large datasets, which enables fast iterations during the prototyping phase and the inclusion of a large number of variables. A built-in version of XGBoost is also available in Amazon SageMaker.
In addition, XGBoost can be configured to have certain important properties, such as producing calibrated probabilities. Calibration means that the predictions of the model can be interpreted in the following sense: if, for example, the model predicts a 10 % probability of fraud for a set of user interactions, then about 10 % of those interactions will turn out to be fraudulent in reality. This is important as it makes the output of the model actionable from a business perspective.
In this setup, a first version of the model was trained on millions of observations with hundreds of features. This model was further optimized using Amazon SageMaker hyperparameter tuning jobs in which hundreds of models were trained with different combinations of model parameters.
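The calibration property described above can be checked on held-out data by comparing predicted probabilities with observed fraud rates. The following is a minimal sketch of such a check, not the procedure used at Dalia: the function name, the equal-width binning, and the default of ten bins are assumptions made for illustration.

```python
from collections import defaultdict


def reliability_table(probs, labels, n_bins=10):
    """Compare mean predicted probability with observed fraud rate per bin.

    probs  -- predicted fraud probabilities in [0, 1]
    labels -- observed outcomes (1 = fraud, 0 = not fraud)
    Returns a list of (bin_index, count, mean_predicted, observed_rate).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")

    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        # Equal-width bins; a probability of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_predicted = sum(p for p, _ in rows) / len(rows)
        observed_rate = sum(y for _, y in rows) / len(rows)
        table.append((idx, len(rows), mean_predicted, observed_rate))
    return table
```

For a well-calibrated model, the mean predicted probability and the observed fraud rate should be close in every bin that contains enough observations.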
Integrating the Solution in the Platform
At this point the model was ready to be used to evaluate user behavior in the platform and block potentially fraudulent activities.
It was agreed with the engineering team that the model was to be served through a REST API accepting a JSON payload with all the model’s predictors. This made it necessary to build a preprocessing step into the API that would parse the JSON payload into suitable input for the XGBoost model. This preprocessing step was implemented as a second model that would be combined with the XGBoost model in a model pipeline. This combined model could then be deployed to an Amazon SageMaker endpoint, which handles the necessary deployment and auto-scaling details.
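To illustrate what such a preprocessing step does, here is a minimal sketch that turns a JSON payload into one row of model input. It is not Dalia's implementation: the predictor names, their order, and the encoding of missing values are hypothetical, and CSV is assumed as the input format for the XGBoost model.

```python
import json

# Hypothetical predictor names and column order; the real model's
# predictors are not listed in this post.
FEATURE_ORDER = ["account_age_days", "surveys_started_24h", "is_mobile"]


def payload_to_csv_row(payload_json, feature_order=FEATURE_ORDER):
    """Parse a JSON payload and emit one CSV row in a fixed column order."""
    payload = json.loads(payload_json)
    values = []
    for name in feature_order:
        value = payload.get(name)
        if value is None:
            # Assumption: a missing predictor is sent as an empty field.
            values.append("")
        elif isinstance(value, bool):
            # bool is checked before int because bool is a subclass of int.
            values.append("1" if value else "0")
        elif isinstance(value, (int, float)):
            values.append(repr(float(value)))
        else:
            raise ValueError(f"unsupported value for {name!r}: {value!r}")
    # Keys that are not in feature_order are ignored.
    return ",".join(values)


row = payload_to_csv_row('{"is_mobile": true, "account_age_days": 12}')
print(row)  # 12.0,,1
```

Fixing the column order in one place matters because the model only sees positions, not names: the row built at inference time has to match the column layout used during training.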
For more details on how to implement such a model pipeline please find the example code provided here.
In parallel, the engineering team developed the functionality to query the endpoint from the data collection platform. This was done using the SageMaker Ruby Gem, which also handles the authentication with the endpoint.
In total, the implementation of the solution took a team of one backend engineer and one data scientist approximately two months from the ideation phase to production deployment. This was a considerable speed-up over previous projects, where model training and deployment were done on a custom-built platform that introduced significant development and maintenance overhead.
Reducing Fraudulent Activities
In the modeling phase, the model already showed very high performance in cross-validation, offering a very favorable tradeoff between false positives (revenue lost due to unnecessary blocking) and false negatives (undetected fraudulent activity).
However, results from the modeling phase do not necessarily translate into real-world gains, as not all aspects of the production environment can be replicated. Therefore the final evaluation of the model had to be done after deployment.
In this particular case, the model was activated alongside two third-party fraud detection solutions, which immediately led to a significant drop in fraudulent activity in the system. To identify which of the three solutions contributed most to this drop, an A/B test was run in which Dalia’s model outperformed both other candidates.
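The tradeoff between false positives and false negatives can be made concrete by assigning a cost to each kind of error and picking the blocking threshold that minimizes the total cost on validation data. The sketch below assumes this cost-based approach for illustration. The post does not describe how the threshold was actually chosen, and the function names and cost values are hypothetical.

```python
def expected_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Total cost of blocking every interaction with probability >= threshold.

    cost_fp -- cost of blocking a legitimate user (false positive)
    cost_fn -- cost of letting a fraudulent user through (false negative)
    """
    cost = 0.0
    for p, y in zip(probs, labels):
        blocked = p >= threshold
        if blocked and y == 0:
            cost += cost_fp
        elif not blocked and y == 1:
            cost += cost_fn
    return cost


def best_threshold(probs, labels, cost_fp, cost_fn, candidates=None):
    """Return the candidate threshold with the lowest total cost."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(
        candidates,
        key=lambda t: expected_cost(probs, labels, t, cost_fp, cost_fn),
    )
```

Because the model's probabilities are calibrated, a threshold chosen this way has a direct interpretation: it is the fraud probability above which blocking a user is expected to cost less than letting them in.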
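One common way to compare the fraud rates of two arms in such an A/B test is a two-proportion z-test. The sketch below shows that test under the assumption that each arm reports a count of fraudulent interactions out of a total. The post does not state which statistical method was used, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(fraud_a, n_a, fraud_b, n_b):
    """Two-sided z-test for a difference in fraud rates between two arms.

    Returns (z, p_value). A negative z means arm A has the lower fraud rate.
    """
    rate_a = fraud_a / n_a
    rate_b = fraud_b / n_b
    # Pooled rate under the null hypothesis that both arms are equal.
    pooled = (fraud_a + fraud_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("fraud rate is 0 or 1 in both arms; test undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With three candidate solutions, the test would be run pairwise, and the significance level should be adjusted for the multiple comparisons.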
What is more, these results were accompanied by very positive feedback from Dalia’s external partners, welcoming the efforts to improve data quality. One of the largest buyers in the market shared that Dalia’s continuous efforts in improving quality brought them “well below [the] platform wide rate” for fraud incidences, measured now around 50 % lower than the average across hundreds of other suppliers.
Since its initial deployment, many improvements have been made to the model and the surrounding infrastructure. One important step was the integration of the Metaflow framework. Using Metaflow’s close integration with AWS services such as AWS Batch and Step Functions, the model training could be further automated while reducing costs. Additionally, significant gains were made in the maintainability, monitoring, and reproducibility of model runs, laying the groundwork for future improvements.