US Presidential Election
This posed more of a challenge than Brexit. There, each side used fairly obvious hashtags; here, a large number of the tweets we collected under the hashtag ‘Election2016’ were neutral, and many attacked a candidate rather than showing support for one.
To address this, we had to build a custom corpus, manually classifying a few hundred tweets and then using the lessons from that to automate the process. Once this corpus was built, we were able to classify tweets as they were collected.
Our analysis here is just a comparison of the number of tweets classified for each candidate and not a prediction of the outcome.
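To illustrate the classify-then-count approach described above, here is a minimal sketch in Python. It is not our production code: the choice of a naive Bayes classifier, the sample tweets, the labels and all function names are assumptions made purely for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-labelled corpus (the real one held a few hundred tweets).
LABELLED_TWEETS = [
    ("Proud to vote for Trump today #Election2016", "trump"),
    ("Make America great again, Trump all the way", "trump"),
    ("I'm with her! Voting Clinton #Election2016", "clinton"),
    ("Hillary Clinton has my vote, stronger together", "clinton"),
    ("Polls open at 7am, go vote #Election2016", "neutral"),
    ("Long queues at the polling station this morning", "neutral"),
]


def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#']+", text.lower())


def train(examples):
    """Count words per label; these counts are the whole model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenise(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the most likely label under naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenise(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(LABELLED_TWEETS)

# Classify tweets as they arrive and keep a running tally per label.
incoming = [
    "Trump all the way",
    "Voting Clinton, stronger together",
    "Polls open at 7am",
]
tally = Counter(classify(tweet, model) for tweet in incoming)
print(tally)
```

The tally is only a count of classified tweets per label, which is all that the comparison above relies on.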
We also used different techniques to increase the number of geotagged tweets, raising it from about 1% to about 5% of all collected tweets. We display those geotagged tweets on a map, which also compares the quantity of classified tweets.
[Charts: count of positive tweets on election day, for Donald Trump and for Hillary Clinton]
How did we do it?
As in the run-up to the EU Referendum, we wanted to analyse mass public opinion on the two presidential candidates. We took the concept from idea to live in 10 working days, with some modifications to the system. The tool is a reusable solution that organisations can customise to their needs: it is low cost, easy to use and quick to deliver.
A note on the presidential election. The results are simply a reflection of the tweets we have collected. It is worth noting that the very reputable FiveThirtyEight website currently produces a very different result to ours. This election has been particularly susceptible to Twitter bots, and this is consistent with the results we have seen. The Twitter data stream is potentially very biased, and the work here is not meant to be a forecast of the election result.
- Planned and agreed the desired outcome. Agreed on feature prioritisation and set deadlines for our minimum viable product (MVP).
- Commenced UX Research to create initial wireframes.
- Set up the core social analytics engine.
- Analysed Twitter, searching for specific keywords (around 15 keywords referenced to the target topic).
- Python script generated for offline classification of tweets to build a corpus.
- Saved data into a database (AWS Aurora DB), including: date of tweet, language, author, content and, of course, sentiment.
- To link the front end web page and the database, we made use of two Amazon Web Services tools: API Gateway and Lambda.
- From API Gateway we created a URL which, when requested, triggers a Lambda function written in Python. The Lambda function queries the database, does some light calculations, and returns the results as JSON.
- These results make up the body of the URL’s response, which the front end then uses to display the figures on the webpage.
- Tweets are re-loaded into the database with the new sentiment analysis to ensure sentiment classifications are relevant for the current election.
- Created interactive front-end UI from UX wireframes.
- Integrated front-end code into SPARCK Live site.
- A couple of minor production changes for performance.
- Linked front-end code to the live system, and tested against a live stream of >300 Tweets per second to test integrity under load.
- Launched SPARCK LAB to the public through http://sparck.io/lab
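The API Gateway and Lambda steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the deployed function: an in-memory SQLite database stands in for Aurora so that the example runs anywhere, and the table name, column names, sample rows and the "light calculation" (a percentage share) are all hypothetical. The returned dictionary follows the shape API Gateway expects from a Lambda proxy integration.

```python
import json
import sqlite3


def get_connection():
    """Stand-in for the Aurora connection (assumption): an in-memory
    SQLite database seeded with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets (created_at TEXT, language TEXT, author TEXT, "
        "content TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
        [
            ("2016-11-08", "en", "user_a", "example tweet 1", "trump"),
            ("2016-11-08", "en", "user_b", "example tweet 2", "clinton"),
            ("2016-11-08", "en", "user_c", "example tweet 3", "clinton"),
            ("2016-11-08", "en", "user_d", "example tweet 4", "neutral"),
        ],
    )
    return conn


def lambda_handler(event, context):
    """Query the database, do a light calculation, and return JSON."""
    conn = get_connection()
    try:
        rows = conn.execute(
            "SELECT sentiment, COUNT(*) FROM tweets "
            "WHERE sentiment IN ('trump', 'clinton') GROUP BY sentiment"
        ).fetchall()
    finally:
        conn.close()

    counts = dict(rows)
    total = sum(counts.values())
    # Light calculation: each candidate's share of the classified tweets.
    shares = {}
    if total:
        shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"counts": counts, "shares": shares}),
    }


# Local usage: call the handler the way Lambda would, with an empty event.
response = lambda_handler({}, None)
print(response["body"])
```

In a real deployment the stand-in connection would be replaced by a MySQL-compatible client pointed at the Aurora endpoint; the handler's return value would stay the same.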
What technologies did we use, and why?
Amazon Web Services (AWS) was the obvious choice for an agile, scalable, secure and cost-effective environment. Before deploying any solution, we put a great deal of planning and evaluation into the architecture, the selection of technology components and all related aspects of integration. Every design was carefully reviewed against AWS best practices and the pillars of the AWS Well-Architected Framework: Security, Reliability, Performance Efficiency and Cost Optimisation. While a full ISO 27001 ISMS (Information Security Management System) would be too much for this simple project, several advanced AWS checklists were used to ensure the AWS environment is configured in an agile, standard, secure and consistent manner, so that good security and monitoring practices can be built upon regardless of future changes. Examples of these best practices include how the AWS account is configured (separate from the billing account), monitored (AWS CloudTrail/CloudWatch and external), documented (Confluence), how access is governed (complex passwords, MFA, central break-glass) and how critical data is stored (separate AWS account/off-site in a secure global repository).
Aurora RDS was chosen because it is an AWS database engine that combines speed and high reliability with simplicity and cost-effectiveness. It delivers up to five times the throughput of a standard MySQL database running on the same hardware (source: Amazon – AWS).
Out-of-the-box monitoring with CloudTrail was enabled at a global level (so that any services used in other regions in future would be covered, avoiding gaps in logging), and the logs are kept securely in an AWS S3 bucket with MFA delete protection and life-cycle archiving. As such, each API call is logged and each log file is validated through the use of digest files, making it possible to detect whether any log file was changed, deleted or modified since its delivery.
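The idea behind digest-based validation can be shown with a simplified sketch. It assumes only that a digest records a SHA-256 hash for each delivered log file; the file names, the dictionary layout and the helper names are hypothetical and do not reproduce the actual CloudTrail digest format. The real mechanism also signs the digest files themselves, which the `aws cloudtrail validate-logs` command checks.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_tampered(digest: dict, delivered: dict) -> list:
    """Return the names of log files that are missing from `delivered`
    or whose current hash differs from the one recorded in `digest`."""
    problems = []
    for name, expected_hash in digest.items():
        data = delivered.get(name)
        if data is None or sha256_hex(data) != expected_hash:
            problems.append(name)
    return problems


# Hypothetical log files as they were at delivery time.
original = {
    "log-001.json": b'{"event": "A"}',
    "log-002.json": b'{"event": "B"}',
}
# The digest stores one hash per log file.
digest = {name: sha256_hex(data) for name, data in original.items()}

# Later, one file has been modified and would be flagged.
modified = dict(original)
modified["log-002.json"] = b'{"event": "tampered"}'
print(find_tampered(digest, modified))
```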
CloudWatch was used for alarms and notifications, as it allows alerting through alarms and the built-in Simple Notification Service integration.
In short, the chosen technologies allow for an agile environment that is secure, high-performing, elastic and extremely cost-effective.