Purple Octopus - using citizen science to discover marine interactions

Processing messages


An image showing the process of converting the Facebook messages into a searchable format.

Social networks such as Facebook, LinkedIn and Flickr offer access to large user communities through integrated software applications and/or a back-end API. Data held on Facebook can be accessed using the Graph API and temporarily cached for processing.

To analyse the problem-solving capabilities of social networks, a pipeline to cache messages from Facebook groups was written in PHP and JavaScript and deployed on a live server. The software makes a request for a group's messages via the Facebook Graph API. The call specifies the maximum number of messages to return (in date order, newest first) and the API returns a JSON-encoded list of messages and metadata, termed here a corpus. The corpus is stored in JSON format in a MySQL database along with data about the group, such as the owner, title, description and privacy settings.
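
The request-and-store step can be sketched as follows. This is a minimal illustration, not the project's code: the original pipeline is written in PHP and JavaScript and stores data in MySQL, whereas this sketch uses Python with an in-memory SQLite database as a stand-in. The `/{group_id}/feed` path, the `limit` and `access_token` parameters, the `data` key in the response and the table layout are all assumptions made for the example.

```python
import json
import sqlite3
from urllib.parse import urlencode

GRAPH_URL = "https://graph.facebook.com"  # assumed base URL


def build_feed_url(group_id: str, access_token: str, limit: int = 500) -> str:
    """Build the URL for one call that requests up to `limit` messages."""
    query = urlencode({"limit": limit, "access_token": access_token})
    return f"{GRAPH_URL}/{group_id}/feed?{query}"


def store_corpus(conn: sqlite3.Connection, group_id: str, corpus_json: str) -> int:
    """Store one raw JSON corpus and return the number of messages it holds."""
    corpus = json.loads(corpus_json)  # raises ValueError if the body is not valid JSON
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus ("
        "id INTEGER PRIMARY KEY, group_id TEXT, body TEXT)"
    )
    # The corpus is kept verbatim as JSON, as described in the text.
    conn.execute(
        "INSERT INTO corpus (group_id, body) VALUES (?, ?)",
        (group_id, corpus_json),
    )
    conn.commit()
    return len(corpus.get("data", []))
```

Keeping the raw JSON rather than unpacking it into columns means the cached corpus can be reprocessed later if the structuring step changes.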

Each corpus contains a pagination link that is used to request the next set of messages from a group. Pagination minimises server load when processing large groups (avoiding timeout issues) and works around Facebook's limit of 500 messages per call. The software iterates through a group's messages from the latest message back to the first message ever posted. The process of storing corpora from a group is termed here a capture.
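
A capture loop of this kind might look like the sketch below. It assumes each corpus is a dictionary with a `data` list and an optional `paging.next` link. The `fetch` callable is a hypothetical placeholder for whatever performs the HTTP request, so the loop can be exercised without network access.

```python
from typing import Callable, Dict, Iterator, Optional


def capture(first_url: str, fetch: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield each corpus of a group, following pagination links until none remain."""
    url: Optional[str] = first_url
    seen = set()
    while url and url not in seen:
        seen.add(url)  # guard against a pagination link that loops back
        corpus = fetch(url)
        if not corpus.get("data"):
            break  # an empty page means the first message has been passed
        yield corpus
        url = corpus.get("paging", {}).get("next")
```

Because the function is a generator, each corpus can be stored as soon as it arrives rather than holding an entire group's history in memory.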

The Facebook API is also used to find the gender of each user, although users do not have to declare a gender or be truthful in their declaration. The data is then anonymised so that users cannot be directly associated with the data held in the database. This use of data is in line with Facebook's Data Use Policy (15/11/2013).
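
The text does not say how the anonymisation is done. One common approach, shown here purely as an assumption, is to replace each user identifier with a keyed hash so that records from the same user can still be linked without storing the identifier itself. The function names, the record layout and the choice of HMAC-SHA256 are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Return a stable keyed hash that stands in for the user identifier."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymise_record(
    user_id: str, gender: Optional[str], text: str, secret_key: bytes
) -> Dict[str, Optional[str]]:
    """Build a database record that carries a pseudonym instead of the user ID."""
    return {
        "author": pseudonymise(user_id, secret_key),
        "gender": gender,  # may be None: users need not declare a gender
        "text": text,
    }
```

Strictly speaking this is pseudonymisation: anyone holding the secret key and a candidate identifier could recompute the hash, so the key would need to be kept apart from the database.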

The community page shows the Facebook groups that have been used in the analysis, chosen because they are likely to contain data about marine species in their messages. Click on the group name to be directed to the group on Facebook itself.

Structuring Data


Structuring the data held in social network messages is essential for aggregating and visualizing the messages in a meaningful way. Ontologies, gazetteers or controlled vocabularies can be used to structure the content, although not without problems. This prototype uses, in the first instance, an ontology of marine species as a hierarchical list of named entities. Each chunk of text from a message thread is scanned for named entities from the ontology and an index table is created in a MySQL database. The prototype stores a cache of the messages; however, a larger implementation would need to store only the index table, with the content remaining on the social network. Query expansion is used for spelling variations; for example, it is common for marine species names to be shortened by abbreviating the genus to its initial, such as "Coryphella browni" to "C. browni".
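
The scanning and query expansion steps can be illustrated with a small sketch. It assumes the ontology is reduced to a flat list of two-word species names and that messages arrive as `(message_id, text)` pairs. The function names are hypothetical, and an in-memory dictionary stands in for the MySQL index table.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def expand(name: str) -> Set[str]:
    """Return the surface forms of a species name, including the abbreviated genus."""
    forms = {name.lower()}
    parts = name.split()
    if len(parts) == 2:
        genus, epithet = parts
        forms.add(f"{genus[0]}. {epithet}".lower())  # "Coryphella browni" -> "c. browni"
    return forms


def build_index(
    messages: Iterable[Tuple[str, str]], species: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each species name to the IDs of messages that mention any of its forms."""
    patterns = {
        name: re.compile(
            "|".join(r"\b" + re.escape(form) + r"\b" for form in sorted(expand(name))),
            re.IGNORECASE,
        )
        for name in species
    }
    index = defaultdict(list)
    for message_id, text in messages:
        for name, pattern in patterns.items():
            if pattern.search(text):
                index[name].append(message_id)
    return dict(index)
```

Only message identifiers are kept against each species, which matches the observation that a larger implementation needs just the index while the content stays on the social network.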

The named entity index is used to aggregate the messages, allowing a user to find all content regarding a particular marine species and other species that are associated with it (what it eats, what it looks similar to, etc.). Additionally, messages containing a named entity with an image attached are used to create a gallery of photographic examples of the species. Only links to the images are stored; the images themselves are hosted on the social network. Each image is credited with the author's name.
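
A rough sketch of the aggregation follows. It assumes each index row records the thread, the species, an optional image link and the author. Treating species that appear in the same thread as "associated" is an assumption made for this illustration; the text does not state how associations are derived. `IndexRow` and both function names are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class IndexRow(NamedTuple):
    thread_id: str
    species: str
    image_url: Optional[str]  # link only; the image stays on the social network
    author: str


def associated_species(rows: List[IndexRow], target: str) -> List[Tuple[str, int]]:
    """Count, per species, the threads it shares with the target species."""
    threads = {r.thread_id for r in rows if r.species == target}
    pairs = {
        (r.thread_id, r.species)
        for r in rows
        if r.thread_id in threads and r.species != target
    }
    return Counter(species for _, species in pairs).most_common()


def gallery(rows: List[IndexRow], target: str) -> List[Tuple[str, str]]:
    """Return (image link, author credit) pairs for messages about the target."""
    return [(r.image_url, r.author) for r in rows if r.species == target and r.image_url]
```

Co-occurrence counts say nothing about the kind of relationship (prey, lookalike and so on); classifying the relationship would need further analysis of the message text.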

An image of Polycera quadrilineata, a UK nudibranch (sea slug)

Challenges


The natural language processing of the message threads is a significant challenge and needs to cope with ill-formed grammar and spelling, contextual referencing and sentiment, for example:

"Is this Coryphella browni or bostoniensis?"

"I don't think this is C. browni."

"I agree with you on that."

Additionally, the ontology and identifying morphology of marine species are in constant flux, meaning identifications previously considered correct may have changed. For example, a significant update to the taxonomic group Chromodorididae in 2012 rendered many static Web resources and books out of date; however, users on social networks frequently correct identifications to the new nomenclature.
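
The contextual referencing in the example thread earlier in this section (a species epithet given without its genus, or with the genus abbreviated) can be partly handled with a simple heuristic. The sketch below is an assumption for illustration rather than the prototype's method: it resolves an abbreviated genus by its initial and a missing genus from the most recently mentioned genus, falling back to the ontology when the epithet belongs to only one genus. The function name and the flat species list are hypothetical.

```python
import re
from typing import Dict, List, Optional


def resolve_mentions(thread: List[str], species: List[str]) -> List[List[str]]:
    """For each message, list the species mentioned, filling in the genus from context."""
    by_epithet: Dict[str, List[str]] = {}
    for name in species:
        genus, epithet = name.split()
        by_epithet.setdefault(epithet.lower(), []).append(genus)

    last_genus: Optional[str] = None
    resolved: List[List[str]] = []
    for text in thread:
        found: List[str] = []
        tokens = re.findall(r"[A-Za-z]+\.?", text)
        for i, token in enumerate(tokens):
            epithet = token.rstrip(".").lower()
            if epithet not in by_epithet:
                continue
            genera = by_epithet[epithet]
            prev = tokens[i - 1] if i > 0 else ""
            genus: Optional[str] = None
            if prev in genera:  # full name, e.g. "Coryphella browni"
                genus = prev
            elif re.fullmatch(r"[A-Z]\.", prev):  # abbreviated genus, e.g. "C. browni"
                genus = next((g for g in genera if g.startswith(prev[0])), None)
            elif last_genus in genera:  # bare epithet: reuse the genus in context
                genus = last_genus
            elif len(genera) == 1:  # bare epithet that is unambiguous in the ontology
                genus = genera[0]
            if genus:
                found.append(f"{genus} {epithet}")
                last_genus = genus
        resolved.append(found)
    return resolved
```

A heuristic like this does nothing for the harder cases in the example: the negation in "I don't think this is C. browni" and the agreement in "I agree with you on that" both require sentiment and discourse analysis rather than name matching.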

Publications


The data obtained using this method is of a very high quality and an initial investigation was presented at HCOMP14 in Nov 2014. Read the full conference paper or the shorter paper describing the applications of this prototype.
Latest Updates
Workshop at University of Essex's Big Data Summer School (Forthcoming, August 2015)
Jon Chamberlain will be giving a workshop on crowdsourcing on social networks at the University of Essex's Big Data Summer School (24-28 Aug, Colchester, UK).

Workshop at NFBR15 (Forthcoming, April 2015)
Jon Chamberlain will be giving a workshop on using social media as primary biodiversity data at the National Forum for Biodiversity Recording (NFBR) conference (23-25 April, Sheffield, UK).

Presentation at RCUK14 (Dec 2014)
An application of the groupsourcing approach to coral reef conservation will be presented at Reef Conservation UK 2014 (RCUK14) at ZSL, London.
Slides

Presentation at HCOMP14 (Nov 2014)
Research and demonstrations were presented to the human computation crowd at HCOMP14 in Pittsburgh.
Full conference paper | Demo paper

Presentation at LAC (Oct 2014)
Research from this application was presented to the Language and Computation (LAC) group at the University of Essex.

Prototype development (Feb 2014)
Following testing of the beta interface and feedback from marine biologists, the demonstration application for the groupsourcing citizen science approach has been released. More functionality is actively being developed, so please come back soon or get in contact about the project. We're always happy to talk about our research!

New publications (April 2013)
The team have a journal paper published on previous crowdsourcing work in text annotation, as well as a book chapter on when and how to use a gaming approach in crowdsourcing and citizen science.

pro-iBiosphere conference (13 Feb 2013)
Project lead Jon Chamberlain makes a presentation about crowdsourcing in biological taxonomy at the pro-iBiosphere conference in Leiden, Holland.