Processing messages
Social networks such as Facebook, LinkedIn and Flickr offer access to large user communities through integrated software applications and/or a back-end API. Data held on Facebook can be accessed using the
Graph API and temporarily cached for processing.
To analyse the problem-solving capabilities of social networks, a pipeline that caches messages from Facebook groups was written in PHP and JavaScript and deployed on a live server. The software requests a group's messages via the Facebook Graph API. The call specifies the maximum number of messages to return (in date order, newest first), and the API returns a JSON-encoded list of messages and metadata, termed here a
corpus. The corpus is stored in JSON format in a MySQL database along with data about the group, such as the owner, title, description and privacy settings.
Each corpus contains a pagination link that is used to request the next set of messages from a group. Pagination minimises server load when processing large groups (avoiding timeouts) and works within Facebook's limit of 500 messages per call. The software iterates through a group's messages from the latest message back to the first message ever posted. The process of storing corpora from a group is termed here a
capture.
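The capture loop described above can be sketched in Python as follows. This is a minimal illustration, not the PHP and JavaScript code used by the prototype: the function name capture, the fetch_page callable and the max_pages safety limit are assumptions made for the example, and the response is assumed to carry its messages under "data" and its pagination link under "paging" and "next". The fetch function is passed in so that the sketch runs without network access.

```python
from typing import Callable, Dict, List, Optional


def capture(fetch_page: Callable[[str], Dict], first_url: str,
            max_pages: Optional[int] = None) -> List[Dict]:
    """Follow pagination links from the newest corpus back to the oldest.

    fetch_page is a hypothetical callable that takes a URL and returns the
    decoded JSON response (a corpus) as a dictionary.
    """
    corpora: List[Dict] = []
    seen = set()  # guards against a pagination link that loops back
    url: Optional[str] = first_url
    while url and url not in seen:
        if max_pages is not None and len(corpora) >= max_pages:
            break
        seen.add(url)
        corpus = fetch_page(url)
        if not corpus.get("data"):
            break  # an empty page means the first message has been passed
        corpora.append(corpus)
        url = corpus.get("paging", {}).get("next")
    return corpora


# In-memory stand-in for the API, newest messages first.
pages = {
    "page1": {"data": [{"id": "3"}, {"id": "2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "1"}], "paging": {}},
}
corpora = capture(pages.__getitem__, "page1")
message_ids = [m["id"] for c in corpora for m in c["data"]]
```

In a real deployment each corpus would be written to the database as soon as it is fetched, so that a timeout part-way through a large group does not lose the pages already retrieved.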
The Facebook API is also used to find the gender of each user, although users do not have to declare a gender or be truthful in their declaration. These data are stored in an anonymised database so that users cannot be directly associated with the data held in it. This use of data is in line with Facebook's
Data Use Policy (15/11/2013).
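The text does not say how the anonymous database is produced. One common approach, shown here purely as an assumption, is to replace each user identifier with a keyed hash before storage, so that records for the same user can still be grouped without holding the identifier itself. The function names pseudonymise and gender_record, the field names and the example key are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest standing in for the user identifier."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def gender_record(user_id: str, gender: Optional[str],
                  secret_key: bytes) -> Dict[str, str]:
    """Build the row to store: a token in place of the user, plus the gender."""
    return {
        "user_token": pseudonymise(user_id, secret_key),
        # Users need not declare a gender, so a missing value is recorded as such.
        "gender": gender or "undeclared",
    }


key = b"example-key"  # hypothetical; a real key would be kept out of the database
record = gender_record("user-42", None, key)
```

Strictly speaking this is pseudonymisation rather than full anonymisation: anyone holding the key could recompute the tokens, so the key needs to be stored separately from the data.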
The
community page shows the Facebook groups that have been used in the analysis, chosen because their messages are likely to contain data about marine species. Click on a group name to be directed to the group on Facebook itself.
Structuring data
Structuring the data held in social network messages is essential for aggregating and visualising the messages in a meaningful way. Ontologies, gazetteers or controlled vocabularies can be used to structure the content, although not without problems. This prototype uses, in the first instance, an
ontology of marine species as a hierarchical list of named entities. Each chunk of text from a message thread is scanned for named entities from the ontology and an index table is created in a MySQL database. The prototype stores a cache of the messages; however, a larger implementation would only need to store the index table, with the content remaining on the social network. Query expansion is used to handle spelling variations. For example, it is common for marine species names to be abbreviated by shortening the genus to its initial, such as "Coryphella browni" to "C. browni".
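The scanning and query expansion steps can be illustrated with a short Python sketch. It is an assumption about how such an index might be built, not the prototype's own code: the names species_pattern and build_index are hypothetical, the ontology is reduced to a flat list of binomial names, and the index is held in a dictionary rather than a MySQL table.

```python
import re
from collections import defaultdict
from typing import Dict, List


def species_pattern(species: str) -> "re.Pattern[str]":
    """Match the full binomial or its abbreviated-genus form.

    "Coryphella browni" is expanded to also match "C. browni" and "C browni".
    """
    genus, epithet = species.split(" ", 1)
    full = re.escape(genus)
    initial = re.escape(genus[0]) + r"\.?"
    return re.compile(rf"\b(?:{full}|{initial})\s+{re.escape(epithet)}\b",
                      re.IGNORECASE)


def build_index(messages: Dict[str, str], ontology: List[str]) -> Dict[str, List[str]]:
    """Map each species name to the ids of the messages that mention it."""
    patterns = {species: species_pattern(species) for species in ontology}
    index = defaultdict(list)
    for message_id, text in messages.items():
        for species, pattern in patterns.items():
            if pattern.search(text):
                index[species].append(message_id)
    return dict(index)


ontology = ["Coryphella browni", "Coryphella bostoniensis"]
messages = {
    "m1": "Is this Coryphella browni or bostoniensis?",
    "m2": "I don't think this is C. browni.",
    "m3": "I agree with you on that.",
}
index = build_index(messages, ontology)
```

Here both the full and the abbreviated mentions of Coryphella browni are indexed, while the bare "bostoniensis" in the first message is missed. The abbreviated form is also ambiguous whenever two genera in the ontology share an initial, so a fuller implementation would need the surrounding thread to disambiguate.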
The named entity index is used to aggregate the messages, allowing a user to find all content regarding a particular marine species and other species that are associated with it (what it eats, what it looks similar to, etc.). Additionally, messages that contain a named entity and have an image attached are used to create a gallery of photographic examples of the species. Only links to the images are stored; the images themselves are hosted on the social network. Each image is credited with the author's name.
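As a rough sketch of the gallery step, the following Python function collects image links and author credits for each indexed species. The function name build_gallery, the "picture" and "author" field names and the example URL are assumptions for illustration; only the link is kept, never the image data.

```python
from typing import Dict, List


def build_gallery(index: Dict[str, List[str]],
                  messages: Dict[str, Dict]) -> Dict[str, List[Dict[str, str]]]:
    """For each species, list the image links and credits of indexed messages."""
    gallery: Dict[str, List[Dict[str, str]]] = {}
    for species, message_ids in index.items():
        entries = [
            {"image_url": messages[mid]["picture"], "credit": messages[mid]["author"]}
            for mid in message_ids
            if messages[mid].get("picture")  # skip messages without an image
        ]
        if entries:
            gallery[species] = entries
    return gallery


example_index = {"Coryphella browni": ["m1", "m2"]}
example_messages = {
    "m1": {"author": "A. Diver", "picture": "https://example.com/photo1.jpg"},
    "m2": {"author": "B. Snorkeller"},  # text-only reply, no image attached
}
gallery = build_gallery(example_index, example_messages)
```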
Challenges
The natural language processing of the message threads is a significant challenge: it needs to cope with ill-formed grammar and spelling, contextual referencing and sentiment. For example:
"Is this Coryphella browni or bostoniensis?"
"I don't think this is C. browni."
"I agree with you on that."
Additionally, the ontology and the identifying morphology of marine species are in constant flux, meaning that identifications previously considered correct may have changed. For example, a significant update to the taxonomic group Chromodorididae in 2012 rendered many static Web resources and books out of date, whereas users on social networks frequently correct identifications to the new nomenclature.
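Returning to the example thread above, one small part of the contextual referencing problem is a bare species epithet that relies on a genus mentioned earlier in the same message. The Python sketch below shows one simple heuristic for this case, offered only as an illustration of the difficulty and not as the prototype's method; the function name resolve_epithets is hypothetical.

```python
import re
from typing import Dict, List, Optional


def resolve_epithets(text: str, ontology: List[str]) -> List[str]:
    """Return species named in text, attaching bare epithets to the most
    recently mentioned genus."""
    by_genus: Dict[str, Dict[str, str]] = {}
    for species in ontology:
        genus, epithet = species.split(" ", 1)
        by_genus.setdefault(genus.lower(), {})[epithet.lower()] = species

    found: List[str] = []
    current_genus: Optional[str] = None
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in by_genus:
            current_genus = word  # remember the genus for later epithets
        elif current_genus and word in by_genus[current_genus]:
            species = by_genus[current_genus][word]
            if species not in found:
                found.append(species)
    return found


names = ["Coryphella browni", "Coryphella bostoniensis"]
first = resolve_epithets("Is this Coryphella browni or bostoniensis?", names)
second = resolve_epithets("I don't think this is C. browni.", names)
```

The first message resolves to both species, but the second yields nothing because the genus appears only as an initial, and the third message in the thread ("I agree with you on that.") names no species at all. Resolving those replies requires carrying context across the whole thread and judging the sentiment of each reply, which is where the real difficulty lies.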
Publications
The data obtained using this method is of a very high quality and an initial investigation was presented at
HCOMP14 in Nov 2014. Read the
full conference paper or the shorter paper describing the
applications of this prototype.