Center Research: Conference Presentation Supplemental Information
Poster presented at the American Public Health Association (APHA) Annual Meeting, November 2022, Boston, MA, USA
Use of qualitative research conventions to improve the accuracy of machine learning for detecting substance-use associated content in Twitter
Researchers have embraced the potential of social media microblogs, including Twitter, to provide real-time glimpses into patterns of behaviors of concern, including substance use, and have explored the effective use of natural language processing (NLP) models to accurately code and categorize large volumes of data. Twitter data present inherent challenges: determining sentiment, interpreting subtle adjustments to language that may convey humor, irony, or sarcasm, and the lack of context associated with the relative brevity of tweets. The purpose of this presentation is to describe and assess the use of qualitative research methods, including data-driven open coding and coder consensus discussions, to develop a codebook and refine a coding process that distinguishes tweets describing actual substance use from similarly constructed tweets that do not. The developed process allowed six coders to categorize nearly 5,500 tweets. Intercoder reliability checks via Krippendorff’s alpha failed to consistently meet the desired level of .80, ranging from .43 to .85. However, after pre-training on a sample of tweets and fine-tuning, a RoBERTa model demonstrated classification accuracy ranging from .88 to .92 across three coder-developed dimensions: substance type, substance use, and intent. This suggests that high-performing models may be trained even when hand coding does not uniformly yield high reliability coefficients, and that results of standard consistency assessments, such as Krippendorff’s alpha, should be interpreted with caution in similar circumstances.
This project aimed to use qualitative content analysis processes to improve the accuracy of natural language processing algorithms, so that real-time content contributed to microblogs such as Twitter could be used to quickly identify trends in substance use. Prior researchers have identified drug-related tweets through keyword lists (Graves et al., 2018), in some instances adding synonyms (Simpson et al., 2018) or identifying words that trigger rejection of tweets (e.g., Daniulaityte et al., 2015; Lamy et al., 2016). Researchers who combined hand coding with machine learning typically described basic screening schemes developed to include or exclude tweets (Mackey et al., 2017; Edo-Osagie et al., 2019). Other than Mackey et al. (2017), who expanded their initial keyword list to include co-occurring terms, authors typically described limited use of hand coding in preparation for machine learning methods.
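The keyword screening schemes described in this prior work can be sketched as a simple include/exclude filter. The keyword and rejection lists below are hypothetical placeholders, not the lists used in the cited studies.

```python
def screen_tweets(tweets, include_terms, reject_terms):
    """Keep tweets containing at least one include term and no reject terms.

    Matching is a simple case-insensitive substring check, as a minimal
    stand-in for the keyword screening approaches described in prior work.
    """
    kept = []
    for text in tweets:
        lowered = text.lower()
        if any(term in lowered for term in include_terms) and \
           not any(term in lowered for term in reject_terms):
            kept.append(text)
    return kept

# Hypothetical example lists (illustration only)
INCLUDE = ["weed", "molly"]
REJECT = ["molly ringwald"]  # a rejection term filters a likely false positive

sample = [
    "smoked weed all weekend",
    "watching a molly ringwald movie tonight",
    "no drug content here",
]
print(screen_tweets(sample, INCLUDE, REJECT))  # ['smoked weed all weekend']
```

Rejection terms of this kind are what allowed the cited studies to discard tweets that match a drug keyword only coincidentally.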
Challenges in the use of microblog data provided as lists of tweets include the lack of ample content and context to facilitate quick, confident categorization via either hand screening or machine learning models. The brevity of tweets presents a challenge, as does the lack of access to real-time responses and exchanges. Twitter is instant and current, so the events that inspire tweets are not always identified, or, if identified, may not be retrievable. The intent of a given tweet may be to comment or inform, but it may alternately be to express humor, sarcasm, or irony, to exaggerate a situation or behavior, or to deliberately communicate inaccurate information. Tweets also at times include unintentional errors that may or may not be easy to identify and interpret. Through the process of considering strategic surveillance of tweets, researchers from public health and computer science agreed that developing and applying a more robust process to identify and classify relevant tweets would facilitate development of more accurate machine learning processes. Therefore, the researchers aimed to use a set of live tweets to develop and refine a clear and comprehensive codebook that would allow non-relevant tweets to be identified and enhance coders’ ability to accurately and comprehensively classify relevant tweets.
For this project, the researchers and stakeholders were interested in multiple aspects of substance-use-related tweets: the substance being used (subcategory “substance”); whether the tweet describes specific use or other drug-related content (subcategory “use”); and the context of use (subcategory “intent”: historical, current, or planned). Tweets used for codebook development consisted of a combination of tweets identified through a list of common drug keywords and randomly selected tweets taken from a two-week period and originating in an urban area in the U.S. An iterative process was used to develop and refine a Twitter codebook, which was eventually expanded into an entire coder training program designed for use with one or more sets of sample tweets. Each subcategory was denoted by an initial or brief acronym to facilitate quicker coding.
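A three-dimension codebook of this kind can be sketched as a simple lookup structure with a validation step; the short codes and category labels below are hypothetical illustrations, not the project’s actual acronyms.

```python
# Minimal sketch of a three-dimension coding scheme. The codes below are
# hypothetical placeholders, not the codebook's actual acronyms.
CODEBOOK = {
    "substance": {"C": "cannabis", "O": "opioid", "S": "stimulant", "X": "other/unclear"},
    "use": {"U": "describes actual use", "N": "drug-related but no use described"},
    "intent": {"H": "historical", "Cu": "current", "P": "planned"},
}

def validate_coding(record):
    """Return True if every coded dimension uses a code defined in CODEBOOK."""
    return all(
        dimension in CODEBOOK and code in CODEBOOK[dimension]
        for dimension, code in record.items()
    )

coded_tweet = {"substance": "C", "use": "U", "intent": "Cu"}
print(validate_coding(coded_tweet))             # True
print(validate_coding({"substance": "Z"}))      # False: "Z" is not a defined code
```

Keeping the legal codes in one structure lets coded records be checked automatically before reliability analysis, so typos do not masquerade as disagreements.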
During a period of several months, nine individuals participated in tweet coding. Coders met weekly to arrive at consensus on ambiguous tweets, and each tweet was coded by at least two individuals. Two- and three-way reliability comparisons via Krippendorff’s alpha were run for the three dimensions on a bi-weekly basis. Krippendorff’s alpha is an assessment of interrater reliability, or agreement among multiple coders, commonly used for categorical data. Sample reliability results included .735 for substance, .657 for use, and .488 for intent. Although Krippendorff (2004) suggested .80 as the desired level, he noted .667 as a reasonable level if “tentative conclusions are acceptable” (p. 430), for instance when the reliability assessment is not related to critical or life-altering decisions. Therefore, given the relatively consistent occurrence of reasonable alpha coefficients, coding and codebook refinement proceeded until approximately 5,500 tweets had been considered by at least two coders each.
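For reference, nominal Krippendorff’s alpha is computed from a coincidence matrix of pairable values within each unit. The sketch below is a minimal pure-Python implementation for categorical labels; the example ratings are illustrative, not project data.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal (categorical) data.

    `units` maps each unit (e.g., a tweet ID) to the list of labels assigned
    by the coders who rated it; units with fewer than two labels are ignored.
    """
    coincidences = Counter()  # ordered pairs of values within each unit
    for labels in units.values():
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    n = sum(coincidences.values())  # total number of pairable values
    totals = Counter()              # marginal frequency of each category
    for (a, _b), count in coincidences.items():
        totals[a] += count

    # alpha = 1 - observed disagreement / expected disagreement
    observed = sum(c for (a, b), c in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a, b in permutations(totals, 2)) / (n - 1)
    return 1.0 - observed / expected

# Two coders, four tweets: agreement on three, disagreement on one.
ratings = {"t1": ["C", "C"], "t2": ["C", "C"], "t3": ["O", "O"], "t4": ["O", "C"]}
print(round(krippendorff_alpha_nominal(ratings), 3))  # 0.533
```

As the example shows, even a single disagreement in a small sample pulls alpha well below the .80 benchmark, which is one reason short, ambiguous tweets make the .667–.80 range hard to reach consistently.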
To assess the contribution to accuracy made by this detailed and systematic hand-coding process, a RoBERTa model was used. First, the RoBERTa model was pre-trained on 1,000,000 tweets from January 1, 2020. Next, the model was fine-tuned for the three classification tasks (substance type, substance use, and substance intent) using 4,000 manually labeled tweets. Accuracy ranged from .879 for intent to .924 for type. This was consistent with the alpha results in terms of higher versus lower ratings, suggesting that among the three categories, “intent” remained the most challenging to classify consistently.
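The per-task evaluation can be illustrated with a small sketch: given gold labels and model predictions for each of the three tasks, accuracy is simply the fraction of tweets classified correctly. The labels and predictions below are invented for illustration; the source does not publish example predictions. (In the project itself, a RoBERTa classifier would be fine-tuned per task, for example with the Hugging Face `transformers` library, before this evaluation step.)

```python
def accuracy(gold, predicted):
    """Fraction of labels the classifier predicted correctly for one task."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align one-to-one")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Illustrative labels for the three classification tasks (not project data).
gold = {
    "type":   ["cannabis", "opioid", "cannabis", "stimulant"],
    "use":    ["use", "no-use", "use", "use"],
    "intent": ["current", "historical", "planned", "current"],
}
pred = {
    "type":   ["cannabis", "opioid", "cannabis", "stimulant"],  # 4/4 correct
    "use":    ["use", "no-use", "no-use", "use"],               # 3/4 correct
    "intent": ["current", "current", "planned", "current"],     # 3/4 correct
}
for task in gold:
    print(task, accuracy(gold[task], pred[task]))
```

Computing accuracy separately per task, as here, is what allows the .879 (intent) versus .924 (type) comparison reported above.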
Next steps include use of more recent Twitter data to assess and refine the codebook, followed by additional assessment of RoBERTa models and potential assessment of this process with other sources of text-based social media data.