
Georgetown University’s CSET Leverages Snorkel Flow for NLP Applications in Policy Research

Technology Category
  • Sensors - Flow Meters
  • Sensors - Liquid Detection Sensors
Applicable Industries
  • Cement
  • Education
Applicable Functions
  • Product Research & Development
  • Quality Assurance
Use Cases
  • Chatbots
  • Machine Translation
Services
  • Data Science Services
  • Training
About The Customer
The Center for Security and Emerging Technology (CSET) is a policy research organization within Georgetown University’s Walsh School of Foreign Service. It produces data-driven research on security and technology and provides non-partisan analysis to the policy community. CSET is committed to preparing a new generation of decision-makers to address the challenges and opportunities of emerging technologies such as artificial intelligence, advanced computing, and biotechnology. It provides unprecedented coverage of the emerging technology ecosystem and its security implications, bolstered by novel methods to classify and analyze research and technical outputs from diverse sources, including foreign-language materials.
The Challenge
The Center for Security and Emerging Technology (CSET) at Georgetown University faced the challenge of building NLP applications to classify complex research documents. The goal was to surface scientific articles of analytic interest to inform data-driven policy recommendations, but a large-scale manual labeling effort would have been impractical. The team initially experimented with the Snorkel Research Project, which allowed them to programmatically label 90K data points within weeks at 77% precision. However, collaboration between data scientists and subject-matter experts was time-consuming and inefficient, spread across spreadsheets, Slack channels, and Python scripts, which made improving data and model quality a slow process. The team lacked efficient tooling to auto-label data, gain visibility into it, and improve training-data and model quality. Without an integrated feedback loop from model training and analysis back to labeling, data scientists and subject-matter experts spent long cycles re-labeling data to match evolving business criteria. These challenges limited the team’s capacity to deliver production-grade models, shorten project timelines, and take on more projects.
The Solution
CSET's data scientists attended Snorkel's The Future of Data-centric AI conference and decided to explore Snorkel Flow, a data-centric AI platform, as a potential solution. Snorkel Flow drastically reduced labeling, model-training, and iteration time, and better equipped CSET’s data science team to collaborate closely with analysts to gather, process, and interpret data at scale. The team created 60+ labeling functions (LFs) to programmatically label 107K data points, using advanced features such as keyword LFs, auto-suggest LFs, and cluster LFs. They also used embedding similarity and negative sampling to improve the representation of the negative class. Snorkel Flow let the team pinpoint data slices for domain-expert spot-checks and troubleshooting, powering an active learning workflow that improved accuracy. The platform also improved collaboration between domain experts and data scientists through an easy-to-use GUI for authoring LFs, with comments and tags to discuss and resolve complex cases efficiently. Advanced LFs based on foundation-model embedding distances and clustering increased productivity, and guided error analysis with prioritized examples for targeted manual review via active learning reduced the time needed to adapt.
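To make the labeling-function idea concrete, the sketch below shows keyword LFs combined by a simple majority vote, in the style of Snorkel's data programming approach. It is a minimal illustration only: the keywords, label scheme, and function names are hypothetical, not CSET's actual LFs, and Snorkel Flow authors LFs through a GUI and aggregates votes with a learned label model rather than a plain majority vote.

```python
# Hypothetical sketch of programmatic labeling with keyword labeling
# functions (LFs). Keywords and labels are illustrative, not CSET's.
from collections import Counter

ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1

def lf_keyword_ai(doc: str) -> int:
    """Vote RELEVANT if the document mentions AI-related terms."""
    terms = ("artificial intelligence", "machine learning", "neural network")
    return RELEVANT if any(t in doc.lower() for t in terms) else ABSTAIN

def lf_keyword_out_of_scope(doc: str) -> int:
    """Vote NOT_RELEVANT for clearly out-of-scope topics."""
    terms = ("cooking", "sports")
    return NOT_RELEVANT if any(t in doc.lower() for t in terms) else ABSTAIN

LFS = [lf_keyword_ai, lf_keyword_out_of_scope]

def majority_vote(doc: str) -> int:
    """Aggregate non-abstaining LF votes. Snorkel's label model is a
    weighted, noise-aware generalization of this majority vote."""
    votes = [v for lf in LFS if (v := lf(doc)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

docs = [
    "A survey of neural network methods for machine translation.",
    "Regional sports league results and cooking tips.",
]
labels = [majority_vote(d) for d in docs]
print(labels)  # [1, 0]
```

Because each LF can abstain, many weak, overlapping heuristics can be combined into a single training label per document, which is what lets a small team label ~107K data points without annotating each one by hand.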
Operational Impact
  • The implementation of Snorkel Flow significantly improved collaboration between data scientists and domain experts. The easy-to-use GUI for authoring labeling functions, together with comments and tags for discussing and resolving complex cases, made the process more efficient. Advanced labeling functions based on foundation-model embedding distances and clustering increased productivity, while guided error analysis and prioritized examples for targeted manual review using active learning reduced the time needed to adapt to evolving business criteria. The solution removed much of the friction in data science and domain-expert collaboration by bringing domain experts into the loop during model development, significantly improving project buy-in, knowledge transfer, and productivity.
Quantitative Benefit
  • Programmatically labeled 107K data points using advanced features
  • Achieved 85% precision on positive class, an eight percentage-point improvement over the previous solution
  • Significant reduction in labeling time

