
Georgetown University’s CSET Leverages Snorkel Flow for NLP Applications in Policy Research

Technology Category
  • Sensors - Flow Meters
  • Sensors - Liquid Detection Sensors
Applicable Industries
  • Cement
  • Education
Applicable Functions
  • Product Research & Development
  • Quality Assurance
Use Cases
  • Chatbots
  • Machine Translation
Services
  • Data Science Services
  • Training
About The Customer
The Center for Security and Emerging Technology (CSET) is a policy research organization within Georgetown University’s Walsh School of Foreign Service. It produces data-driven research on security and technology and provides non-partisan analysis to the policy community. CSET is committed to preparing a new generation of decision-makers to address the challenges and opportunities of emerging technologies such as artificial intelligence, advanced computing, and biotechnology. It provides unprecedented coverage of the emerging technology ecosystem and its security implications, bolstered by novel methods to classify and analyze research and technical outputs from diverse sources, including foreign-language materials.
The Challenge
The Center for Security and Emerging Technology (CSET) at Georgetown University faced the challenge of building NLP applications to classify complex research documents. The goal was to surface scientific articles of analytic interest to inform data-driven policy recommendations, but a large-scale manual labeling effort would have been impractical. The team initially experimented with the Snorkel Research Project, which allowed them to programmatically label 90K data points within weeks at 77% precision. However, collaboration between data scientists and subject-matter experts was time-consuming and inefficient, spread across spreadsheets, Slack channels, and Python scripts, which made improving data and model quality a slow process. The team lacked efficient tooling to auto-label data, gain visibility into it, and improve training-data and model quality. Without an integrated feedback loop from model training and analysis back to labeling, data scientists and subject-matter experts spent long cycles re-labeling data to match evolving business criteria. These challenges limited the team’s capacity to deliver production-grade models, shorten project timelines, and take on more projects.
The Solution
CSET's data scientists attended Snorkel's The Future of Data-centric AI conference and decided to explore Snorkel Flow, a data-centric AI platform, as a potential solution. Snorkel Flow drastically reduced labeling, model-training, and iteration time, and better equipped CSET’s data science team to collaborate closely with analysts to gather, process, and interpret data at scale. The team created 60+ labeling functions (LFs) to programmatically label 107K data points, using advanced features such as keyword LFs, auto-suggest LFs, and cluster LFs. They also used embedding similarity and negative sampling to improve the representation of the negative class. Snorkel Flow let the team pinpoint data slices for domain-expert spot-checks and troubleshooting, powering an active learning workflow that improved accuracy. The platform also improved collaboration between domain experts and data scientists through an easy-to-use GUI for authoring LFs, with comments and tags to discuss and resolve complex cases efficiently. Advanced LFs based on foundation-model embedding distances and clustering increased productivity, and guided error analysis with prioritized examples for targeted manual review via active learning reduced the time needed to adapt.
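To make the labeling-function idea concrete, the sketch below shows keyword LFs combined by a simple majority vote, in the style of Snorkel's data programming approach. It is a minimal illustration only: the keywords, label scheme, and function names are hypothetical, not CSET's actual LFs, and Snorkel Flow authors LFs through a GUI and aggregates votes with a learned label model rather than a plain majority vote.

```python
# Hypothetical sketch of programmatic labeling with keyword labeling
# functions (LFs). Keywords and labels are illustrative, not CSET's.
from collections import Counter

ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1

def lf_keyword_ai(doc: str) -> int:
    """Vote RELEVANT if the document mentions AI-related terms."""
    terms = ("artificial intelligence", "machine learning", "neural network")
    return RELEVANT if any(t in doc.lower() for t in terms) else ABSTAIN

def lf_keyword_out_of_scope(doc: str) -> int:
    """Vote NOT_RELEVANT for clearly out-of-scope topics."""
    terms = ("cooking", "sports")
    return NOT_RELEVANT if any(t in doc.lower() for t in terms) else ABSTAIN

LFS = [lf_keyword_ai, lf_keyword_out_of_scope]

def majority_vote(doc: str) -> int:
    """Aggregate non-abstaining LF votes. Snorkel's label model is a
    weighted, noise-aware generalization of this majority vote."""
    votes = [v for lf in LFS if (v := lf(doc)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

docs = [
    "A survey of neural network methods for machine translation.",
    "Regional sports league results and cooking tips.",
]
labels = [majority_vote(d) for d in docs]
print(labels)  # [1, 0]
```

Because each LF can abstain, many weak, overlapping heuristics can be combined into a single training label per document, which is what lets a small team label ~107K data points without annotating each one by hand.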
Operational Impact
  • The implementation of Snorkel Flow significantly improved collaboration between data scientists and domain experts. The easy-to-use GUI for authoring labeling functions, together with comments and tags for discussing and resolving complex cases, made the process more efficient. Advanced labeling functions based on foundation-model embedding distances and clustering increased productivity, while guided error analysis and prioritized examples for targeted manual review using active learning reduced the time needed to adapt to evolving business criteria. The solution removed much of the friction in data science and domain-expert collaboration by bringing domain experts into the loop during model development, significantly improving project buy-in, knowledge transfer, and productivity.
Quantitative Benefit
  • Programmatically labeled 107K data points using advanced features
  • Achieved 85% precision on positive class, an eight percentage-point improvement over the previous solution
  • Significant reduction in labeling time

