Visit the Pennsylvania State University Home Page

Center for Trustworthy Machine Learning

Data Sets

Data is the new currency of our day and age and is being used to train sophisticated models for prediction and classification tasks. These models are revolutionizing many domains, including but not limited to healthcare, technology, and manufacturing. A model is often only as good as its data; as a result, data should be thoroughly cleaned and formatted before it is analyzed. Documentation for popular datasets is sometimes sparse, and better documentation is necessary. To support the growth of machine learning and streamline the training process, we have compiled various datasets and information pertaining to them.

Please check back often as we add to this list.

NEW DATA SETS

1. Measuring Massive Multitask Language Understanding; data set containing OpenAI API evaluation code.

2. Natural Adversarial Examples; data set of real-world, unmodified, naturally occurring examples that cause ML model performance to degrade.

3. Measuring Coding Challenge Competence with APPS; a repository containing evaluation code.

4. Measuring Mathematical Problem Solving with the MATH dataset; repository containing dataset loaders and evaluation code.

5. Aligning AI with Shared Human Values; repository whose folders contain fine-tuning scripts for individual tasks of the ETHICS benchmark. There is also an interactive script to probe a commonsense morality model and a utilitarianism model.

6. Forecasting Future World Events with Neural Networks; forecasts of climate, geopolitical conflict, pandemics, and economic indicators help shape policy and decision making.

8. Natural language descriptions of distribution shifts

9. Combined Anomalous Object Segmentation (CAOS)

10. ImageNet-R(endition) and DeepAugment

11. Anomalous models for auditing visualization methods 

12. Reward hacking environments

13. Measuring moral behavior in reinforcement learning agents

14. Using data augmentation to improve robustness

 

Copyright 2025 © The Pennsylvania State University

Support for the Center for Trustworthy Machine Learning (CTML) is provided through NSF Grant CNS-1805310, part of the NSF Secure and Trustworthy Cyberspace Program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Additional support is provided by Penn State University, Stanford University, UC Berkeley, UC San Diego, University of Wisconsin, and University of Virginia.