1. Utilize Privacy-Preserving Data Mining Techniques and Synthetic Data to Protect Sensitive Data While Enabling Advanced Analytic Model Development.
The collection and use of sensitive data, including personally identifiable information and protected health information such as genomics, clinical data, and other real-world data, is increasing dramatically across the scientific community. Example uses of sensitive data in scientific computing include analysis of post-market surveillance adverse event reports, development of advanced analytic clinical decision support models, and genetic biomarker research. This new paradigm in scientific computing relies on data access and collaborative analysis of real-world data rather than the traditional isolationist model, which makes innovative, privacy-preserving approaches to data mining and synthetic data generation essential. Below we describe four notable approaches to privacy-preserving data mining and synthetic data generation. Considerations for selecting the appropriate method include privacy guarantees, data realism, and data access requirements.
- Federated learning is a privacy-preserving data mining technique that makes data available for collaborative analysis while the data remain secure on their original servers. Federated learning works by training a machine learning algorithm across multiple local datasets held on local nodes without explicitly exchanging data samples.
- Differential privacy is a property of an algorithm: the behavior of a differentially private algorithm does not change in a way that is traceable to any single individual joining or leaving the dataset. In practice, this is typically achieved by adding calibrated random noise to the algorithm's results.
- Model-to-data approaches also address data privacy concerns by releasing synthetic data for artificial intelligence and machine learning (AI/ML) model development while withholding sensitive data in a secure private computational environment for model evaluation.
- Synthetic data generation, based on algorithms such as generative adversarial networks, is able to produce data that is a realistic alternative to sensitive real-world data.
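The first two approaches in the list above can be illustrated with a minimal Python sketch. It is not a production implementation: the toy datasets, the function names (`local_update`, `federated_average`, `laplace_count`), and the choice of a one-parameter linear model are all assumptions made for illustration. The first part shows federated averaging, where only model weights leave each node; the second shows the Laplace mechanism, one common way to make a counting query differentially private.

```python
import random

def local_update(w, data, lr=0.01, epochs=20):
    """One node's training pass: gradient descent on y = w * x using only local data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(nodes, rounds=10, w=0.0):
    """Server loop: send w out, collect updated weights, average them by sample count."""
    total = sum(len(d) for d in nodes)
    for _ in range(rounds):
        # Only the updated weight and the sample count leave each node, never the data.
        local_ws = [(local_update(w, d), len(d)) for d in nodes]
        w = sum(lw * n for lw, n in local_ws) / total
    return w

def laplace_count(records, predicate, epsilon, rng=random):
    """Differentially private count via the Laplace mechanism (a count has sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise

# Hypothetical toy data: three sites, each holding private (x, y) pairs with y = 2x.
nodes = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]
w_global = federated_average(nodes)  # converges toward the shared slope of 2

# Hypothetical ages; the released count is the true count (3) plus random noise.
ages = [34, 51, 29, 62, 47, 58]
noisy_over_50 = laplace_count(ages, lambda a: a > 50, epsilon=0.5, rng=random.Random(0))
```

A smaller `epsilon` adds more noise and gives a stronger privacy guarantee at the cost of accuracy, which is one concrete form of the trade-off between privacy guarantees and data realism noted above.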
2. Leverage Hybrid and Cloud Computing to Improve Data Sharing, Accelerate Computing, and Reduce Development Burden.
As the computational infrastructure demands of data management and analytics have increased and as workforces, in general, have become more geographically dispersed, it is critical for scientific computing to provide computing environments that are scalable and available on-demand. A hybrid on-premises and cloud computing approach enables scientific computing organizations to leverage their already-owned infrastructure, such as on-premises high-performance computing clusters, while still receiving the benefits of cloud computing. For example, bursting excess computational demand to a public cloud can accelerate scientific discovery by reducing the time spent waiting for available computational resources. Public cloud environments also offer services (e.g., assistance with automating parts of the ML and AI model development process), models (e.g., for natural language processing, demand forecasting, and equipment monitoring), and specialized hardware (e.g., graphics processing units) that can accelerate AI/ML adoption, development, and operationalization.
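The bursting pattern described above can be sketched as a simple placement rule. The example below is a hypothetical policy, not the behavior of any particular scheduler or cloud provider: the `Job` class, the `place_jobs` function, the job names, and the core counts are all illustrative assumptions. Jobs are kept on the on-premises cluster while free cores remain, and any job that does not fit is sent to the public cloud instead of waiting in the queue.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int

def place_jobs(jobs, on_prem_free_cores):
    """Assign each job to on-premises capacity if it fits, otherwise burst it to the cloud."""
    placement = {}
    free = on_prem_free_cores
    for job in jobs:
        if job.cores <= free:
            placement[job.name] = "on_prem"
            free -= job.cores
        else:
            # Excess demand goes to the cloud rather than waiting for local resources.
            placement[job.name] = "cloud"
    return placement

# Hypothetical queue and capacity, for illustration only.
queue = [Job("align", 64), Job("variant_call", 32), Job("train_model", 96), Job("qc", 16)]
placement = place_jobs(queue, on_prem_free_cores=128)
```

A real policy would also weigh data transfer cost, data sensitivity, and cloud pricing before bursting a job; this sketch considers core availability only.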
3. Boost Workforce Knowledge Management, Training, and User Adoption Programs to Increase Organizational Resilience to a Competitive Scientific Computing Labor Market.
Knowledge management, training, and user-friendliness are crucial for promoting the adoption of novel technology. Within the scope of scientific computing, ease of use can drastically impact long-term rates of innovation by democratizing access to powerful toolsets and freeing those with specialized training and experience to focus on the most challenging problems. Computational scientists are in high demand, making it critical for organizations to effectively onboard staff and to ensure the retention of knowledge that may otherwise be lost to attrition. The management and maintenance of technical institutional knowledge (for example, the standard operating procedures for computing infrastructure and laboratory processes) does not happen by itself; it must be incentivized. Retrieval of documented knowledge is improved through cognitive search capabilities that combine indexing, automated curation, and artificial intelligence to provide personalized, relevant results and reduce information overload. Knowledge management must enable reproducibility in order to increase efficiency in large working groups. Computational research practices, including the use of version control systems, documentation of code using markup languages, and provision of execution instructions, increase scientific computing reproducibility. Furthermore, workflow languages allow for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments. Training programs have shown marked success in motivating usage of cutting-edge technology. The most effective programs emphasize addressing user needs, using clear communication, and providing an engaging experience.
Often, combining data science training with access to HPC infrastructure is an enticing proposition for experts unfamiliar with the intricacies of technical scientific computing. Coaching and pair programming are attractive pathways to improving technical skillsets because they provide a human component to learning.
4. Utilize Tech Scouting to Enable Informed Strategic Planning and Technology Adoption while Improving Staff Retention.
As demonstrated through innovations in data generation, computing infrastructure, and advanced analytics, scientific computing resources are constantly evolving and improving. Technology scouting is the process by which emerging trends and new applications are captured from existing technologies, products, or services from both domestic and international public and private sectors. Tech scouting not only facilitates the adoption of such technologies in a timely, streamlined manner but also helps organizations develop effective strategic plans for integrating or otherwise responding to current and developing technologies. In addition, exposure to emerging technology trends can increase scientific computing talent retention by keeping staff engaged and excited.