Student Projects

About Us

Our team creates powerful supercomputers for modeling and analyzing the most complex problems and the largest data sets to enable revolutionary discoveries and capabilities. Many of these capabilities have been developed and published in partnership with amazing students (see our Google Scholar page).

Here is a selection of videos describing some of our work:

Listed below is a wide range of potential projects in AI, Mathematics, Green Computing, Supercomputing Systems, Online Learning, and Connection Science. If you are interested in any of these projects, please send us an email at supercloud@mit.edu (please avoid using ChatGPT or another LLM to write your email).

AI Projects

Speeding up Architecture and Hyper-Parameter Searches
Architecture and hyper-parameter searches are a major part of the AI development pipeline. This workflow typically consists of training a large number of models to identify architectures best suited for a given mission. These searches can take a significant amount of compute resources and time. Our project aims to develop new approaches to early identification of optimal architectures or hyper-parameters by modeling the loss curve trajectories during the training process. There are two approaches we are developing.
1. Training performance estimation (TPE) – TPE has been shown to be very effective at estimating the converged training performance of graph neural networks. We are currently exploring the application of TPE to other models and domains.
2. Loss curve gradient approximation (LCGA) – The LCGA approach aims to model the curvature of the training loss and has been shown to be effective at identifying optimal architectures across different model families while also maintaining the relative ordering of model losses.
  Next steps in this research involve:
  - Extending LCGA as an early stopping mechanism – By modeling the training loss curve, it may be possible to identify the optimal number of epochs required to train a model beyond which the model does not improve substantially.
  - Trainless architecture searches – By modeling the loss curves across a large family of deep neural network architectures, it may be possible to identify optimal new architectures without the need for training every possible architecture.
  References:
  - Energy-Aware Neural Architecture Selection and Hyperparameter Optimization
  - Loss Curve Approximations for Fast Neural Architecture Ranking & Training Elasticity Estimation
  - From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference
  - Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
  - Green Carbon Footprint for Model Inference Serving via Exploiting Mixed-Quality Models and GPU Partitioning
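The idea of ranking candidate models from the early part of their loss curves can be illustrated with a small sketch. This is not the TPE or LCGA method from the references. It is a simplified stand-in that assumes each loss curve follows a pure power law, loss(t) = a * t^(-b), fits that form to the first few epochs, and extrapolates to a later epoch. The function names and the synthetic curves are hypothetical.

```python
import math

def fit_power_law(losses):
    """Least-squares fit of log(loss) = log(a) - b * log(epoch).

    `losses[0]` is the loss after epoch 1. Returns (a, b).
    Assumes all losses are positive and at least two epochs are given.
    """
    xs = [math.log(t) for t in range(1, len(losses) + 1)]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

def predict_loss(losses, epoch):
    """Extrapolate the fitted power law to a later epoch."""
    a, b = fit_power_law(losses)
    return a * epoch ** (-b)

def rank_by_predicted_loss(early_curves, final_epoch):
    """Order model names from lowest to highest predicted final loss."""
    predicted = {name: predict_loss(curve, final_epoch)
                 for name, curve in early_curves.items()}
    return sorted(predicted, key=predicted.get)

# Synthetic five-epoch curves: model_a starts worse but decays faster.
curves = {
    "model_a": [2.0 * t ** -0.3 for t in range(1, 6)],
    "model_b": [1.5 * t ** -0.1 for t in range(1, 6)],
}
ranking = rank_by_predicted_loss(curves, final_epoch=100)
```

In this toy example model_a has the higher loss after the first epoch yet is ranked first at epoch 100, which is the kind of reordering an early estimate needs to capture. Real loss curves rarely follow a pure power law, so a practical estimator would need a richer curve model and validation against fully trained runs.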
AI Analysis of User Interactions
The SuperCloud team communicates with and provides assistance to users via email. What can we learn from the email and Zoom communications to improve SuperCloud and the user experience on SuperCloud? We want to build an infrastructure to collect, parse, and analyze the email communications to support these improvements.
Hierarchical Anonymized AI
The costs of adversarial activity on networks are growing at an alarming rate and have reached $1T per year. 90% of Americans are now concerned about cyber-attacks, a level of public concern that is greater than that for pandemics and nuclear war. In the land, sea, undersea, air, and space operating domains, observe-pursue-counter (detect-handoff-intercept) walls-out architectures have proven cost effective. Our recent innovations in high performance privacy-preserving network sensing and analysis offer new opportunities for obtaining the required observations to enable such architectures in the cyber domain. Using these network observations to pursue and counter adversarial activity requires the development of novel privacy-preserving hierarchical AI analytics techniques that explore connections both within and across the layers of the knowledge pyramid, from low-level network traffic to high-level social media.

Mathematics Projects

Mathematics of Big Data & Machine Learning
Big Data describes a new era in the digital age where the volume, velocity, and variety of data created across a wide range of fields is increasing at a rate well beyond our ability to analyze the data. Machine Learning has emerged as a powerful tool for transforming this data into usable information. Many technologies (e.g., spreadsheets, databases, graphs, matrices, deep neural networks, …) have been developed to address these challenges. The common theme amongst these technologies is the need to store and operate on data as tabular collections instead of as individual data elements. This project explores the common mathematical foundation of these tabular collections (associative arrays) that applies across a wide range of applications and technologies. Associative arrays unify and simplify Big Data and Machine Learning. Understanding these mathematical foundations enables seeing past the differences that lie on the surface of Big Data and Machine Learning applications and technologies and leveraging their core mathematical similarities to solve the hardest Big Data and Machine Learning challenges.
References:
- Mathematics of Big Data
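To make the notion of an associative array concrete, the following minimal sketch stores a table as a Python dictionary keyed by (row, column) string pairs and defines addition, transpose, and array multiplication on it. It is an illustration only, not the D4M API. The function names and the example data are hypothetical, and the ordinary + and * operations are used where a general associative array could use any semiring.

```python
from collections import defaultdict

def add(A, B):
    """Element-wise sum over the union of keys; absent entries act as zero."""
    C = dict(A)
    for key, value in B.items():
        C[key] = C.get(key, 0) + value
    return C

def transpose(A):
    """Swap row and column keys."""
    return {(col, row): value for (row, col), value in A.items()}

def multiply(A, B):
    """Array multiply: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    rows_of_B = defaultdict(dict)
    for (k, j), value in B.items():
        rows_of_B[k][j] = value
    C = {}
    for (i, k), a in A.items():
        for j, b in rows_of_B.get(k, {}).items():
            C[(i, j)] = C.get((i, j), 0) + a * b
    return C

# Which user touched which document (hypothetical data).
access = {
    ("alice", "doc1"): 1,
    ("alice", "doc2"): 1,
    ("bob", "doc2"): 1,
}

# access * access^T counts the documents each pair of users has in common.
shared = multiply(access, transpose(access))
```

The same three operations work unchanged whether the keys are spreadsheet cells, database row and column names, or graph vertices, which is the sense in which one structure can sit underneath all of these technologies.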
Catastrophe vs Conspiracy: Heavy Tail Statistics
Heavy-tail distributions, where the probability decays slower than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …). Computer networks are among the most notable examples of heavy-tail distributions, whose celebrated discovery led to the creation of the new field of Network Science. However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa). This Pd/Pfa paradox is consistent with the lived experience of many cyber operators and a possible root cause is the potential incompatibility of light-tail statistical tests on heavy-tail data. The goal of this effort is to develop the necessary educational and training tools for effectively understanding and applying heavy-tail distributions in a cyber context.
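The mismatch between light-tail tests and heavy-tail data can be seen with a small calculation. The sketch below is an illustration under stated assumptions rather than a model of any real detector: the assumed background is a unit-rate exponential distribution (light tail), the actual data is taken to be Pareto distributed with exponent 1.5 (heavy tail), and all function names and parameter values are hypothetical.

```python
import math

def exponential_tail(x, rate=1.0):
    """P(X > x) for an exponential distribution (light tail)."""
    return math.exp(-rate * x)

def pareto_tail(x, alpha=1.5, x_min=1.0):
    """P(X > x) for a Pareto distribution (heavy tail)."""
    return (x_min / x) ** alpha if x >= x_min else 1.0

def threshold_for_false_alarm(target_pfa, rate=1.0):
    """Threshold that would give `target_pfa` if the background were exponential."""
    return -math.log(target_pfa) / rate

target_pfa = 1e-6
threshold = threshold_for_false_alarm(target_pfa)  # about 13.8
actual_pfa = pareto_tail(threshold)                # about 0.019
inflation = actual_pfa / target_pfa
```

Under these assumptions, a threshold designed for one false alarm per million observations fires on roughly 2% of heavy-tail background observations. Raising the threshold far enough to restore the target rate would push it to about 10,000 under the Pareto model, which would hide most events of interest, so neither setting is satisfactory.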
Abstract Algebra of Cyberspace
Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and associative array algebra. This work will explore a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for network/graph analytics, database operations, and machine learning.
Mathematical Underpinnings of Associative Array Algebra
Semirings have found success as an algebraic structure that can support the variety of data types and operations used by those working with graphs, matrices, spreadsheets, and databases, and they form the mathematical foundation of the associative array algebra of D4M and the matrix algebra of GraphBLAS. Mirroring the fact that module theory has many but not all of the structural guarantees of vector space theory, semimodule theory has some but not all of the structural guarantees of module theory. The added generality of semirings allows semimodule theory to consider structures wholly unlike rings and fields, such as Boolean algebras and the max-plus algebra. By focusing on these special cases, which are diametrically opposed to the traditional ring and field cases, analogs of standard linear algebra techniques like eigenanalysis and solving linear systems can be developed. This work will further explore the theory of semirings in the form of solving linear systems, carrying out eigenanalysis, and graph algorithms.
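As a small illustration of how changing the semiring changes what a matrix product computes, the sketch below implements a generic matrix multiply that takes the "addition", "multiplication", and additive identity as parameters. It is a plain-Python illustration, not the GraphBLAS or D4M API. The function name and the example graph are hypothetical.

```python
import operator

def semiring_matmul(A, B, add, mul, zero):
    """C[i][j] = add over k of mul(A[i][k], B[k][j]), starting from `zero`."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C

NO_EDGE = float("-inf")

# Weighted directed cycle 0 -> 1 -> 2 -> 0 with weights 2, 3, 1.
W = [
    [NO_EDGE, 2, NO_EDGE],
    [NO_EDGE, NO_EDGE, 3],
    [1, NO_EDGE, NO_EDGE],
]

# Max-plus semiring: entry (i, j) is the largest total weight of a two-step path.
W2 = semiring_matmul(W, W, add=max, mul=operator.add, zero=NO_EDGE)

# Boolean semiring: entry (i, j) says whether any two-step path exists.
R = [[w != NO_EDGE for w in row] for row in W]
R2 = semiring_matmul(R, R, add=operator.or_, mul=operator.and_, zero=False)
```

Here W2[0][2] is 5 (the path 0 -> 1 -> 2), while R2 records only that such a path exists. Neither semiring has additive inverses, which is why techniques that rely on subtraction, such as Gaussian elimination, need new analogs in this setting.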

Supercomputing Systems Projects

MIT/Stanford Next Generation Operating System
The goal of the MIT/Stanford DBOS (the DBMS-oriented Operating System) is to build a completely new operating system (OS) stack for distributed systems. Currently, distributed systems are built on many instances of a single-node OS like Linux with entirely separate cluster schedulers, distributed file systems, and network managers. DBOS uses a distributed transactional DBMS as the basis for a scalable cluster OS. We have shown that such a database OS can do scheduling, file management, and inter-process communication with performance competitive with existing systems. It can additionally provide significantly better analytics and dramatically reduce code complexity by building core OS services from standard database queries, while implementing low-latency transactions and high availability only once. We are currently working on building a complete end-to-end prototype of DBOS. This project will explore implementing next generation cyber analytics within DBOS.
Supercomputing and Cloud interoperability
Shared supercomputing resources typically offer a limited set of hardware and software combinations that researchers can leverage. At the same time, the absolute number of resources offered is also physically limited. Commercial cloud providers can offer an avenue to leverage additional resources as well as new/unique hardware as needs arise. Thus, having the technology to seamlessly transition between a shared resource such as the MIT SuperCloud and a commercial cloud provider (AWS, Microsoft Azure, Google Cloud) can significantly increase user productivity and enable new research. This project aims to:
- Make the MIT SuperCloud software stack available as a deployable image on commercial cloud providers
- Develop tools to seamlessly transition between SuperCloud and cloud as requirements change
- Enable sponsors and funding agencies to provide a standard AI stack that can be leveraged by performers and the broader community
Performance Tuning of Large-Scale Cluster Management Systems
Modern supercomputers rely on a collection of open-source and bespoke software to handle node and user provisioning, system configuration, configuration persistence, change management, monitoring, metrics gathering, and imaging. The MIT SuperCloud system’s routine monthly maintenance includes a full reimage and reinstall of the operating system, all software, and configuration files to ensure the reliability of our imaging system as well as to maintain a consistent state for our users, preventing the accumulation of incidental changes which could complicate troubleshooting and interfere with the running of user jobs. The frequency with which we reimage nodes necessitates that the process be streamlined and optimized such that node reinstallation is as quick and reliable as possible. This project would explore methods to refine our node installation procedures and search for new efficiencies, furthering our ability to manage and maintain very large systems.
Datacentric AI
The Datacentric AI project aims to develop revolutionary data-centric systems that can enable edge-to-datacenter scale computing while also providing high performance and accuracy for AI tasks, high productivity for AI developers using the system, and self-driven resource management of the underlying complex systems. Rapidly evolving technologies such as new computing architectures, AI frameworks, supercomputing systems, cloud, and data management are the key enablers of AI, and the speed at which they develop is outpacing the ability of AI practitioners to leverage them optimally. As compute capability, AI frameworks, and data diversity have grown, AI models have also evolved from traditional feed-forward or convolutional networks that employ computationally simple layers to more complex networks that use differential equations to model physical phenomena. These new classes of algorithms and massive model architectures need new types of data-centric systems that can help map the novel computing requirements onto ever more complex hardware platforms such as quantum processors, neuromorphic processors, and datacenter-scale chips. A data-centric system would need revolutionary operating systems, ML-enhanced data management, highly parallel algorithms, and workload-aware schedulers that can automatically map workloads to heterogeneous hardware platforms. By developing technologies to address these needs, this project aims to provide Lincoln and DoD researchers with the tools that future AI systems will require.
Parallel Python Programming
There is a plethora of libraries that enable parallel programming in Python, but little has been done with the partitioned global array semantics (PGAS) approach. Using the PGAS approach, one can deploy a parallel capability that provides good speed-up without sacrificing the ease of programming in Python. This project will explore the scalability and performance of the preliminary implementation of PGAS in Python, compare its performance with other libraries available for parallel programming in Python, and potentially seek further performance optimizations in the current PGAS implementation.
References:
- pPython for Parallel Python Programming
- pPython Performance Study
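For readers new to PGAS, the core idea is that every process sees one global array but owns only a slice of it. The sketch below shows one common way to compute that ownership, a block distribution, in plain Python. It is not the pPython API. The function names are hypothetical and the choice of block distribution is an assumption made for illustration.

```python
def local_range(global_size, num_procs, rank):
    """Half-open range [start, stop) of global indices owned by `rank`.

    Indices are split into contiguous blocks; the first
    `global_size % num_procs` ranks receive one extra element.
    """
    base, extra = divmod(global_size, num_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def owner(global_size, num_procs, index):
    """Rank that owns a given global index under the same distribution."""
    base, extra = divmod(global_size, num_procs)
    boundary = extra * (base + 1)  # first index owned by a rank without an extra element
    if index < boundary:
        return index // (base + 1)
    return extra + (index - boundary) // base

# A global array of 10 elements spread over 4 processes.
ranges = [local_range(10, 4, rank) for rank in range(4)]
```

Each process would then compute only on its own range and communicate when it needs elements that `owner` assigns to another rank. Keeping that bookkeeping inside the array abstraction is what lets user code keep the look of ordinary serial array code.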
3D Visualization of Supercomputer Performance
There are a number of data collection and visualization tools to assist in the real-time performance analysis of High Performance Computing (HPC) systems, but there is also a need to analyze past performance for troubleshooting and system behavior analysis. Optimizing HPC systems for processing speed, power consumption, and network performance can be difficult to do in real time, so a system that uses collected data to “rerun” system performance would be advantageous. Gaming engines, like Unity 3D, can be used to build virtual system representations and run scenarios using historical or manufactured data to identify system failures or bottlenecks, and the systems can then be fine-tuned to optimize performance metrics.
References:
- 3D Real-Time Supercomputer Monitoring
- Large Scale Network Situational Awareness via 3D Gaming Technology
Data Analytics and 3D Game Development
The LLSC operates and maintains a large number of High Performance Computing clusters for general-purpose accelerated discovery across many research domains. The operation of these systems requires access to detailed information regarding the status of system schedulers, storage arrays, compute nodes, network data, and data center conditions. Together, these sources produce over 80 million data points per day. Effectively correlating this volume of data into actionable information requires innovative approaches and tools. The LLSC has developed a 3D Monitoring and Management (M&M) platform by leveraging Unity3D to render the physical data center space and assets into a virtual environment, which strives to provide a holistic view of the HPC resources in a human-digestible format. Our goal is to achieve a level of situational awareness that enables the operations team to identify and correct issues before they negatively impact the user experience. Some near-term goals are to fold the innovative Green Data Center challenge work and data into the M&M system to enable the identification of carbon impacts of different job profiles across a heterogeneous compute environment.

Education, Training, and Outreach Projects

Evaluate the State of User Applications
- Capture the most commonly used applications/workflows
- Compare to existing set of teaching examples
- Design a prioritized suite of new examples or updates to existing examples
- Highlight topics that would make good micro-lessons
Expand the Suite of Teaching Examples
- Use the prioritized list created via project defined above or start with code snippets that we currently have
- Identify areas/scripts that would be beneficial to users:
- Build whole learning module, e.g. tensorboard
- documentation converted to scripts (where possible)
- clean up scripts – (ascii art) to make it clearer for user
- testing
- include description of what is happening on systems
- include information on impact to system and user
- create mini-workshop
- convert workshop to video & hands-on module
Evaluation of Educational Games
- Explore the literature to understand methods for evaluating learning
- Evaluate our HPC Games
- What data should we collect? survey design? is there any data that we should look for in the online version?
- recommend a formal education plan for each game
  References:
  - A Data Driven Approach to Informal HPC Training Evaluation
Knowledge Graph from Video Scripts
- NLP
- align with content in course(s)
Learning Analytics (WPLA – workplace learning analytics)
- review of WPLA literature, best practices
- using SLURM data and Edly Data to evaluate effectiveness of courseware
- path from Jupyter NB to batch jobs
- efficiency of jobs
- email requests aligned with learning modules – do we lower the amount of simple email?
Exploring User Data
- Prior experience
- Departments, position, affiliation
- How long are they using the system for? “Turnover”?
- Combine with Slurm data: how much are people using the system? Are there correlations between system use and how long they’ve had their account?
User Engagement
Supercomputing operations can be further optimized by actively working with users who are running large scale compute in the datacenter. This can take the form of education for optimizing compute, policies that enable users to choose low-power compute, ability to defer compute to cooler times of the day, etc.

Connection Science Projects

Evolution of Security vs Defense vs Deterrence in Geopolitical Cyber Conflict
The costs of adversarial activity on networks are growing at an alarming rate and have reached $1T per year. A common starting point of a systems analysis for protecting a domain, whether it is land, sea, air, space, or cyberspace, is defining the desired end state. What do we mean by protecting our shared network domain? A few desired end states are: (1) An international community that observes and enforces norms of responsible state behavior; (2) Public-private partnerships based on a shared awareness and combined action; (3) The proactive observing, pursuing, and countering of adversary operations while reinforcing favorable international norms. A common foundation for all of these is community. Clearly, protecting cyberspace is going to require a community-based approach with the strongest possible regard for privacy. The standard approaches that are used to protect any domain are:
- Deterrence: the existence of a credible threat of unacceptable counteraction
- (Walls-Out) Defense: actions taken to defeat threats that are threatening to breach cyberspace security
- (Walls-In) Security: actions taken within protected cyberspace to prevent unauthorized access, exploitation, or damage
Security is the dominant investment today and involves identifying and countering actions that are well described by the ATT&CK framework and take place almost entirely within protected cyberspace. Historically, in other domains, (walls-out) defense has been the most cost effective and most de-escalatory approach because threats are defeated before they reach the protected domain, which dramatically reduces the potential benefit of any attack. This project will explore the evolution of the cyber community from security, defense, and deterrence perspectives.
Where to Look – Cyber Sensor Placement and Calibration
Placement and calibration are critical to the effective functioning of any sensor in any domain: land, sea, air, space, and cyberspace. Cyber sensors have the potential to provide a wide range of information to protect systems. Furthermore, the potential analytics that can be applied to such information are enormous. However, the effectiveness of any cyber analytic is based on underlying assumptions about the data that are intrinsically tied to placement and calibration of the cyber sensor. Proper placement and calibration can improve cyber capability by as much as 1000x. This project will explore the key assumptions being made in cyber analytics, calibration processes, and the potential placement of cyber sensors with the goal of developing best practices for cyber operators.
Anonymized Network Analysis on the Edge
Long range detection is a cornerstone of defense in many operating domains (land, sea, undersea, air, space, …). In the cyber domain, long range detection requires the analysis of significant network traffic from a variety of observatories and outposts with the highest regard for privacy. Construction of anonymized hypersparse traffic matrices on edge network devices can be a key enabler by providing significant data compression in a rapidly analyzable format that protects privacy. GraphBLAS is ideally suited for both constructing and analyzing anonymized hypersparse traffic matrices. This project will explore the deployment of this capability on a variety of edge hardware.
Next Generation Spatial Temporal Cyber Data Products
Internet analysis is a major challenge due to the volume and rate of network traffic. In lieu of analyzing traffic as raw packets, network analysts often rely on compressed network flows (netflows) that contain the start time, stop time, source, destination, and number of packets in each direction. However, many traffic analyses benefit from temporal aggregation of multiple simultaneous netflows, which can be computationally challenging. To alleviate this concern, a novel netflow compression and resampling method has been developed leveraging GraphBLAS hypersparse traffic matrices that preserve anonymization while enabling subrange analysis. Standard multitemporal spatial analyses are then performed on each subrange to generate detailed statistical aggregates of the source packets, source fan-out, unique links, destination fan-in, and destination packets of each subrange, which can then be used for background modeling and anomaly detection. A simple file format based on GraphBLAS sparse matrices is developed for storing these statistical aggregates. The resulting compression achieved is significant (<0.1 bit per packet), enabling extremely large netflow analyses to be stored and transported. This project will explore extending this approach to create the next generation of standard cyber data products.
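The statistical aggregates named above can be illustrated with a small sketch that builds a sparse traffic matrix from (source, destination) pairs and derives each quantity from it. This is a plain-Python illustration, not the GraphBLAS implementation. A dictionary stands in for the hypersparse matrix, a salted SHA-256 hash is an assumed stand-in for whatever anonymization scheme is actually used, and the function names, salt, and addresses are hypothetical.

```python
import hashlib
from collections import defaultdict

def anonymize(address, salt):
    """Replace an address with a salted hash (illustrative scheme only)."""
    return hashlib.sha256((salt + address).encode()).hexdigest()[:16]

def traffic_matrix(packets, salt):
    """Sparse matrix A[(src, dst)] = number of packets from src to dst."""
    A = defaultdict(int)
    for src, dst in packets:
        A[(anonymize(src, salt), anonymize(dst, salt))] += 1
    return dict(A)

def aggregates(A):
    """Per-window statistics computed from the nonzero entries of A."""
    source_packets = defaultdict(int)
    source_fan_out = defaultdict(int)
    dest_packets = defaultdict(int)
    dest_fan_in = defaultdict(int)
    for (src, dst), count in A.items():
        source_packets[src] += count  # row sums
        source_fan_out[src] += 1      # nonzeros per row
        dest_packets[dst] += count    # column sums
        dest_fan_in[dst] += 1         # nonzeros per column
    return {
        "total_packets": sum(A.values()),
        "unique_links": len(A),
        "max_source_packets": max(source_packets.values()),
        "max_source_fan_out": max(source_fan_out.values()),
        "max_destination_packets": max(dest_packets.values()),
        "max_destination_fan_in": max(dest_fan_in.values()),
    }

# One hypothetical time window of five packets.
window = [("10.0.0.1", "10.0.0.2")] * 3 + [
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.4", "10.0.0.2"),
]
stats = aggregates(traffic_matrix(window, salt="example-salt"))
```

Because hashing maps equal addresses to equal labels, every aggregate is the same as it would be on the raw addresses, which is why this kind of analysis can proceed on anonymized data. Only the nonzero entries are stored, so memory scales with the number of unique links rather than with the size of the address space.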
Hyperscale Network Analysis
NoSQL databases are the backbone of many hyperscale web companies (e.g., Google, Amazon, and Facebook). The Apache Accumulo database is the highest performance open source NoSQL database in the world and is widely used for government cyber applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) provides a uniform mathematical framework based on associative arrays that encompasses SQL and NoSQL databases. For NoSQL databases, D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. This work will focus on optimizing D4M to provide high performance analytics for cyber data stored in Apache Accumulo.
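One way to picture a schema that indexes every unique string is to "explode" each record so that every field and value pair becomes its own column, and to keep a transposed copy so lookups by value are as direct as lookups by record. The sketch below does this with in-memory dictionaries. It is an illustration of the idea only, not the D4M or Accumulo API. The table layout, the "field|value" column naming, and all names in the code are assumptions.

```python
def explode(records):
    """Build three lookup tables from {row_id: {field: value}} records.

    edge[row][col] = 1     one column per "field|value" string
    edge_t[col][row] = 1   transpose, for lookup by any value
    degree[col] = count    how many rows contain each "field|value"
    """
    edge, edge_t, degree = {}, {}, {}
    for row, fields in records.items():
        for field, value in fields.items():
            col = f"{field}|{value}"
            edge.setdefault(row, {})[col] = 1
            edge_t.setdefault(col, {})[row] = 1
            degree[col] = degree.get(col, 0) + 1
    return edge, edge_t, degree

# Two hypothetical log records.
records = {
    "log1": {"src": "a", "dst": "b"},
    "log2": {"src": "a", "dst": "c"},
}
edge, edge_t, degree = explode(records)

# All records whose source is "a", found with a single key lookup.
rows_with_src_a = sorted(edge_t["src|a"])
```

In a sorted key-value store, the transposed table turns "find every record containing this string" into a single row read, and the degree table lets a query planner check how common a string is before fetching its rows.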
Unexplained Universalities of the Internet
Heavy-tail distributions, where the probability decays slower than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …). Computer networks are among the most notable examples of heavy-tail distributions, whose celebrated discovery led to the creation of the new field of Network Science. Since the initial flurry of phenomenological observations and theories, reduction to practice has been slowed by a lack of sufficient data necessary to test detailed models with adequate spatial/temporal diversity. Our recent construction of the largest publicly available network data sets with over 100 trillion events has overcome this obstacle resulting in highly detailed models with 0.1% accuracies over a wide range of locations and timescales. These measurements affirm the ubiquity of heavy-tail distributions in network data and have led to the development of a number of ad-hoc AI approaches for developing precision models that can serve as the basis of wide-area cyber anomaly detection systems. However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests (e.g., expected variance) for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa). This Pd/Pfa paradox is consistent with the lived experience of many cyber operators and a possible root cause is the potential incompatibility of light-tail statistical tests on heavy-tail data (see Figure 1). The goal of this effort is to develop the necessary theoretical understanding of heavy-tail distributions within a network context to provide rigorously founded AI approaches to network anomaly detection.