Student Projects
About Us
Our team creates powerful supercomputers for modeling and analyzing the most complex problems and the largest data sets to enable revolutionary discoveries and capabilities. Many of these capabilities have been developed and published in partnership with amazing students (see our Google Scholar page).
Here is a selection of videos describing some of our work:
- MIT Lincoln Laboratory Supercomputing Center
- Large Scale Parallel Sparse Matrix Streaming Graph/Network Analysis [ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) Keynote Talk]
- Beyond Zero Botnets: Web3 Enabled Observe-Pursue-Counter Approach [TEDx Boston Studio MIT Imagination in Action]
- HPC Server 3D Representation
- Building the Massachusetts Green High Performance Computing Center (the largest open research data center in the world)
Listed below are a wide range of potential projects in AI, Mathematics, Green Computing, Supercomputing Systems, Online Learning, and Connection Science. If you are interested in any of these projects, please send us an email at supercloud@mit.edu (please avoid using ChatGPT or another LLM to write your email).
AI Projects
- AI Analysis of User Interactions
The SuperCloud team communicates with and assists users via email. What can we learn from these email and Zoom communications to improve SuperCloud and the user experience? This project will build an infrastructure to collect, parse, and analyze these communications to answer that question.
- Hierarchical Anonymized AI
The costs of adversarial activity on networks are growing at an alarming rate and have reached $1T per year. 90% of Americans are now concerned about cyber-attacks; a level of public concern that is greater than pandemics and nuclear war. In the land, sea, undersea, air, and space operating domains observe-pursue-counter (detect-handoff-intercept) walls-out architectures have proven cost effective. Our recent innovations in high performance privacy-preserving network sensing and analysis offer new opportunities for obtaining the required observations to enable such architectures in the cyber domain. Using these network observations to pursue and counter adversarial activity requires the development of novel privacy-preserving hierarchical AI analytics techniques that explore connections both within and across the layers of the knowledge pyramid from low-level network traffic to high-level social media.
References:
- Zero Botnets: An Observe-Pursue-Counter Approach
- Realizing Forward Defense in the Cyber Domain
- GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Temporal Correlation of Internet Observatories and Outposts
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
- Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
- What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic
- GraphBLAS
- Novel-precision computing and algorithms for AI accelerators
AI has triggered an explosion of new capabilities for computing with a wide range of numerical precisions. Algorithms that can take advantage of these innovations can be accelerated by many orders of magnitude, and future progress in computation depends on the development of these new approaches. This draws directly from the success seen in training neural networks and language models of increasing size and complexity. In essence, they apply ever smaller bit counts as the computation permits: inference tasks require less precision than training, which in turn splits the required precision between gradients, which can undergo large magnitude swings, and weights, which need more accuracy but tend to use fewer exponent bits. New numerical algorithms need to be developed that make this mixed-precision approach their main focus. In this effort, we will exploit these new hardware floating-point formats to drive the implementation of algorithms in this burgeoning field, including numerical methods, their convergence properties, and performance considerations when applied to HPC benchmarking.
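As a small illustration of why precision choice matters (a toy sketch with assumed values, not a method from this project), the snippet below sums many small numbers twice: once with a half-precision (float16) accumulator, and once with a single-precision (float32) accumulator over the same float16 inputs. The half-precision running sum eventually becomes so large relative to each addend that further additions round away entirely, while the mixed-precision version stays accurate.

```python
import numpy as np

# 10,000 small values stored in half precision (illustrative data).
values = np.full(10000, 0.0001, dtype=np.float16)

naive = np.float16(0)
for v in values:                 # accumulate entirely in float16
    naive = np.float16(naive + v)

mixed = np.float32(0)
for v in values:                 # same float16 data, float32 accumulator
    mixed += np.float32(v)

exact = 10000 * 0.0001           # ideal sum is 1.0
print("float16 accumulator:", float(naive))   # stalls well below 1.0
print("float32 accumulator:", float(mixed))   # close to 1.0
```

The float16 accumulator stalls once the sum's rounding step exceeds the addend, which is exactly the kind of behavior mixed-precision algorithms are designed to manage.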
Mathematics Projects
- Mathematics of Big Data & Machine Learning
Big Data describes a new era in the digital age where the volume, velocity, and variety of data created across a wide range of fields is increasing at a rate well beyond our ability to analyze the data. Machine Learning has emerged as a powerful tool for transforming this data into usable information. Many technologies (e.g., spreadsheets, databases, graphs, matrices, deep neural networks, …) have been developed to address these challenges. The common theme amongst these technologies is the need to store and operate on data as tabular collections instead of as individual data elements. This project explores the common mathematical foundation of these tabular collections (associative arrays), which applies across a wide range of applications and technologies. Associative arrays unify and simplify Big Data and Machine Learning. Understanding these mathematical foundations enables seeing past the differences that lie on the surface of Big Data and Machine Learning applications and technologies and leveraging their core mathematical similarities to solve the hardest Big Data and Machine Learning challenges.
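To make the associative array idea concrete, here is a minimal toy sketch (plain Python dicts, not the actual D4M API): a mapping from (row key, column key) pairs to values that supports element-wise addition and a standard (+, ×) array product, the two operations that let the same object act as a spreadsheet, a database table, a graph, or a matrix.

```python
# Toy associative array: {(row key, col key): value}. Hypothetical class,
# written only to illustrate the mathematics described above.
class AssocArray:
    def __init__(self, entries=None):
        self.data = dict(entries or {})

    def __add__(self, other):
        # Element-wise sum over the union of keys.
        out = dict(self.data)
        for k, v in other.data.items():
            out[k] = out.get(k, 0) + v
        return AssocArray(out)

    def matmul(self, other):
        # Standard (+, *) product over matching inner keys.
        out = {}
        for (r, k1), v1 in self.data.items():
            for (k2, c), v2 in other.data.items():
                if k1 == k2:
                    out[(r, c)] = out.get((r, c), 0) + v1 * v2
        return AssocArray(out)

# An "edge table": rows are source vertices, columns are destinations.
E = AssocArray({("a", "b"): 1, ("b", "c"): 1})
paths2 = E.matmul(E)      # squaring the edge table counts two-hop paths
print(paths2.data)        # {("a", "c"): 1}
```

The same two operations, with string keys instead of integer indices, are what let associative arrays span database queries and matrix mathematics.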
References:
- Catastrophe vs Conspiracy: Heavy Tail Statistics
Heavy-tail distributions, where the probability decays slower than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …). Computer networks are among the most notable examples of heavy-tail distributions, whose celebrated discovery led to the creation of the new field of Network Science. However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa). This Pd/Pfa paradox is consistent with the lived experience of many cyber operators and a possible root cause is the potential incompatibility of light-tail statistical tests on heavy-tail data. The goal of this effort is to develop the necessary educational and training tools for effectively understanding and applying heavy-tail distributions in a cyber context.
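The Pd/Pfa tension described above can be demonstrated in a few lines (a toy simulation with assumed distribution parameters, not data from this project): a classic "mean + 3 sigma" alarm threshold, which fires rarely on Gaussian (light-tail) data, fires constantly when the underlying data is heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Light-tail baseline vs. heavy-tail traffic-like data (Pareto/Lomax,
# tail index 1.5 -- illustrative parameters only).
gauss = rng.normal(loc=1.0, scale=1.0, size=n)
pareto = rng.pareto(a=1.5, size=n) + 1.0

# Classic light-tail alarm rule: flag anything beyond mean + 3 sigma.
thresh = gauss.mean() + 3 * gauss.std()

gauss_rate = (gauss > thresh).mean()    # roughly the 3-sigma tail, ~1e-3
pareto_rate = (pareto > thresh).mean()  # orders of magnitude higher
print("alarm rate on Gaussian data:", gauss_rate)
print("alarm rate on heavy-tail data:", pareto_rate)
```

No threshold setting rescues the light-tail test here: raising it to suppress the heavy-tail false alarms also destroys detection sensitivity, which is the Pd/Pfa paradox in miniature.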
References:
- New Phenomena in Large-Scale Internet Traffic
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation
- Temporal Correlation of Internet Observatories and Outposts
- Hybrid Power-Law Models of Network Traffic
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
- Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
- From Bits to Insights: Exploring Network Traffic, Traffic Matrices, and Heavy-Tailed Data
- Teaching Network Traffic Matrices in an Interactive Game Environment
- Interactive Trillion Packet Anonymized Network Analysis with the GraphBLAS
- Abstract Algebra of Cyberspace
Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and associative array algebra. This work will explore a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for network/graph analytics, database operations, and machine learning.
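As a small illustration of the semiring machinery this project builds on (toy code with made-up data; the semilink concept itself goes beyond this sketch), here is one step of breadth-first search expressed as a matrix-vector product over the Boolean (OR, AND) semiring: a vertex is reached if ANY frontier vertex has an edge to it.

```python
# Adjacency structure of a tiny graph: source -> set of destinations.
# Sets of keys stand in for the Boolean hypersparse vectors/matrices.
adj = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
}

def bfs_step(frontier, adj):
    # v_next = A^T (AND, OR) v: OR over frontier vertices of their
    # out-edges -- the Boolean-semiring matrix-vector product.
    reached = set()
    for v in frontier:
        reached |= adj.get(v, set())
    return reached

frontier = {"a"}
frontier = bfs_step(frontier, adj)   # {"b", "c"}
frontier = bfs_step(frontier, adj)   # {"d"}
print(frontier)
```

Swapping (OR, AND) for other semiring pairs turns the same traversal skeleton into shortest-path, reachability-counting, or machine learning kernels, which is the unification the associative array algebra provides.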
References:
- Mathematical Underpinnings of Associative Array Algebra
Semirings have found success as an algebraic structure that can support the variety of data types and operations used by those working with graphs, matrices, spreadsheets, and databases, and they form the mathematical foundation of the associative array algebra of D4M and the matrix algebra of GraphBLAS. Mirroring the fact that module theory has many but not all of the structural guarantees of vector space theory, semimodule theory has some but not all of the structural guarantees of module theory. The added generality of semirings allows semimodule theory to consider structures wholly unlike rings and fields, such as Boolean algebras and the max-plus algebra. By focusing on these special cases, which are diametrically opposed to the traditional ring and field cases, analogs of standard linear algebra, such as eigenanalysis and solving linear systems, can be developed. This work will further explore the theory of semirings in the form of solving linear systems, carrying out eigenanalysis, and graph algorithms.
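A concrete taste of the max-plus algebra mentioned above (a toy sketch, not from the referenced works): replace the usual (+, ×) of matrix multiplication with (max, +), with negative infinity as the additive identity. Powers of a weighted adjacency matrix in this semiring compute best path weights.

```python
# Max-plus "matrix multiply": (+, *) becomes (max, +).
NEG_INF = float("-inf")   # additive identity of the max-plus semiring

def maxplus_matmul(A, B):
    n = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

# Edge weights of a 3-node graph; NEG_INF marks "no edge".
A = [[NEG_INF, 2, NEG_INF],
     [NEG_INF, NEG_INF, 3],
     [NEG_INF, NEG_INF, NEG_INF]]

A2 = maxplus_matmul(A, A)
print(A2[0][2])   # best 2-hop weight from node 0 to node 2: 2 + 3 = 5
```

Note that (max, +) has no subtraction, which is exactly why eigenanalysis and linear-system solving in this setting require the semimodule theory described above rather than ordinary linear algebra.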
References:
- Mathematics of Big Data
- Linear Systems over Join-Blank Algebras
- Algebraic Conditions on One-Step Breadth-First Search
- Eigenvalue Distribution of Max-Plus Random Matrices
- GraphBLAS Mathematical Opportunities: Parallel Hyperspace, Matrix Based Graph Streaming, and Complex-Index Matrices
- Graphs, Dioids and Semirings
- Semirings and their Applications
- GraphBLAS
- D4M
Supercomputing Systems Projects
- MIT/Stanford Next Generation Operating System
The goal of the MIT/Stanford DBOS (the DBMS-oriented Operating System) is to build a completely new operating system (OS) stack for distributed systems. Currently, distributed systems are built on many instances of a single-node OS like Linux with entirely separate cluster schedulers, distributed file systems, and network managers. DBOS uses a distributed transactional DBMS as the basis for a scalable cluster OS. We have shown that such a database OS can do scheduling, file management, and inter-process communication with performance competitive with existing systems. It can additionally provide significantly better analytics and dramatically reduce code complexity by building core OS services from standard database queries, while implementing low-latency transactions and high availability only once. We are currently working on building a complete end-to-end prototype of DBOS. This project will explore implementing next generation cyber analytics within DBOS.
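To illustrate the "OS services from database queries" idea in miniature (a hypothetical toy schema using SQLite, not the actual DBOS implementation), the sketch below expresses a scheduling decision, placing a task on the least-loaded worker, as an ordinary transactional query over cluster state instead of bespoke scheduler code.

```python
import sqlite3

# Cluster state lives in a database (in-memory here for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workers (id TEXT PRIMARY KEY, running_tasks INT)")
db.executemany("INSERT INTO workers VALUES (?, ?)",
               [("w1", 5), ("w2", 2), ("w3", 7)])

# "Scheduling" = one query: pick the least-loaded worker...
(worker,) = db.execute(
    "SELECT id FROM workers ORDER BY running_tasks ASC LIMIT 1").fetchone()

# ...and record the placement as an ordinary update.
db.execute("UPDATE workers SET running_tasks = running_tasks + 1 "
           "WHERE id = ?", (worker,))
print("task placed on:", worker)
```

Because the scheduler's state is just rows in a table, analytics over scheduling history become standard queries, which is the kind of payoff the DBOS description above points to.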
References:
- DBOS: a DBMS-Oriented Operating System
- DBOS
- GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
- Hypersparse Network Flow Analysis of Packets with GraphBLAS
- Temporal Correlation of Internet Observatories and Outposts
- Python Implementation of the Dynamic Distributed Dimensional Data Model
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Deployment of Real-Time Network Traffic Analysis using GraphBLAS Hypersparse Matrices and D4M Associative Arrays
- High-Performance Computing (HPC) Security: Architecture, Threat Analysis, and Security Posture
- DBOS Network Sensing: A Web Services Approach to Collaborative Awareness
- D4M
- GraphBLAS
- Parallel Python Programming
There is a plethora of libraries that enable parallel programming in the Python programming language, but little has been done with the partitioned global array semantics (PGAS) approach. Using the PGAS approach, one can deploy a parallel capability that provides good speed-up without sacrificing the ease of programming in Python. This project will explore the scalability and performance of a preliminary implementation of PGAS in Python, compare its performance with other libraries available for Python parallel programming, and potentially seek further performance optimization of the current PGAS implementation.
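The core PGAS idea can be sketched in a few lines (a single-process toy, not the project's actual library): a global array is partitioned across ranks, each rank owns a contiguous local slice, and global indices are mapped to (owner rank, local index) pairs so code can be written against global indices while storage stays distributed.

```python
import numpy as np

class PGASArray:
    """Toy partitioned global array: global indices, per-rank local slices."""

    def __init__(self, n, nranks):
        self.n, self.nranks = n, nranks
        # bounds[r]..bounds[r+1] is the global index range owned by rank r.
        self.bounds = [n * r // nranks for r in range(nranks + 1)]
        self.local = [np.zeros(self.bounds[r + 1] - self.bounds[r])
                      for r in range(nranks)]

    def owner(self, i):
        # Map a global index to (owner rank, local index).
        for r in range(self.nranks):
            if self.bounds[r] <= i < self.bounds[r + 1]:
                return r, i - self.bounds[r]

    def __setitem__(self, i, v):
        r, j = self.owner(i)
        self.local[r][j] = v

    def __getitem__(self, i):
        r, j = self.owner(i)
        return self.local[r][j]

g = PGASArray(n=10, nranks=4)
for i in range(10):
    g[i] = i * i          # writes use global indices, storage is per-rank
print(g.owner(7), g[7])
```

In a real PGAS runtime each rank would be a separate process and remote accesses would go over the network; the index map above is the piece that makes the distribution transparent to user code.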
References:
- 3D Visualization of Supercomputer Performance
There are a number of data collection and visualization tools to assist in the real-time performance analysis of High Performance Computing (HPC) systems, but there is a need to analyze past performance for systems troubleshooting and system behavior analysis. Optimizing HPC systems for processing speed, power consumption, and network utilization can be difficult to do in real time, so a system that uses collected data to “rerun” system performance would be advantageous. Gaming engines, like Unity 3D, can be used to build virtual system representations and run scenarios using historical or manufactured data to identify system failures or bottlenecks and to fine-tune performance metrics.
References:
- Data Analytics and 3D Game Development
The LLSC operates and maintains a large number of High Performance Computing clusters for general-purpose accelerated discovery across many research domains. The operation of these systems requires access to detailed information regarding the status of system schedulers, storage arrays, compute nodes, network data, and data center conditions. These data collections represent over 80 million data points per day. Effectively correlating this volume of data into actionable information requires innovative approaches and tools. The LLSC has developed a 3D Monitoring and Management (M&M) platform by leveraging Unity3D to render the physical data center space and assets into a virtual environment that strives to provide a holistic view of the HPC resources in a human-digestible format. Our goal is to achieve a level of situational awareness that enables the operations team to identify and correct issues before they negatively impact the user experience. Some near-term goals are to fold the innovative Green Data Center challenge work and data into the M&M system to enable the identification of the carbon impacts of different job profiles across a heterogeneous compute environment.
References:
- Large Scale Network Situational Awareness via 3D Gaming Technology
- Big Data Strategies for Data Center Infrastructure Management using a 3D Gaming Platform
- Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System
- 3D Real-Time Supercomputer Monitoring
- A Green(er) World for AI
- Unity game development platform
Education, Training, and Outreach Projects
- Predicting future training needs
The ability to provide the necessary documentation and training for our researchers requires that we understand the suite of applications, workflows, and software tools currently in use and that we develop insight into future trends in applications, workflow development, and software tool selection. To do this, we need to collect and analyze data from jobs run on the LLSC-SuperCloud systems, researcher help requests, the educational platform, and the user database.
Current projects in this area include determining the data required to provide insight into our training and research support needs and developing a corresponding data set. For example, in order to identify education and training gaps and design a prioritized suite of new examples, we need a clear picture of who our users are, what their applications require, how often they use the system, how much of the system they use, and what their usage patterns look like over time.
- Evaluating Training Effectiveness
The ETO team is interested in using data-driven processes to evaluate the effectiveness of our education and training modules. This research effort evaluates the impact of informal training on a researcher's HPC understanding and growth. Using data from our courses and from researchers' use of the supercomputing system: do researchers use the system effectively? Have they aligned their workflow with one of the canonical HPC workflows? Are they requesting the proper system resources, and are they using all that they have requested?
Connection Science Projects
- Where to Look – Cyber Sensor Placement and Calibration
Placement and calibration are critical to the effective functioning of any sensor in any domain: land, sea, air, space, and cyberspace. Cyber sensors have the potential to provide a wide range of information to protect systems. Furthermore, the potential analytics that can be applied to such information are enormous. However, the effectiveness of any cyber analytic is based on underlying assumptions about the data that are intrinsically tied to the placement and calibration of the cyber sensor. Proper placement and calibration can improve cyber capability by as much as 1000x. This project will explore the key assumptions being made in cyber analytics, calibration processes, and the potential placement of cyber sensors with the goal of developing best practices for cyber operators.
References:
- Zero Botnets: An Observe-Pursue-Counter Approach
- Realizing Forward Defense in the Cyber Domain
- Temporal Correlation of Internet Observatories and Outposts
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Hypersparse Network Flow Analysis of Packets with GraphBLAS
- Rulemaking for Insider Threat Mitigation
- Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
- Anonymized Network Analysis on the Edge
Long range detection is a cornerstone of defense in many operating domains (land, sea, undersea, air, space, …). In the cyber domain, long range detection requires the analysis of significant network traffic from a variety of observatories and outposts with the highest regard for privacy. Construction of anonymized hypersparse traffic matrices on edge network devices can be a key enabler by providing significant data compression in a rapidly analyzable format that protects privacy. GraphBLAS is ideally suited for both constructing and analyzing anonymized hypersparse traffic matrices. This project will explore the deployment of this capability on a variety of edge hardware.
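The construction step can be sketched as follows (a toy illustration with a hypothetical keyed-hash anonymization scheme, not the project's actual method): source and destination addresses are replaced by keyed hashes before being used as matrix coordinates, so the resulting traffic matrix preserves packet counts and connectivity structure without exposing raw addresses.

```python
import hashlib

# Per-site secret key; in a real deployment it would never leave the
# edge device. (Illustrative value only.)
KEY = b"site-secret"

def anon(ip: str, buckets: int = 2**20) -> int:
    """Keyed hash of an IP string into a matrix coordinate (toy scheme)."""
    h = hashlib.sha256(KEY + ip.encode()).digest()
    return int.from_bytes(h[:8], "big") % buckets

# A few observed packets: (source IP, destination IP).
packets = [("10.0.0.1", "10.0.0.9"), ("10.0.0.1", "10.0.0.9"),
           ("10.0.0.2", "10.0.0.9")]

# Hypersparse traffic matrix stored as {(src index, dst index): count}.
traffic = {}
for src, dst in packets:
    key = (anon(src), anon(dst))
    traffic[key] = traffic.get(key, 0) + 1

print(sorted(traffic.values()))   # packet counts survive; addresses do not
```

The same keyed hash applied at every sensor keeps matrices from different sites correlatable with each other while remaining anonymized, which is what enables the multi-observatory analyses cited below.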
References:
- GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
- Hypersparse Network Flow Analysis of Packets with GraphBLAS
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Deployment of Real-Time Network Traffic Analysis using GraphBLAS Hypersparse Matrices and D4M Associative Arrays
- Anonymized Network Sensing Graph Challenge
- Nvidia Data Processing Unit
- GraphBLAS
- Integrated Library for Advancing Network Data Science
- Next Generation Spatial Temporal Cyber Data Products
Internet analysis is a major challenge due to the volume and rate of network traffic. In lieu of analyzing traffic as raw packets, network analysts often rely on compressed network flows (netflows) that contain the start time, stop time, source, destination, and number of packets in each direction. However, many traffic analyses benefit from temporal aggregation of multiple simultaneous netflows, which can be computationally challenging. To alleviate this concern, a novel netflow compression and resampling method has been developed leveraging GraphBLAS hypersparse traffic matrices that preserve anonymization while enabling subrange analysis. Standard multitemporal spatial analyses are then performed on each subrange to generate detailed statistical aggregates of the source packets, source fan-out, unique links, destination fan-in, and destination packets of each subrange, which can then be used for background modeling and anomaly detection. A simple file format based on GraphBLAS sparse matrices has been developed for storing these statistical aggregates. The resulting compression achieved is significant (<0.1 bit per packet), enabling extremely large netflow analyses to be stored and transported. This project will explore extending this approach to create the next generation of standard cyber data products.
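The five per-subrange aggregates named above fall directly out of a traffic matrix (a toy sketch with made-up counts, using plain dicts in place of GraphBLAS hypersparse matrices): source packets and destination packets are row and column sums, fan-out and fan-in are row and column nonzero counts, and unique links is the total nonzero count.

```python
# Toy traffic matrix for one subrange: {(src index, dst index): packets}.
T = {(1, 7): 4, (1, 8): 1, (2, 7): 2}

src_packets, src_fanout = {}, {}
dst_packets, dst_fanin = {}, {}
for (s, d), p in T.items():
    src_packets[s] = src_packets.get(s, 0) + p   # row sums
    src_fanout[s] = src_fanout.get(s, 0) + 1     # row nonzero counts
    dst_packets[d] = dst_packets.get(d, 0) + p   # column sums
    dst_fanin[d] = dst_fanin.get(d, 0) + 1       # column nonzero counts

unique_links = len(T)                            # total nonzero count
print("source 1 packets:", src_packets[1], "fan-out:", src_fanout[1])
print("dest 7 packets:", dst_packets[7], "fan-in:", dst_fanin[7])
print("unique links:", unique_links)
```

Storing only these aggregates per subrange, rather than the flows themselves, is what makes the sub-0.1-bit-per-packet compression figure plausible.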
References:
- Hypersparse Network Flow Analysis of Packets with GraphBLAS
- GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Temporal Correlation of Internet Observatories and Outposts
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
- Interactive Trillion Packet Anonymized Network Analysis with the GraphBLAS
- GraphBLAS
- Hyperscale Network Analysis
NoSQL databases are the backbone of many hyperscale web companies (e.g., Google, Amazon, and Facebook). The Apache Accumulo database is the highest performance open source NoSQL database in the world and is widely used for government cyber applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) provides a uniform mathematical framework based on associative arrays that encompasses SQL and NoSQL databases. For NoSQL databases, D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. This work will focus on optimizing D4M to provide high-performance analytics for cyber data stored in Apache Accumulo.
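The "index every unique string" idea behind the D4M 2.0 schema can be sketched in miniature (a toy dict version, not the real Accumulo tables): every field of every record is "exploded" into a column|value string, and a transpose table maps each such string back to its rows, so any value of any field is one indexed lookup away.

```python
# Toy records: (row key, {field: value}). Illustrative data only.
records = [("row1", {"src": "10.0.0.1", "dst": "10.0.0.9"}),
           ("row2", {"src": "10.0.0.1", "dst": "10.0.0.7"})]

E, ET = {}, {}   # edge table and its transpose
for row, fields in records:
    for col, val in fields.items():
        key = f"{col}|{val}"            # exploded column key
        E.setdefault(row, set()).add(key)
        ET.setdefault(key, set()).add(row)

# One indexed lookup finds every record with a given field value:
print(sorted(ET["src|10.0.0.1"]))       # ['row1', 'row2']
```

In the actual schema, E and ET are Accumulo tables whose sorted row keys give these lookups their speed; the dicts above only illustrate the data layout.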
References:
- Python Implementation of the Dynamic Distributed Dimensional Data Model
- Hypersparse Network Flow Analysis of Packets with GraphBLAS
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Temporal Correlation of Internet Observatories and Outposts
- Vertical, Temporal, and Horizontal Scaling of Hierarchical Hypersparse GraphBLAS Matrices
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
- Anonymized Network Sensing Graph Challenge
- Interactive Trillion Packet Anonymized Network Analysis with the GraphBLAS
- D4M 2.0 Schema: a General Purpose High Performance Schema for the Accumulo Database
- D4M
- Apache Accumulo
- GraphBLAS
- Unexplained Universalities of the Internet
Heavy-tail distributions, where the probability decays slower than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …). Computer networks are among the most notable examples of heavy-tail distributions, whose celebrated discovery led to the creation of the new field of Network Science. Since the initial flurry of phenomenological observations and theories, reduction to practice has been slowed by a lack of sufficient data necessary to test detailed models with adequate spatial/temporal diversity. Our recent construction of the largest publicly available network data sets with over 100 trillion events has overcome this obstacle, resulting in highly detailed models with 0.1% accuracies over a wide range of locations and timescales. These measurements affirm the ubiquity of heavy-tail distributions in network data and have led to the development of a number of ad-hoc AI approaches for developing precision models that can serve as the basis of wide-area cyber anomaly detection systems. However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests (e.g., expected variance) for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa). This Pd/Pfa paradox is consistent with the lived experience of many cyber operators and a possible root cause is the potential incompatibility of light-tail statistical tests on heavy-tail data (see Figure 1). The goal of this effort is to develop the necessary theoretical understanding of heavy-tail distributions within a network context to provide rigorously founded AI approaches to network anomaly detection.
References:
- New Phenomena in Large-Scale Internet Traffic
- Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
- Temporal Correlation of Internet Observatories and Outposts
- Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets
- Hybrid Power-Law Models of Network Traffic
- Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets
- Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
- Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
- Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations
- What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic
