Student Projects

About Us

Our team creates powerful supercomputers for modeling and analyzing the most complex problems and the largest data sets to enable revolutionary discoveries and capabilities.  Many of these capabilities have been developed and published in partnership with amazing students (see our Google Scholar page).

Here is a selection of videos describing some of our work:
• MIT Lincoln Laboratory Supercomputing Center
• Large Scale Parallel Sparse Matrix Streaming Graph/Network Analysis [ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) Keynote Talk]
• Beyond Zero Botnets: Web3 Enabled Observe-Pursue-Counter Approach [TEDx Boston Studio MIT Imagination in Action]
• HPC Server 3D Representation
• Building the Massachusetts Green High Performance Computing Center (the largest open research data center in the world)

Listed below are a wide range of potential projects in AI, Mathematics, Green Computing, Supercomputing Systems, Online Learning, and Connection Science.  If you are interested in any of these projects, please send us an email at supercloud@mit.edu (please avoid using ChatGPT or another LLM to write your email).

AI Projects

Mathematics Projects

  • Mathematics of Big Data & Machine Learning
    Big Data describes a new era in the digital age where the volume, velocity, and variety of data created across a wide range of fields is increasing at a rate well beyond our ability to analyze it.  Machine Learning has emerged as a powerful tool for transforming this data into usable information.  Many technologies (e.g., spreadsheets, databases, graphs, matrices, deep neural networks, ...) have been developed to address these challenges.  The common theme among these technologies is the need to store and operate on data as tabular collections instead of as individual data elements.  This project explores the common mathematical foundation of these tabular collections (associative arrays), which applies across a wide range of applications and technologies.  Associative arrays unify and simplify Big Data and Machine Learning.  Understanding these mathematical foundations enables seeing past the differences that lie on the surface of Big Data and Machine Learning applications and technologies and leveraging their core mathematical similarities to solve the hardest Big Data and Machine Learning challenges.
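    As a toy illustration of the associative-array idea, the sketch below stores a sparse array as a Python dict keyed by (row, column) pairs and combines arrays with addition and array multiplication. The function names and data are hypothetical; this is not the D4M API.

    ```python
    # Minimal associative-array sketch: keys are (row, col) pairs, values are numbers.
    # Illustrative toy only -- not the D4M API.

    def aa_add(A, B):
        """Element-wise addition: union of keys, summed values."""
        C = dict(A)
        for k, v in B.items():
            C[k] = C.get(k, 0) + v
        return C

    def aa_matmul(A, B):
        """Array multiplication: sparse matrix product over (+, *)."""
        C = {}
        for (i, k1), a in A.items():
            for (k2, j), b in B.items():
                if k1 == k2:
                    C[(i, j)] = C.get((i, j), 0) + a * b
        return C

    A = {("alice", "x"): 1, ("bob", "y"): 2}
    B = {("x", "doc1"): 3, ("y", "doc2"): 4}
    print(aa_matmul(A, B))   # → {('alice', 'doc1'): 3, ('bob', 'doc2'): 8}
    ```

    The same two operations work whether the keys are row/column labels of a spreadsheet, vertex names of a graph, or database keys, which is the unification the project description refers to.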
    References:
    • Mathematics of Big Data

  • Catastrophe vs Conspiracy: Heavy Tail Statistics
    Heavy-tail distributions, where the probability decays more slowly than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …).  Computer networks are among the most notable examples of heavy-tail distributions, whose celebrated discovery led to the creation of the new field of Network Science.  However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa).  This Pd/Pfa paradox is consistent with the lived experience of many cyber operators, and a possible root cause is the potential incompatibility of light-tail statistical tests with heavy-tail data.  The goal of this effort is to develop the necessary educational and training tools for effectively understanding and applying heavy-tail distributions in a cyber context.
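    The light-tail vs heavy-tail gap can be seen in a few lines of simulation: at the same threshold, exponential (light-tail) samples essentially never exceed it, while Pareto (heavy-tail) samples do so routinely. The threshold and tail index below are illustrative choices.

    ```python
    import random

    random.seed(0)
    n = 100_000
    # Light-tail samples: exponential with mean 1, so P(X > x) = exp(-x).
    expo = [random.expovariate(1.0) for _ in range(n)]
    # Heavy-tail samples: Pareto with tail index alpha = 1.5, so P(X > x) = x**-1.5.
    pareto = [random.paretovariate(1.5) for _ in range(n)]

    threshold = 20.0
    p_expo = sum(x > threshold for x in expo) / n
    p_pareto = sum(x > threshold for x in pareto) / n
    print(p_expo, p_pareto)
    # Theory: exponential tail exp(-20) ≈ 2e-9 (effectively never exceeded here),
    # Pareto tail 20**-1.5 ≈ 0.011 -- a detector thresholded for the light tail
    # would fire constantly on the heavy-tail data.
    ```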
    References:
    • New Phenomena in Large-Scale Internet Traffic
    • Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
    • The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation
    • Temporal Correlation of Internet Observatories and Outposts
    • Hybrid Power-Law Models of Network Traffic
    • Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
    • Focusing and Calibration of Large Scale Network Sensors using GraphBLAS Anonymized Hypersparse Matrices
    • Mapping of Internet “Coastlines” via Large Scale Anonymized Network Source Correlations

  • Abstract Algebra of Cyberspace
    Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and associative array algebra. This work will explore a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for network/graph analytics, database operations, and machine learning.
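    A minimal sketch of the semiring idea that underlies these algebras: replacing (+, ×) with (min, +) turns matrix multiplication into shortest-path composition on a graph. The small graph below is an illustrative invention.

    ```python
    INF = float("inf")

    def minplus_matmul(A, B):
        """Matrix product over the (min, +) tropical semiring:
        C[i][j] = min_k (A[i][k] + B[k][j]) -- composes shortest paths."""
        n = len(A)
        return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    # Adjacency matrix of a 3-node graph: A[i][j] is the edge weight,
    # INF means no edge, 0 on the diagonal.
    A = [[0,   5,   INF],
         [INF, 0,   2],
         [INF, INF, 0]]

    # "Squaring" A under (min, +) gives shortest paths of up to 2 hops.
    A2 = minplus_matmul(A, A)
    print(A2[0][2])   # → 7: path 0 -> 1 -> 2 with weight 5 + 2
    ```

    Swapping in other semirings (e.g., (max, +), (or, and)) gives different graph analytics from the same matrix code, which is the kind of unification a semilink over pairs of semirings would formalize.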
    References:
    • Mathematics of Big Data
    • Mathematics of Digital Hyperspace
    • Visually Representing the Landscape of Mathematical Structures
    • Polystore Mathematics of Relational Algebra

  • Mathematical Underpinnings of Associative Array Algebra
    Semirings have found success as an algebraic structure that can support the variety of data types and operations used by those working with graphs, matrices, spreadsheets, and databases, and they form the mathematical foundation of the associative array algebra of D4M and the matrix algebra of GraphBLAS.  Mirroring the fact that module theory has many but not all of the structural guarantees of vector space theory, semimodule theory has some but not all of the structural guarantees of module theory.  The added generality of semirings allows semimodule theory to consider structures wholly unlike rings and fields, such as Boolean algebras and the max-plus algebra.  By focusing on these special cases, which are diametrically opposed to the traditional ring and field cases, analogs of standard linear algebra, such as eigenanalysis and the solution of linear systems, can be developed.  This work will further explore the theory of semirings in the form of solving linear systems, carrying out eigenanalysis, and graph algorithms.
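    As one concrete instance of solving linear systems over such an algebra, the sketch below uses residuation to find the greatest subsolution of A ⊗ x = b in the max-plus algebra (a standard construction; the matrix and right-hand side are made-up examples).

    ```python
    def maxplus_matvec(A, x):
        """y_i = max_j (A[i][j] + x[j]) -- matrix-vector product in max-plus."""
        return [max(a + xj for a, xj in zip(row, x)) for row in A]

    def greatest_subsolution(A, b):
        """Residuation: x_j = min_i (b_i - A[i][j]) is the greatest x with
        A ⊗ x <= b (component-wise); if A ⊗ x == b, the system is exactly solvable."""
        n = len(A[0])
        return [min(b[i] - A[i][j] for i in range(len(A))) for j in range(n)]

    A = [[2, 5],
         [3, 1]]
    b = [7, 5]
    x = greatest_subsolution(A, b)
    print(x, maxplus_matvec(A, x))   # → [2, 2] [7, 5], so A ⊗ x = b exactly
    ```

    Unlike the field case, exact solvability is not guaranteed; residuation always yields the best under-approximation, which is why max-plus linear systems behave so differently from their classical analogs.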
    References:
    • Mathematics of Big Data
    • Linear Systems over Join-Blank Algebras
    • Graphs, Dioids and Semirings
    • Semirings and their Applications
    • GraphBLAS
    • D4M

Green Computing Projects

  • Reducing the Carbon Footprint of AI
    AI is increasingly being used in a variety of domains, ranging from vision and speech to cyber and bio and many more. With data and models getting ever larger, the associated carbon footprint of the training and inference operations of these models is also increasing. While awareness in the community is increasing, no significant effort to reduce the energy consumed by AI currently exists. This project aims to develop best practices and recommendations for reducing the carbon footprint of AI through several approaches, including but not limited to:
    · Hardware performance modulation
    · Optimized/energy efficient training and inference pipelines
    · Optimized network architecture searches (NAS) that require less compute
    · Trainless NAS
    · Quantization, distillation, low-precision compute
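    As a toy illustration of the quantization/low-precision item above, the sketch below applies symmetric int8 quantization to a weight list; real pipelines use per-channel scales and calibration data, and the numbers here are hypothetical.

    ```python
    def quantize_int8(weights):
        """Symmetric int8 quantization: w ≈ scale * q with q in [-127, 127].
        Storing q instead of float32 gives a 4x memory reduction."""
        scale = max(abs(w) for w in weights) / 127.0
        q = [round(w / scale) for w in weights]
        return q, scale

    def dequantize(q, scale):
        """Recover approximate float weights from the int8 codes."""
        return [scale * qi for qi in q]

    w = [0.5, -1.27, 0.03, 1.0]
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    err = max(abs(a - b) for a, b in zip(w, w_hat))
    print(q, round(err, 4))   # → [50, -127, 3, 100] 0.0
    ```

    The energy saving comes from moving and multiplying 1-byte integers instead of 4-byte floats; the accuracy cost is the rounding error `err`, which is negligible for this toy example but must be measured per layer in practice.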
    References:
    • Energy-Aware Neural Architecture Selection and Hyperparameter Optimization
    • Loss Curve Approximations for Fast Neural Architecture Ranking & Training Elasticity Estimation
    • From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference
    • Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
    • Sustainable HPC: Modeling, Characterization, and Implications of Carbon Footprint in Modern HPC Systems
    • Green Carbon Footprint for Model Inference Serving via Exploiting Mixed-Quality Models and GPU Partitioning

  • Optimizing Datacenter Operations
    Datacenters are an increasing source of energy consumption and a growing contributor to the global carbon footprint. The energy consumed in a datacenter consists not only of the electricity required to power the actual hardware, but also the energy expended to cool it. While some commercial cloud providers are able to locate datacenters in areas with abundant carbon-free, renewable energy, this is not always possible. Thus, there is a need to reduce the carbon footprint of running these massive datacenters, which provide compute for a wide variety of workloads, ranging from AI/ML to scientific applications, as well as e-commerce.  By optimizing the use of compute inside the datacenter, it may be possible to reduce the power usage effectiveness (PUE) of the datacenter, which can help reduce the carbon footprint significantly. Approaches to enable this goal include:
    · Matching workloads with compute capability – by understanding the compute characteristics of workloads being run at datacenter scale, it may be possible to run workloads on the hardware that is best suited for the task and is most energy efficient.
    · Reconfigurable supercomputing – by characterizing the diversity of datacenter workloads, operators may be able to take advantage of advanced hardware that allows optimal provisioning and partitioning on a weekly basis. This can accommodate diverse workloads optimally based on prior behavior, while reducing energy consumption and also improving hardware availability.
    · Energy modeling – the mix of clean/renewable energy varies by region as well as by the time of year. By modeling energy demand at the datacenter and the seasonal mix of renewables, it may be possible for supercomputing centers to reduce their carbon footprint by shifting compute to “clean” periods.
    · Climate-aware scheduling – current workload schedulers in datacenters are climate agnostic and do not take into account environmental factors when running workloads. Additionally, schedulers do not take into account the physical organization/layout of the hardware in the facility while running user-submitted jobs. Potential approaches to address this are as follows:

  • Scheduling jobs based on external temperature
    Any hardware operating at peak performance generates a significant amount of heat, which necessitates the use of large amounts of cooling in the datacenter. This is in addition to external temperatures over which operators have no control. Thus, if the amount of heat generated inside the datacenter is reduced, this could have a measurable positive impact on the amount of cooling required. One way to achieve this is to schedule power-hungry compute, such as GPU compute, during cooler parts of the day. Another approach is to enable the scheduler to take into account the physical proximity of hardware while scheduling compute jobs. This can help reduce hot-spots in the server racks and potentially lead to more efficient cooling operations.

  • Climate modeling
    Supercomputing center operators can build localized climate models to estimate events such as heatwaves and enable the scheduler to reduce hardware power in the event of anticipated extreme weather. This will have the effect of not only reducing energy consumption, but also ensuring safer and uninterrupted datacenter operations by avoiding potential cooling failures in extreme heat.
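    The energy-modeling and climate-aware-scheduling ideas above can be sketched as a scheduler that defers a flexible job to the lowest-carbon window of the day. The forecast values and function names below are hypothetical.

    ```python
    def schedule_deferrable(job_hours, carbon_forecast):
        """Pick the contiguous window with the lowest total carbon intensity
        (gCO2/kWh per hour) for a deferrable job of job_hours hours."""
        best_start, best_cost = 0, float("inf")
        for start in range(len(carbon_forecast) - job_hours + 1):
            cost = sum(carbon_forecast[start:start + job_hours])
            if cost < best_cost:
                best_start, best_cost = start, cost
        return best_start, best_cost

    # Hypothetical 24-hour forecast: cleaner (lower) mid-day when solar is abundant.
    forecast = [500, 480, 470, 460, 450, 430, 400, 350, 300, 250, 220, 200,
                190, 200, 230, 280, 340, 420, 480, 520, 540, 530, 520, 510]
    start, cost = schedule_deferrable(4, forecast)
    print(start)   # → 10: the 4-hour window starting at hour 10 is cleanest
    ```

    A production scheduler would combine such a forecast with job deadlines, queue fairness, and the datacenter's own thermal model, but the core decision reduces to this kind of windowed minimization.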

  • User Engagement
    Supercomputing operations can be further optimized by actively working with users who are running large scale compute in the datacenter. This can take the form of education for optimizing compute, policies that enable users to choose low-power compute, ability to defer compute to cooler times of the day, etc.

Supercomputing Systems Projects

  • MIT/Stanford Next Generation Operating System
    The goal of the MIT/Stanford DBOS (the DBMS-oriented Operating System) is to build a completely new operating system (OS) stack for distributed systems. Currently, distributed systems are built on many instances of a single-node OS like Linux with entirely separate cluster schedulers, distributed file systems, and network managers. DBOS uses a distributed transactional DBMS as the basis for a scalable cluster OS. We have shown that such a database OS can do scheduling, file management, and inter-process communication with performance competitive with existing systems. It can additionally provide significantly better analytics and dramatically reduce code complexity by building core OS services from standard database queries, while implementing low-latency transactions and high availability only once. We are currently working on building a complete end-to-end prototype of DBOS.  This project will explore implementing next generation cyber analytics within DBOS.
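    As an illustration of building an OS service from a standard database query, a scheduler can be a single SQL statement over a node-status table. The schema and query below are illustrative inventions, not the actual DBOS implementation.

    ```python
    import sqlite3

    # Toy "cluster state" table in an in-memory database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (name TEXT, cpus_free INTEGER, mem_free_gb INTEGER)")
    db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                   [("n1", 4, 32), ("n2", 16, 64), ("n3", 8, 16)])

    # "Schedule" a task needing 8 CPUs and 32 GB: pick the least-loaded feasible node.
    row = db.execute(
        """SELECT name FROM nodes
           WHERE cpus_free >= ? AND mem_free_gb >= ?
           ORDER BY cpus_free DESC LIMIT 1""", (8, 32)).fetchone()
    print(row[0])   # → n2
    ```

    Because the placement decision is a query over transactional state, analytics over scheduling history come for free, which is the code-complexity argument made above.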
    References:
    • DBOS: a DBMS-Oriented Operating System
    • DBOS
    • GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
    • Hypersparse Network Flow Analysis of Packets with GraphBLAS
    • Temporal Correlation of Internet Observatories and Outposts
    • Python Implementation of the Dynamic Distributed Dimensional Data Model
    • Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
    • Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
    • Deployment of Real-Time Network Traffic Analysis using GraphBLAS Hypersparse Matrices and D4M Associative Arrays
    • High-Performance Computing (HPC) Security: Architecture, Threat Analysis, and Security Posture
    • D4M
    • GraphBLAS

  • Supercomputing and Cloud Interoperability
    Shared supercomputing resources typically offer a limited set of hardware and software combinations that researchers can leverage. At the same time, the absolute number of resources offered is also physically limited. Commercial cloud providers can offer an avenue to leverage additional resources as well as new/unique hardware as needs arise. Thus, having the technology to seamlessly transition between a shared resource such as the MIT SuperCloud and a commercial cloud provider (AWS, Microsoft Azure, Google Cloud) can significantly increase user productivity and enable new research. This project aims to:
    · Make the MIT SuperCloud software stack available as a deployable image on commercial cloud providers
    · Develop tools to seamlessly transition between SuperCloud and cloud as requirements change
    · Enable sponsors and funding agencies to provide a standard AI stack that can be leveraged by performers and the broader community

  • Performance Tuning of Large-Scale Cluster Management Systems
    Modern supercomputers rely on a collection of open-source and bespoke software to handle node and user provisioning, system configuration, configuration persistence, change management, monitoring and metrics gathering, and imaging.  The MIT SuperCloud system’s routine monthly maintenance includes a full reimage and reinstall of the operating system, all software, and configuration files to ensure the reliability of our imaging system as well as to maintain a consistent state for our users, preventing the accumulation of incidental changes which could complicate troubleshooting and interfere with the running of user jobs.  The frequency with which we reimage nodes necessitates that the process be streamlined and optimized such that node reinstallation is as quick and reliable as possible.  This project would explore methods to refine our node installation procedures and search for new efficiencies, furthering our ability to manage and maintain very large systems.

  • Datacentric AI
    The Datacentric AI project aims to develop revolutionary data-centric systems that can enable edge-to-datacenter scale computing while also providing high performance and accuracy for AI tasks, high productivity for AI developers using the system, and self-driven resource management of the underlying complex systems. Rapidly evolving technologies such as new computing architectures, AI frameworks, supercomputing systems, cloud, and data management are the key enablers of AI, and the speed at which they develop is outpacing the ability of AI practitioners to leverage them optimally. As this compute capability, AI frameworks, and data diversity have grown, AI models have also evolved from traditional feed-forward or convolutional networks that employ computationally simple layers to more complex networks that use differential equations to model physical phenomena. These new classes of algorithms and massive model architectures need new types of data-centric systems that can help map the novel computing requirements to ever more complex hardware platforms such as quantum processors, neuromorphic processors, and datacenter-scale chips. A data-centric system would need revolutionary operating systems, ML-enhanced data management, highly parallel algorithms, and workload-aware schedulers that can automatically map workloads to heterogeneous hardware platforms. By developing technologies to address these needs, this project aims to provide Lincoln and DoD researchers with the tools to address the needs of future AI systems.

  • Parallel Python Programming
    There is a plethora of libraries that enable parallel programming with the Python programming language, but little has been done with the partitioned global array semantics (PGAS) approach.  Using the PGAS approach, one can deploy a parallel capability that provides good speed-up without sacrificing the ease of programming in Python. This project will explore the scalability and performance of the preliminary implementation of PGAS in Python, compare its performance with other libraries available for Python parallel programming, and potentially seek further performance optimization in the current PGAS implementation.
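    A toy sketch of partitioned global array semantics, assuming a simple block distribution; pPython's actual implementation and API differ, and everything here runs in one process purely to illustrate the global-to-local index mapping.

    ```python
    class PartitionedArray:
        """Toy PGAS sketch: a global 1-D array block-distributed over ranks,
        with global indexing resolved to (owner rank, local offset)."""
        def __init__(self, n, nranks):
            self.n, self.nranks = n, nranks
            self.block = -(-n // nranks)          # ceiling division: block size per rank
            # Each rank owns one contiguous block (all held locally in this sketch).
            self.local = [[0] * min(self.block, max(0, n - r * self.block))
                          for r in range(nranks)]
        def owner(self, i):
            """Map a global index to (owner rank, local offset)."""
            return i // self.block, i % self.block
        def __setitem__(self, i, v):
            r, off = self.owner(i)
            self.local[r][off] = v
        def __getitem__(self, i):
            r, off = self.owner(i)
            return self.local[r][off]

    A = PartitionedArray(10, 4)    # 10 elements over 4 ranks: blocks of 3, 3, 3, 1
    A[7] = 42
    print(A.owner(7), A[7])        # → (2, 1) 42
    ```

    In a real PGAS runtime, reading `A[7]` from a rank other than 2 would trigger a communication step; the programmer keeps the convenient global-index view while the runtime handles the message passing.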
    References:
    • pPython for Parallel Python Programming
    • pPython Performance Study

  • 3D Visualization of Supercomputer Performance
    There are a number of data collection and visualization tools to assist in the real-time performance analysis of High Performance Computing (HPC) systems, but there is a need to analyze past performance for systems troubleshooting and system behavior analysis. Optimizing HPC systems for processing speed, power consumption, and network performance can be difficult to do in real time, so a system that uses collected data to “rerun” system performance would be advantageous. Gaming engines, like Unity 3D, can be used to build virtual system representations and run scenarios using historical or manufactured data to identify system failures or bottlenecks, and can be fine-tuned to optimize performance metrics.
    References:
    • 3D Real-Time Supercomputer Monitoring
    • Large Scale Network Situational Awareness via 3D Gaming Technology

  • Data Analytics and 3D Game Development
    The LLSC operates and maintains a large number of High Performance Computing clusters for general-purpose accelerated discovery across many research domains. The operation of these systems requires access to detailed information regarding the status of system schedulers, storage arrays, compute nodes, network data, and data center conditions. These data collections represent over 80 million data points per day. Effectively correlating this volume of data into actionable information requires innovative approaches and tools. The LLSC has developed a 3D Monitoring and Management (M&M) platform by leveraging Unity3D to render the physical data center space and assets into a virtual environment that strives to provide a holistic view of the HPC resources in a human-digestible format. Our goal is to achieve a level of situational awareness that enables the operations team to identify and correct issues before they negatively impact the user experience. Some near-term goals are to fold the innovative Green Data Center challenge work and data into the M&M system to enable the identification of the carbon impacts of different job profiles across a heterogeneous compute environment.
    References:
    • Large Scale Network Situational Awareness via 3D Gaming Technology
    • Big Data Strategies for Data Center Infrastructure Management using a 3D Gaming Platform
    • Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System
    • 3D Real-Time Supercomputer Monitoring
    • A Green(er) World for AI
    • Unity game development platform

Education, Training, and Outreach Projects

  • Evaluate the State of User Applications
    • Capture the most commonly used applications/workflows
    • Compare to existing set of teaching examples
    • Design a prioritized suite of new examples or updates to existing examples
    • Highlight topics that would make good micro-lessons

  • Expand the Suite of Teaching Examples
    • Use the prioritized list created via the project defined above, or start with code snippets that we currently have
    • Identify areas/scripts that would be beneficial to users:
    • Build whole learning module, e.g. tensorboard
    • documentation converted to scripts (where possible)
    • clean up scripts – (ascii art) to make it clearer for user
    • testing
    • include description of what is happening on systems
    • include information on impact to system and user
    • create mini-workshop
    • convert workshop to video & hands-on module

  • Evaluation of Educational Games
    • Explore the literature to understand methods for evaluating learning
    • Evaluate our HPC Games
    • What data should we collect? Survey design? Is there any data that we should look for in the online version?
    • Recommend a formal education plan for each game
    References:
    • A Data Driven Approach to Informal HPC Training Evaluation

  • Knowledge Graph from Video Scripts
    • NLP
    • align with content in course(s)

  • Learning Analytics (WPLA - workplace learning analytics)
    • Review of WPLA literature and best practices
    • Using SLURM data and Edly data to evaluate effectiveness of courseware
    • Path from Jupyter NB to batch jobs
    • Efficiency of jobs
    • Email requests aligned with learning modules - do we lower the amount of simple email?
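    One possible WPLA metric from scheduler data is per-job CPU efficiency, which can flag users who would benefit from training. The sketch below computes it from sacct-style records; the field names and numbers are illustrative, not actual SLURM output.

    ```python
    # Hypothetical sacct-style records: allocated CPUs, wall time, and total CPU time used.
    jobs = [
        {"user": "alice", "cpus": 4, "elapsed_s": 3600, "total_cpu_s": 13000},
        {"user": "bob",   "cpus": 8, "elapsed_s": 1800, "total_cpu_s": 2000},
    ]

    def cpu_efficiency(job):
        """Fraction of the allocated CPU time that was actually used."""
        return job["total_cpu_s"] / (job["cpus"] * job["elapsed_s"])

    for j in jobs:
        print(j["user"], round(cpu_efficiency(j), 2))
    # → alice 0.9  (uses ~90% of her allocation)
    # → bob 0.14   (uses ~14% -- a candidate for a targeted learning module)
    ```

    Joining this metric with courseware completion data (e.g., from Edly) would let the team test whether training measurably improves job efficiency over time.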

  • Exploring User Data
    • Prior experience
    • Departments, position, affiliation
    • How long are they using the system for? “Turnover”?
    • Combine with Slurm data- how much are people using the system? Correlations between system use and how long they’ve had their account?

Connection Science Projects