Big Data & Tools with NoSQL
  • Big Data & Tools
  • ReadMe
  • Big Data Overview
    • Overview
    • Job Opportunities
    • What is Data?
    • How does it help?
    • Types of Data
    • The Big 4 V's
      • Variety
      • Volume
      • Velocity
      • Veracity
      • Other V's
    • Trending Technologies
    • Big Data Concerns
    • Big Data Challenges
    • Data Integration
    • Scaling
      • CAP Theorem
      • Optimistic concurrency
      • Eventual consistency
      • Concurrent vs. Parallel Programming
    • Big Data Tools
    • No SQL Databases
    • What does Big Data learning means?
  • Linux & Tools
    • Overview
    • Linux Commands - 01
    • Linux Commands - 02
    • AWK
    • CSVKIT
    • CSVSQL
    • CSVGREP
  • Data Format
    • Storage Formats
    • CSV/TSV/Parquet
    • Parquet Example
    • JSON
    • HTTP & REST API
      • Terms to Know
        • Statefulness
        • Statelessness
        • Monolithic Architecture
        • Microservices
        • Idempotency
    • REST API
    • Python
      • Setup
      • Decorator
      • Unit Testing
      • Flask Demo
      • Flask Demo - 01
      • Flask Demo - 02
      • Flask Demo - 03
      • Flask Demo - 04
      • Flask Demo - 06
    • API Testing
    • Flask Demo Testing
    • API Performance
    • API in Big Data World
  • NoSQL
    • Types of NoSQL Databases
    • Redis
      • Overview
      • Terms to know
      • Redis - (RDBMS) MySql
      • Redis Cache Demo
      • Use Cases
      • Data Structures
        • Strings
        • List
        • Set
        • Hash
        • Geospatial Index
        • Pub/Sub
        • Redis - Python
      • Redis JSON
      • Redis Search
      • Persistence
      • Databases
      • Timeseries
    • Neo4J
      • Introduction
      • Neo4J Terms
      • Software
      • Neo4J Components
      • Hello World
      • Examples
        • MySQL: Neo4J
        • Sample Transactions
        • Sample
        • Create Nodes
        • Update Nodes
        • Relation
        • Putting it all together
        • Commonly used Functions
        • Data Profiling
        • Queries
        • Python Scripts
      • More reading
    • MongoDB
      • Sample JSON
      • Introduction
      • Software
      • MongoDB Best Practices
      • MongoDB Commands
      • Insert Document
      • Querying MongoDB
      • Update & Remove
      • Import
      • Logical Operators
      • Data Types
      • Operators
      • Aggregation Pipeline
      • Further Reading
      • Fun Task
        • Sample
    • InfluxDB
      • Data Format
      • Scripts
  • Python
    • Python Classes
    • Serialization-Deserialization
  • Tools
    • JQ
    • DUCK DB
    • CICD Intro
    • CICD Tools
      • CI YAML
      • CD Yaml
    • Containers
      • VMs or Containers
      • What container does
      • Podman
      • Podman Examples
  • Cloud Everywhere
    • Overview
    • Types of Cloud Services
    • Challenges of Cloud Computing
    • High Availability
    • Azure Cloud
      • Services
      • Storages
      • Demo
    • Terraform
  • Data Engineering
    • Batch vs Streaming
    • Kafka
      • Introduction
      • Kafka Use Cases
      • Kafka Software
      • Python Scripts
      • Different types of Streaming
    • Quality & Governance
    • Medallion Architecture
    • Data Engineering Model
    • Data Mesh
  • Industry Trends
    • Roadmap - Data Engineer
    • Good Reads
      • IP & SUBNET
Powered by GitBook
On this page
  1. NoSQL
  2. MongoDB

Aggregation Pipeline

PreviousOperatorsNextFurther Reading

Last updated 1 year ago

MongoDB is not just adding and filtering data using find()

Aggregation is a way of processing a large number of documents in a collection by means of passing them through different stages.

The stages make up what is known as a pipeline. The stages in a pipeline can filter, sort, group, reshape and modify documents that pass through the pipeline

  • $match stage – filters those documents we need to work with, those that fit our needs

  • $group stage – does the aggregation job

  • $sort stage – sorts the resulting documents the way we require (ascending or descending)

use sampleagg;

db.states.insertMany(
[
  { "_id": "NY", "name": "New York", "population": 19453561, "rank": 4 },
  { "_id": "CA", "name": "California", "population": 39512223, "rank": 1 },
  { "_id": "TX", "name": "Texas", "population": 28995881, "rank": 2 },
  { "_id": "NJ", "name": "New Jersey", "population": 8995881, "rank": 3 }
]
)

db.universities.insertMany(
[
  { "name": "Columbia University", "stateId": "NY", "year": 2020, "enrolled": 31000, "graduated": 7000 },
  { "name": "New York University", "stateId": "NY", "year": 2020, "enrolled": 51123, "graduated": 12000 },
  { "name": "University of California, Berkeley", "stateId": "CA", "year": 2020, "enrolled": 43000, "graduated": 10000 },
  { "name": "Stanford University", "stateId": "CA", "year": 2020, "enrolled": 17400, "graduated": 4000 },
  { "name": "University of Texas at Austin", "stateId": "TX", "year": 2020, "enrolled": 51000, "graduated": 11000 },
  { "name": "Rowan University", "stateId": "NJ", "year": 2020, "enrolled": 19500, "graduated": 4300 },
  { "name": "Rutgers University", "stateId": "NJ", "year": 2020, "enrolled": 70645, "graduated": 17600 }
]
)


db.universities.find()

// Step 1
db.universities.aggregate([
  {
    $match: {
      "year": 2020  // Focus on the year 2020
    }
  },
  
  {
     $project: {
     _id:1,
     name:1,
     stateId:1,
     year:1,
     enrolled:1,
     graduated:1
    }
  }
])

// Step 2

// Group by State, and do a sum of Enrolled, Graduated
// In MongoDB count is not part of Grouping.

db.universities.aggregate([
  {
    $match: {
      "year": 2020  // Focus on the year 2020
    }
  },
  {
    $group: {
      _id: "$stateId",  // Group by state ID
      totalEnrolled: { $sum: "$enrolled" },  // Sum of all students enrolled
      totalGraduated: { $sum: "$graduated" },  // Sum of all graduates
      universitiesCount: { $sum: 1 }  // Count universities
    }
  },
  
  {
     $project: {
     _id:1,
     totalEnrolled:1,
     totalGraduated:1,
     universitiesCount:1
    }
  }
])

// Step 3

// Join with State Documents

db.universities.aggregate([
  {
    $match: {
      "year": 2020  // Focus on the year 2020
    }
  },
  {
    $group: {
      _id: "$stateId",  // Group by state ID
      totalEnrolled: { $sum: "$enrolled" },  // Sum of all students enrolled
      totalGraduated: { $sum: "$graduated" },  // Sum of all graduates
      universitiesCount: { $sum: 1 }  // Count universities
    }
  },
  {
      $lookup:{
          from:"states",
          localField:"_id",
          foreignField: "_id",
          as:"stateDetails"
      }
  },
  {
     $project: {
     _id:1,
     totalEnrolled:1,
     totalGraduated:1,
     universitiesCount:1,
     stateDetails:1
    }
  }
])


// Step 4
// Final Query
// is used to deconstruct the stateDetails array resulting from the $lookup, 
// making it easier to work with the state details in subsequent stages.

db.universities.aggregate([
  {
    $match: {
      "year": 2020  // Focus on the year 2020
    }
  },
 
  {
    $group: {
      _id: "$stateId",  // Group by state ID
      totalEnrolled: { $sum: "$enrolled" },  // Sum of all students enrolled
      totalGraduated: { $sum: "$graduated" },  // Sum of all graduates
      universitiesCount: { $sum: 1 }  // Count universities
    }
  },
  {
      $lookup:{
          from:"states",
          localField:"_id",
          foreignField: "_id",
          as:"stateDetails"
      }
  },
  {
      $unwind: "$stateDetails"
  },
  {
     $project: {
      _id: 1,
      totalEnrolled:1,
      totalGraduated:1,
      universitiesCount:1,
      "stateName": "$stateDetails.name",
      "statePopulation": "$stateDetails.population",
      "stateRank": "$stateDetails.rank",
    }
  }
])