ACTIVE January 20, 2024

Google Maps Review Scraper

A TypeScript npm module that scrapes reviews from Google Maps places using reverse-engineered Google Maps APIs.

Problem Statement

Extracting reviews from Google Maps places manually is time-consuming and inefficient. Developers and businesses need a programmatic way to collect review data for analysis, monitoring, or integration into other applications.

Solution Architecture

The Google Maps Review Scraper reverse-engineers Google Maps’ internal APIs to extract review data without using official Google APIs (which don’t provide comprehensive review access). The scraper follows this architecture:

[Diagram: Solution Architecture]

Technical Implementation

Core Components

  1. URL Parser (index.ts): Validates and extracts Place ID from Google Maps URLs
  2. Session Token Fetcher (extraction.ts): Retrieves authentication tokens from Google Maps pages
  3. Review API Client (utils.ts): Makes requests to Google Maps’ internal review endpoints
  4. Pagination Handler (utils.ts): Manages continuation tokens for fetching multiple pages
  5. Data Parser (parser.ts): Transforms raw API responses into structured review objects

Detailed Workflow

  1. URL Validation & Place ID Extraction

    • Validates the input URL format using the URL constructor
    • Ensures host includes “google.com” (strictly desktop web version)
    • Extracts the Place ID using regex pattern matching (!1s([a-zA-Z0-9_:]+)!)
    • Example: From https://maps.google.com/maps/place/ChIJN1t_tDeuEmsRUsoyG83frY4 extracts ChIJN1t_tDeuEmsRUsoyG83frY4
  2. Session Authentication

    • Fetches the Google Maps place page at https://maps.google.com/maps/place/{placeId}?hl=en&gl=US
    • Extracts the kEI token from an inline JavaScript variable using html.split("var kEI='")[1]?.split("'")[0]
    • This token acts as a session identifier for subsequent API requests
  3. Review Data Fetching

    • Makes GET requests to https://www.google.com/maps/rpc/listugcposts (URL built by the listugcposts function)
    • Includes required parameters:
      • Place ID
      • Sort type (1=Most Relevant, 2=Newest, 3=Highest Rating, 4=Lowest Rating)
      • Search query (optional, encoded as !3s{query})
      • Session token (kEI)
      • Pagination token (Base64 encoded page number)
    • Handles Google’s XSSI protection by stripping )]}' prefix from responses before JSON parsing
  4. Pagination Handling

    • Processes continuation tokens from API responses (found in response[1])
    • Continues fetching until:
      • Requested page limit is reached (specified by pages parameter)
      • No more reviews are available (empty response[2] array)
      • No new continuation token is returned (indicating end of results)
  5. Data Parsing & Structuring

    • Transforms raw array-based API responses into structured objects using the parser function
    • Extracts nested data including:
      • Review metadata (ID, timestamps for published/last_edited)
      • Author information (name, profile URLs, profile page URL, author ID)
      • Review content (rating as number, text content, language code)
      • Images (array with ID, URL, dimensions, location data, caption)
      • Owner responses (if any, with text and timestamps)
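
The XSSI handling mentioned in step 3 can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name parseGuardedJson is hypothetical, and the sample payload in the usage note is invented. Only the )]}' prefix itself comes from the description above.

```typescript
// Guard prefix that Google prepends to JSON responses (taken from the workflow above).
const XSSI_PREFIX = ")]}'";

// Hypothetical helper: strip the guard if present, then parse the remaining JSON.
function parseGuardedJson(body: string): unknown {
  const trimmed = body.trimStart();
  const json = trimmed.startsWith(XSSI_PREFIX)
    ? trimmed.slice(XSSI_PREFIX.length)
    : trimmed;
  // JSON.parse tolerates the newline that typically follows the prefix.
  return JSON.parse(json);
}
```

For instance, parseGuardedJson(")]}'\n[1,[2,3]]") yields the nested array [1, [2, 3]], while a body without the prefix is parsed unchanged.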

Key Implementation Details

  • HTTP Client: Uses impit for efficient HTTP requests with connection pooling
  • Cookie Management: Leverages tough-cookie for handling session cookies automatically
  • Rate Limiting: Built-in 2-second delay between initial requests to avoid detection
  • Error Handling: Comprehensive error handling with graceful degradation
  • Type Safety: Full TypeScript support with explicit type definitions

Data Flow Diagram

[Diagram: Data Flow]

Dependencies

Production Dependencies

  • impit (@apify/impit): Lightweight HTTP client for making requests with automatic cookie handling
  • tough-cookie (@salesforce/tough-cookie): Robust cookie handling for session management

Development Dependencies

  • TypeScript: For type-safe development
  • @types/node: TypeScript definitions for Node.js
  • @types/tough-cookie: TypeScript definitions for tough-cookie
  • rimraf: For removing dist/ directory before builds
  • tsx: For TypeScript execution

Technical Limitations & Considerations

Rate Limiting

  • Built-in 2-second delay between initial requests to avoid detection
  • Respects reasonable usage patterns
  • May require additional delays for high-volume scraping
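
One way to add the extra delays mentioned above is a capped exponential backoff. The sketch below is an assumption about how a caller might space out requests, not something the module is documented to do: the 2-second base matches the built-in delay described here, while the doubling strategy, the 60-second cap, and the names backoffDelayMs and sleep are all illustrative.

```typescript
// Hypothetical helper: delay before the next request, doubling per attempt.
// baseMs mirrors the documented 2-second delay; maxMs is an assumed cap.
function backoffDelayMs(attempt: number, baseMs = 2000, maxMs = 60000): number {
  const delay = baseMs * 2 ** Math.max(0, attempt);
  return Math.min(delay, maxMs);
}

// Promise-based pause, usable as: await sleep(backoffDelayMs(attempt))
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));
```

With these defaults, attempts 0, 1, and 2 wait 2000, 4000, and 8000 ms, and later attempts level off at the cap.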

HTML Structure Dependency

  • Relies on the current Google Maps page and URL structure: the kEI token is read from an inline var kEI='...' assignment, and the Place ID is matched with the regex !1s([a-zA-Z0-9_:]+)!
  • May break if Google significantly changes their frontend
  • Internal API endpoints may change over time (observed endpoint: https://www.google.com/maps/rpc/listugcposts)

Legal & Ethical Considerations

  • Intended for educational and proof-of-concept purposes
  • Users must comply with Google’s Terms of Service
  • Not intended for large-scale commercial scraping operations
  • Should be used responsibly to avoid IP blocking

Data Completeness

  • May not capture all reviews due to Google’s personalization
  • Some reviews might be filtered based on location or account status
  • Response data may be incomplete for very old reviews

Use Cases

Business Applications

  • Competitive Analysis: Monitor competitor reviews and ratings
  • Customer Experience: Track sentiment changes over time
  • Market Research: Analyze trends in customer feedback
  • Reputation Management: Identify and respond to negative feedback promptly

Technical Applications

  • Data Enrichment: Enhance business directories with review data
  • Review Aggregation: Collect reviews from multiple sources
  • Sentiment Analysis: Feed data into NLP models for opinion mining
  • API Development: Build custom review-based services

Research Applications

  • Academic Research: Study consumer behavior patterns
  • Social Science: Analyze geographic distribution of satisfaction
  • Linguistics: Study language patterns in reviews
  • Economics: Correlate reviews with business performance metrics

Architecture Deep Dive

Module Responsibilities

index.ts (Entry Point)

  • Orchestrates the entire scraping process
  • Validates inputs and handles errors
  • Coordinates between subsystems
  • Provides the public API interface
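
The input validation performed at the entry point might look roughly like the sketch below. It combines the URL constructor check, the "google.com" host check, and the !1s(...)! pattern described in the workflow. The function name extractPlaceId and the error messages are assumptions, and the URL in the usage note is a made-up example.

```typescript
// Pattern quoted in the workflow description.
const PLACE_ID_PATTERN = /!1s([a-zA-Z0-9_:]+)!/;

// Hypothetical helper: validate a Google Maps URL and pull out the Place ID.
function extractPlaceId(input: string): string {
  let url: URL;
  try {
    url = new URL(input); // throws on malformed input
  } catch {
    throw new Error(`Invalid URL: ${input}`);
  }
  if (!url.hostname.includes("google.com")) {
    throw new Error(`Not a Google Maps URL: ${url.hostname}`);
  }
  const match = PLACE_ID_PATTERN.exec(url.href);
  if (!match) {
    throw new Error("Place ID not found in URL");
  }
  return match[1];
}
```

Given https://www.google.com/maps/place/Example/data=!4m6!3m5!1s0x6b12ae401e8b983f:0x5017d681632ccc0!8m2, the helper returns 0x6b12ae401e8b983f:0x5017d681632ccc0. Malformed URLs and non-Google hosts raise an error instead.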

extraction.ts (Session Management)

  • Handles authentication with Google Maps
  • Extracts session tokens from page content
  • Manages the initial HTTP request to establish context
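
Token extraction from the page content can be illustrated with the split expression quoted in the workflow. The wrapper name extractKei and the sample HTML in the usage note are hypothetical. The real page is far larger, and the token format is not documented here.

```typescript
// Hypothetical helper: read the kEI session token from the place page HTML.
// Returns undefined when the marker is missing or the token is empty.
function extractKei(html: string): string | undefined {
  const token = html.split("var kEI='")[1]?.split("'")[0];
  return token ? token : undefined;
}
```

For a page containing <script>var kEI='AbC123_x';</script>, the helper returns AbC123_x. If Google renames or removes the variable, it returns undefined, which a caller can treat as a signal that the page structure has changed.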

utils.ts (Core Logic)

  • Implements parameter validation
  • Manages API communication with Google’s endpoints
  • Handles pagination through continuation tokens
  • Coordinates data fetching across multiple pages
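
The three stop conditions for pagination (page limit reached, empty response[2], no continuation token in response[1]) can be captured in a small decision function. This is a sketch under assumptions: the names RawPage, nextContinuationToken, and collectPages are invented, and fetchPage stands in for whatever function actually performs the HTTP request.

```typescript
// Raw API page: an untyped array where [1] is the continuation token
// and [2] is the list of reviews, per the workflow description.
type RawPage = unknown[];

// Hypothetical helper: return the token for the next request, or null to stop.
function nextContinuationToken(
  response: RawPage,
  pagesFetched: number,
  pageLimit: number,
): string | null {
  if (pagesFetched >= pageLimit) return null; // requested page limit reached
  const reviews = response[2];
  if (!Array.isArray(reviews) || reviews.length === 0) return null; // no more reviews
  const token = response[1];
  return typeof token === "string" && token.length > 0 ? token : null; // end of results
}

// Hypothetical driver loop; fetchPage is supplied by the caller.
async function collectPages(
  fetchPage: (token: string | null) => Promise<RawPage>,
  pageLimit: number,
): Promise<unknown[]> {
  const reviews: unknown[] = [];
  if (pageLimit <= 0) return reviews;
  let token: string | null = null;
  let pagesFetched = 0;
  do {
    const response = await fetchPage(token);
    pagesFetched += 1;
    if (Array.isArray(response[2])) reviews.push(...response[2]);
    token = nextContinuationToken(response, pagesFetched, pageLimit);
  } while (token !== null);
  return reviews;
}
```

Keeping the stop logic separate from the network call makes it possible to test the pagination rules without contacting Google.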

parser.ts (Data Transformation)

  • Converts Google’s internal array format to structured objects
  • Extracts nested data safely with null checks
  • Maps API responses to meaningful review properties
  • Filters out malformed or incomplete data entries
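
Safe extraction from deeply nested arrays can be sketched with a small path-walking helper. The names dig and asString are hypothetical, and the index paths in the usage note are invented for illustration. The real positions inside Google's response are not documented here.

```typescript
// Hypothetical helper: follow a path of array indices, returning undefined
// as soon as any level is missing or is not an array.
function dig(value: unknown, ...path: number[]): unknown {
  let current: unknown = value;
  for (const index of path) {
    if (!Array.isArray(current)) return undefined;
    current = current[index];
  }
  return current ?? undefined; // normalise null to undefined
}

// Hypothetical helper: narrow an extracted value to a string, or null otherwise.
function asString(value: unknown): string | null {
  return typeof value === "string" ? value : null;
}
```

With a made-up entry such as [["id-1", [null, "Great place"]]], the call asString(dig(entry, 0, 1, 1)) gives "Great place", while a path that runs off the end, such as dig(entry, 0, 5, 2), gives undefined instead of throwing.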

types.ts (Type Definitions)

  • Defines TypeScript interfaces for all data structures
  • Ensures type safety throughout the application
  • Provides clear contracts between modules
  • Documents expected data shapes for consumers
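
To give a sense of what such contracts could look like, here is an illustrative set of interfaces based on the fields listed in the Data Parsing & Structuring step. Every property name below is an assumption; the module's actual types.ts may name and nest these fields differently. The averageRating function is a hypothetical consumer added to show how typed output can be used.

```typescript
// Illustrative shapes only; property names are assumed, not taken from types.ts.
interface ReviewImage {
  id: string;
  url: string;
  width: number;
  height: number;
  caption: string | null;
}

interface OwnerResponse {
  text: string;
  published: number; // timestamp
}

interface Review {
  reviewId: string;
  published: number; // timestamp
  lastEdited: number; // timestamp
  author: { name: string; profileUrl: string; id: string };
  rating: number;
  text: string | null;
  language: string | null;
  images: ReviewImage[];
  ownerResponse: OwnerResponse | null;
}

// Hypothetical consumer: average rating across reviews, 0 for an empty list.
function averageRating(reviews: Review[]): number {
  if (reviews.length === 0) return 0;
  const total = reviews.reduce((sum, r) => sum + r.rating, 0);
  return total / reviews.length;
}
```

Typed output like this lets downstream code, such as sentiment pipelines or dashboards, rely on compile-time checks rather than inspecting raw arrays.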

Security Considerations

  • No sensitive data storage (tokens are ephemeral)
  • All requests made client-side (no server component)
  • Keeps request volume low through built-in rate limiting
  • No attempt to bypass security measures beyond standard session handling

Performance Characteristics

Time Complexity

  • O(n) where n is the number of pages requested
  • Each page request involves one HTTP call
  • Parsing complexity is linear with review count

Space Complexity

  • O(r) where r is the total number of reviews retrieved
  • Memory usage scales linearly with collected data
  • Intermediate buffers are released after processing

Network Efficiency

  • Connection reuse through HTTP keep-alive
  • Minimal header overhead
  • Efficient cookie management
  • Batched processing where possible

Testing & Reliability

Code Quality Features

  • Parameter validation logic
  • URL parsing and place ID extraction
  • Session token extraction from HTML
  • Pagination logic and continuation handling
  • Data parsing and transformation
  • Error handling pathways

Reliability Features

  • Graceful degradation on partial failures
  • Detailed error logging for debugging
  • Validation at each processing stage

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This project is not affiliated with, endorsed by, or associated with Google LLC. It reverse-engineers publicly accessible Google Maps interfaces for educational purposes only. Users are responsible for ensuring their usage complies with applicable laws, regulations, and Google’s Terms of Service.