Technologies powering Politichook: a deep dive into the stack

Published on March 16, 2025

Politichook.com, a tool for tracking congressional stock trades, relies on a suite of technologies to scrape, process, and deliver real-time notifications. below is a technical breakdown of the stack, focusing solely on the tools and their roles.

Backend: python with flask

python: the core language driving politichook. its ecosystem supports rapid prototyping and integrates with libraries for web scraping, ocr, and ai.

  • libraries: requests for http calls, beautifulsoup for html parsing, and pytesseract for ocr interfacing.

flask: a lightweight microframework powering the backend.

  • role: manages api endpoints for scraping, processing filings, and triggering notifications.
  • why: minimal overhead, ideal for a small-scale application with heavy data-processing needs.

OCR engine: tesseract

tesseract: an open-source optical character recognition (ocr) engine maintained by google.

  • role: extracts text from pdf filings downloaded from disclosures-clerk.house.gov.
  • implementation: wrapped via pytesseract, with preprocessing handled by opencv (cv2) for image enhancement (e.g., contrast adjustment, noise reduction).
  • strengths: free, effective for typed text.
  • limits: struggles with handwriting or low-quality scans, requiring a fallback.

ai layer: openai gpt api

openai gpt api: a cloud-based natural language processing model.

  • role: interprets trade details (e.g., ticker, quantity, buy/sell) from ocr outputs when tesseract fails, especially on handwritten or ambiguous filings.
  • integration: called via rest api with a structured prompt: “extract stock trades from: [ocr text].”
  • optimization: invoked only when tesseract’s confidence score drops below 80%, minimizing api calls.
  • why: contextual understanding surpasses traditional regex or rule-based parsing.

Frontend: react

react: a javascript library for building user interfaces.

  • role: powers the client-side application where users select congress members and configure notifications.
  • features: component-based architecture for a fast, single-page app experience; state management via redux for tracking user preferences.
  • why: lightweight, responsive, and pairs well with a flask backend via restful apis.

Hosting and infrastructure: aws

amazon ec2: elastic compute cloud.

  • role: runs the flask backend and scheduled tasks (e.g., cron jobs for scraping).
  • specs: t3.micro instance for cost-efficiency, auto-scalable if traffic spikes.

amazon s3: simple storage service.

  • role: stores raw pdf filings and processed metadata.
  • why: cheap, durable, and integrates natively with other aws services.

amazon ses: simple email service.

  • role: sends email notifications to users.
  • implementation: triggered via flask using the boto3 sdk; supports throttling for free-tier delays.
  • why: cost-effective ($0.10 per 1,000 emails) and reliable for transactional emails.

dynamodb: nosql database.

  • role: tracks processed filings via hashed metadata to avoid duplicates.
  • why: fast lookups, serverless scaling.

Supporting tools

opencv: computer vision library (cv2).

  • role: preprocesses pdf scans (e.g., grayscale conversion, edge detection) to boost tesseract’s accuracy.

redis: in-memory data store.

  • role: queues delayed notifications for the free tier, ensuring cost control.
  • why: low-latency task management.

How it ties together

workflow: ec2’s cron job scrapes the disclosure site hourly, saving pdfs to s3. flask processes each file: opencv enhances images, tesseract extracts text, and gpt refines unclear outputs. react displays options to users, while ses delivers alerts based on redis queues or real-time triggers. dynamodb ensures efficiency by deduplicating filings.

scalability: aws handles growth—ec2 scales with compute demand, s3 with storage, and ses with email volume. gpt costs are the wildcard, tied to filing complexity.

This stack—python/flask, tesseract, gpt, react, and aws—forms a pipeline for turning congressional pdfs into actionable data, balancing performance, cost, and reliability.


Stay Updated on Congressional Trades

Get real-time alerts when Congress members make stock trades.

Start Free Trial

Share This Article