Technologies powering Politichook: a deep dive into the stack
Published on March 16, 2025
Politichook.com, a tool for tracking congressional stock trades, relies on a suite of technologies to scrape, process, and deliver real-time notifications. below is a technical breakdown of the stack, focusing solely on the tools and their roles.
Backend: python with flask
python: the core language driving politichook. its ecosystem supports rapid prototyping and integrates with libraries for web scraping, ocr, and ai.
- libraries:
requests
for http calls,beautifulsoup
for html parsing, andpytesseract
for ocr interfacing.
flask: a lightweight microframework powering the backend.
- role: manages api endpoints for scraping, processing filings, and triggering notifications.
- why: minimal overhead, ideal for a small-scale application with heavy data-processing needs.
OCR engine: tesseract
tesseract: an open-source optical character recognition (ocr) engine maintained by google.
- role: extracts text from pdf filings downloaded from disclosures-clerk.house.gov.
- implementation: wrapped via
pytesseract
, with preprocessing handled by opencv (cv2
) for image enhancement (e.g., contrast adjustment, noise reduction). - strengths: free, effective for typed text.
- limits: struggles with handwriting or low-quality scans, requiring a fallback.
ai layer: openai gpt api
openai gpt api: a cloud-based natural language processing model.
- role: interprets trade details (e.g., ticker, quantity, buy/sell) from ocr outputs when tesseract fails, especially on handwritten or ambiguous filings.
- integration: called via rest api with a structured prompt: “extract stock trades from: [ocr text].”
- optimization: invoked only when tesseract’s confidence score drops below 80%, minimizing api calls.
- why: contextual understanding surpasses traditional regex or rule-based parsing.
Frontend: react
react: a javascript library for building user interfaces.
- role: powers the client-side application where users select congress members and configure notifications.
- features: component-based architecture for a fast, single-page app experience; state management via redux for tracking user preferences.
- why: lightweight, responsive, and pairs well with a flask backend via restful apis.
Hosting and infrastructure: aws
amazon ec2: elastic compute cloud.
- role: runs the flask backend and scheduled tasks (e.g., cron jobs for scraping).
- specs: t3.micro instance for cost-efficiency, auto-scalable if traffic spikes.
amazon s3: simple storage service.
- role: stores raw pdf filings and processed metadata.
- why: cheap, durable, and integrates natively with other aws services.
amazon ses: simple email service.
- role: sends email notifications to users.
- implementation: triggered via flask using the
boto3
sdk; supports throttling for free-tier delays. - why: cost-effective ($0.10 per 1,000 emails) and reliable for transactional emails.
dynamodb: nosql database.
- role: tracks processed filings via hashed metadata to avoid duplicates.
- why: fast lookups, serverless scaling.
Supporting tools
opencv: computer vision library (cv2
).
- role: preprocesses pdf scans (e.g., grayscale conversion, edge detection) to boost tesseract’s accuracy.
redis: in-memory data store.
- role: queues delayed notifications for the free tier, ensuring cost control.
- why: low-latency task management.
How it ties together
workflow: ec2’s cron job scrapes the disclosure site hourly, saving pdfs to s3. flask processes each file: opencv enhances images, tesseract extracts text, and gpt refines unclear outputs. react displays options to users, while ses delivers alerts based on redis queues or real-time triggers. dynamodb ensures efficiency by deduplicating filings.
scalability: aws handles growth—ec2 scales with compute demand, s3 with storage, and ses with email volume. gpt costs are the wildcard, tied to filing complexity.
This stack—python/flask, tesseract, gpt, react, and aws—forms a pipeline for turning congressional pdfs into actionable data, balancing performance, cost, and reliability.
Stay Updated on Congressional Trades
Get real-time alerts when Congress members make stock trades.
Start Free Trial