Just Gmail

Hello world!

In this post I’m sharing a fun project I had to deliver for a much larger project: A Spam detector available as an API.

What are the requirements?

The span detector that I was tasked to create is part of a much larger system. Basically, the need is to test if an outbound email can be assymiled to a spam, before sending the email.

The #1 open source solution for spam detection is Spam Assassin. The question then is to make Spam Assassin available as an API.

The overall architecture is like the following:

System Architecture of SpamAssassin with FastAPI server

Basically, our user queries the Docker container that will forward the query to our FastAPI server. It will then send the content of the email to SpamAssassin to get the score of the email and a report. This will be sent to our original user in a nice JSON format.

Sounds good? Let’s get started…

The application

Let’s start by the docker image. Starting with an Ubuntu image, we install the necessary tools, most importantly SpamAssassin and Python for server API.

FROM ubuntu:22.04

# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    spamassassin \
    spamc \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Create app directory and cd into it
WORKDIR /app
COPY . .

# This updates SpamAssassin rules
RUN sa-update

RUN mkdir /var/run/spamd
CMD ["sh", "-c", "spamd --allow-tell --create-prefs --max-children 5 --helper-home-dir /var/lib/spamassassin -u debian-spamd & python3 app.py"]

EXPOSE 5000

NOTA: This is a simple and not optimized docker image. Consider it for education purpose only, adapt it for production.

The rest is just basic configuration, nothing obscure under the sun, except for the CMD in which we start SpamAssassin daemon and the Python server.

Building and running the docker container is done with the following:

docker build -t spam-detector .
docker run --rm -p 5000:5000 spam-detector

This will expose the whole application through port 5000.

For the Python server, I used FastAPI, with only two REST endpoints:

  • POST /check to check if the email is a Spam
  • GET /up as a health check

I’m using Pydantic to ensure typing, especially we are handling a JSON input, and returning a JSON too:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import subprocess
import tempfile
import os


app = FastAPI(title="SpamAssassin as a Service")


class EmailMessage(BaseModel):
    content: str
    threshold: float = 5.0  # Default spam score threshold


class SpamResponse(BaseModel):
    is_spam: bool
    spam_score: float
    report: str


@app.post("/check", response_model=SpamResponse)
async def check_spam(email: EmailMessage):
    try:
        # Create temporary file for email content
        with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
            temp_file.write(email.content)
            temp_file_path = temp_file.name

        # Run SpamAssassin check
        result = subprocess.run(
            ['spamc', '-c', temp_file_path],
            capture_output=True,
            text=True
        )

        # Get detailed report
        report = subprocess.run(
            ['spamc', '-R', temp_file_path],
            capture_output=True,
            text=True
        )

        # Clean up temporary file
        os.unlink(temp_file_path)

        # Parse spam score from report
        try:
            spam_score = float(result.stdout.strip())
        except ValueError:
            spam_score = 0.0

        return SpamResponse(
            is_spam=spam_score >= email.threshold,
            spam_score=spam_score,
            report=report.stdout
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/up")
async def health_check():
    try:
        # Check if SpamAssassin daemon is running
        subprocess.run(['pgrep', 'spamd'], check=True)
        return {"status": "healthy"}
    except subprocess.CalledProcessError:
        raise HTTPException(status_code=503, detail="SpamAssassin daemon is not running")

The POST /check endpoints expects a JSON body with the following information:

  • content which is the content of the email we want to send
  • threshold which is a floating number that scores the “spaminess” of our message. Basically, SpamAssassin will give the message a score based on some criteria, for example:
    • 2.0 BODY: Contains ‘click here’
    • 3.2 BODY: Excessive use of uppercase text
    • 2.7 MONEY: Mentions large sum of money
    • 4.5 SCAM: Typical prize scam phraseology
    • 2.1 FREEPRIZE: Contains ‘free’ and ‘prize’
    • 2.1 CREDIT_CARD: Asks for credit card details

By default we set the threshold to 5 if not provided by the body, but it can be customized for every query. It means that if the score given by SpamAssassin reaches or exceeds the threshold, it is marked as spam.

In our server, the JSON output is like the following:

{
  "is_spam": false,
  "spam_score": 0.3,
  "report": "* -0.0 BODY: Human generated mail text
   * 0.2 BODY: Contains business-related terms
   * 0.1 BODY: Contains formal greeting and signature

   SpamAssassin Score: 0.3 points
   Status: HAM (threshold 5.0)"
}

The nice thing is that SpamAssassin provides a report for its scoring. I really liked this because it can give explanations for a user on why his email is detected as a Spam.

Health check is testable with:

curl http://localhost:5000/up

and spam detection though:

curl -X POST "http://localhost:5000/check" \
     -H "Content-Type: application/json" \
     -d '{"content": "Buy now! Limited time offer!", "threshold": 5.0}'

Finally, here is the minimal requirements.txt:

fastapi>=0.68.0
uvicorn>=0.15.0
pydantic>=1.8.2
python-multipart>=0.0.5

Those packages will be installed in the Docker image on build.

Take away

This was really fun because I learned what makes an email a Spam. It’s a nice and cheap internal solution, but I can imagine that a large scale service of Spam-detection-as-a-service is much more complex.

SpamAssassin is not bullet proof. But again, this was a nice first step for teams that send emails in big volumes and don’t want them to be marked as spams.

Other solutions involve specialized SaaS applications or using LLMs to categorize messages, but those solutions can quickly get expensive depending on the emails volume.

See you soon.

Hassen