Fine-Tuning Quirk

I was working on content moderation for Change.org (see our full case study here). Our goal was to automatically identify violative content and flag it for reviewers. The existing system employed several queries, and I found that one of them had a high rate of false positives: it frequently flagged content for review that did not violate Change.org's rules. I decided we should try fine-tuning a model to improve its accuracy on this question.

To start, we had hand-labeled data with "Allow" or "Remove" labels provided by Change.org. I initially attempted to fine-tune a model using just these labels, but I did not observe any performance improvement. I then wanted to try fine-tuning a model on question-answer pairs with full explanations. Here’s an example of the kind of output I was looking for:

Analyzing the potential harms:

  1. Employment Risks:

    • Direct impact: The post does not directly target any individual's employment.

    • Indirect impact: Employees of the Public Security Intelligence Agency may feel undermined or attacked, but this is a systemic critique rather than a personal one.

  2. Relationship Damage:

    • The post does not mention any personal relationships, so it is unlikely to cause damage to someone's personal relationships.

  3. Bullying, Harassment, or Threats:

    • Individuals affiliated with Aleph might face increased scrutiny or harassment if perceived to be part of a controversial group. However, the post itself does not single out any individuals for harassment.

I didn’t actually have any full explanations for these posts, so I had to create them. Since I already had “Allow” or “Remove” labels, I was able to prompt an LLM with variations of:

"If you were to reject this post [insert post], why would you reject it?"

Now we had question-answer pairs like:

Q: you are a moderator at change.org ….. <post here> …
A: LLM generated answer that looks very much like the above

I then plugged these pairs into openpipe.ai to fine-tune GPT-3.5 and Llama 3 7B. We found that GPT-3.5 performed best, with a significant performance boost from using just 25% of our fully labeled dataset as training data for the fine-tuned model.
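For reference, chat-model fine-tuning data is commonly serialized as one JSON record per line, roughly like the sketch below. This is an assumption about the shape of the dataset (the exact format OpenPipe expects may differ), and the example pair is a placeholder:

```python
# Illustrative only: serializing (question, explanation) pairs into chat-style
# fine-tuning JSONL. The exact dataset format OpenPipe expects may differ.
import json

pairs = [
    ("you are a moderator at change.org ..... <post here> ...",
     "Analyzing the potential harms: ..."),
]

with open("moderation_finetune.jsonl", "w") as f:
    for question, explanation in pairs:
        record = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": explanation},
            ]
        }
        f.write(json.dumps(record) + "\n")
```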

However, I discovered an interesting quirk: I had accidentally left out the actual post content in the questions used for fine-tuning, so the questions still contained the <post here> formatting tags. That is, the question half of the training data consisted only of the templated prompt above, with no post text at all. Despite this, the improvement was significant!
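To make the mistake concrete, the bug amounted to something like the sketch below (a simplified reconstruction with a made-up template, not our real pipeline code): the post was never substituted into the template before the question was written to the training set.

```python
# Simplified reconstruction of the bug, not our actual pipeline code.
PROMPT_TEMPLATE = (
    "You are a moderator at change.org. "
    "Decide whether to allow or remove this post:\n\n<post here>"
)

def build_question(post: str) -> str:
    # Intended: PROMPT_TEMPLATE.replace("<post here>", post)
    # What actually happened: the template was returned unchanged, so every
    # training question still contained the literal "<post here>" placeholder.
    return PROMPT_TEMPLATE
```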

Less strangely, after correcting this oversight and pairing actual posts with their moderation explanations, the performance improvements persisted.

So why did we see improvements from fine-tuning when the questions didn’t even contain the corresponding posts? Here are my best guesses (I’d love to hear other suggestions):

  • The kinds of reasons the LLM generated were helpful building blocks for the fine-tuned model when deciding whether or not to allow a post.

  • Some level of statistical correctness filtered through from the answers to the model. For example, if a post was a threat, the LLM would guess “hey, I don’t want this on my website, that’s a threat,” and even though it didn’t know Change.org’s specific rules for petitions, in aggregate it would figure them out from the accepts/rejects.