[D] Published paper uses hardcoded seed and collapsed model to report fraudulent results


Inspired by an earlier post that called out an Apple ICLR paper for having an egregiously low quality benchmark, I want to mention a similar experience I had with a paper that also egregiously misrepresented its contributions. I contacted the authors by raising an issue on their paper's GitHub repository, publicly laying out why their results were misrepresented, but they deleted the repository soon after.

Fraudulent paper: https://aclanthology.org/2024.argmining-1.2/

Associated repository (linked to in paper): https://web.archive.org/web/20250809225818/https://github.com/GIFRN/Scientific-Fraud-etection

Problematic file in repository: https://web.archive.org/web/20250809225819/https://github.com/GIFRN/Scientific-Fraud-etection/blob/main/models/argumentation_based_fraud_detection.py

Backstory

During the summer, I had gotten very interested in the fraudulent paper detector presented in this paper. I could run the authors' code to recreate the results, but the code was very messy, even obfuscated, so I decided to rewrite it over a number of days. I eventually had a model that matched the authors' implementation, a training procedure that matched theirs, and the same data for training and evaluation.

I was very disappointed that my results were MUCH worse than those reported in the paper. I spent a long time trying to debug this on my own end before giving up and going back to do a more thorough exploration of their code. This is what I found:

In the original implementation, the authors initialize a model, train it, test it on label 1 data, and save those results. In the same script, they then initialize a separate model, train it, test it on label 0 data, and save those results. They combined these results and reported them as if a single model had learned to distinguish label 1 from label 0 data. This already invalidates their results, because the combined numbers do not come from one model.
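To make the problem concrete, here is a toy sketch of that evaluation protocol. It is not the authors' code: the function names, the constant "models", and the tiny test set are all made up for illustration. It only shows that routing each test example to a model based on its true label can produce a perfect score from models that have learned nothing.

```python
from typing import Callable, List, Tuple

# Toy stand-ins (hypothetical): an example is (feature, true_label),
# and a "model" is any function mapping a feature to a predicted label.
Example = Tuple[float, int]
Model = Callable[[float], int]


def single_model_accuracy(model: Model, test_set: List[Example]) -> float:
    """Valid protocol: one model sees every test example."""
    correct = sum(model(x) == y for x, y in test_set)
    return correct / len(test_set)


def split_protocol_accuracy(
    model_for_label1: Model, model_for_label0: Model, test_set: List[Example]
) -> float:
    """Flawed protocol: the TRUE label decides which model is asked."""
    correct = 0
    for x, y in test_set:
        model = model_for_label1 if y == 1 else model_for_label0  # label leakage
        correct += model(x) == y
    return correct / len(test_set)


def always_one(x: float) -> int:
    return 1


def always_zero(x: float) -> int:
    return 0


# Balanced toy test set: five examples of each label.
test_set: List[Example] = [(0.1 * i, i % 2) for i in range(10)]

split_acc = split_protocol_accuracy(always_one, always_zero, test_set)
honest_acc_one = single_model_accuracy(always_one, test_set)
honest_acc_zero = single_model_accuracy(always_zero, test_set)

print(f"split protocol: {split_acc:.2f}")
print(f"always_one alone: {honest_acc_one:.2f}")
print(f"always_zero alone: {honest_acc_zero:.2f}")
```

In this extreme case the split protocol scores perfectly while each constant predictor, evaluated honestly on the full test set, is at chance.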

But there's more. If you vary the seed, you see that the models collapse to predicting only a single label relatively often. (We can tell a model has collapsed because it always predicts that label, even when we evaluate it on data of the opposite label.) The authors selected a seed so that a model that had collapsed to label 1 ran on the label 1 test data and a non-collapsed model ran on the label 0 test data, and then reported that their model was incredibly accurate on label 1 test data. Thus, even if the label 0 model had mediocre performance, they could lift their numbers by combining it with the 100% accuracy of the label 1 model.
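Here is a small sketch of both the collapse check and the arithmetic behind the inflated number. Everything in it is hypothetical: the toy models, the inputs, and the resulting 60% and 80% figures are invented for illustration and are not the paper's numbers.

```python
from typing import Callable, List

Model = Callable[[float], int]  # hypothetical: feature -> predicted label


def accuracy(model: Model, inputs: List[float], true_label: int) -> float:
    """Accuracy on a test set where every example has the same true label."""
    return sum(model(x) == true_label for x in inputs) / len(inputs)


def is_collapsed(
    model: Model, label0_inputs: List[float], label1_inputs: List[float]
) -> bool:
    """A model is collapsed if it predicts one label on data of BOTH classes."""
    predictions = {model(x) for x in label0_inputs + label1_inputs}
    return len(predictions) == 1


def collapsed_model(x: float) -> int:
    return 1  # always predicts label 1, regardless of input


def threshold_model(x: float) -> int:
    return int(x > 0.5)  # a mediocre but non-collapsed toy classifier


# Made-up test inputs for each class.
label0_inputs = [0.1, 0.2, 0.3, 0.6, 0.7]
label1_inputs = [0.4, 0.8, 0.9, 0.95, 0.99]

# What the split evaluation would report:
acc_label1 = accuracy(collapsed_model, label1_inputs, 1)  # collapsed model, label 1 data
acc_label0 = accuracy(threshold_model, label0_inputs, 0)  # other model, label 0 data
n1, n0 = len(label1_inputs), len(label0_inputs)
reported = (acc_label1 * n1 + acc_label0 * n0) / (n1 + n0)

# What the collapsed model actually does on the opposite class:
acc_collapsed_on_label0 = accuracy(collapsed_model, label0_inputs, 0)

print(f"collapsed? {is_collapsed(collapsed_model, label0_inputs, label1_inputs)}")
print(f"reported (pooled) accuracy: {reported:.2f}")
print(f"collapsed model on label 0 data: {acc_collapsed_on_label0:.2f}")
```

The check that exposes the problem is simply running each model on data of the opposite label, which the split evaluation never does.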

After making note of this, I posted an issue on the repository. The authors responded:

We see the issue, but we did this because early language models don't generalize OOD so we had to use one model for fraudulent and one for legitimate

(where fraudulent is label 1 and legitimate is label 0). They then edited this response to say:

We agree there is some redundancy, we did it to make things easier for ourselves. However, this is no longer sota results and we direct you to [a link to a new repo for a new paper they published].

I responded:

The issue is not redundancy. The code selects different claim-extractors based on the true test label, which is label leakage. This makes reported accuracy invalid. Using a single claim extractor trained once removes the leakage and the performance collapses. If this is the code that produced the experimental results reported in your manuscript, then there should be a warning at the top of your repo to warn others that the methodology in this repository is not valid.

After this, the authors removed the repository.

If you want to look through the code…

Near the top of this post, I link to the problematic file that is supposed to produce the main results of the paper, where the authors initialize the two models. Under their main function, you can see that they first load label 1 data with load_datasets_fraudulent() at line 250, then initialize one model with bert_transformer() at line 268, train and test that model, then load label 0 data with load_datasets_legitimate() at line 352, then initialize a second model with bert_transformer() at line 370.

Calling out unethical research papers

I was frustrated that I had spent so much time trying to understand and implement a method that, in hindsight, wasn't valid. Once the authors removed their repository, I assumed there wasn’t much else to do. But reading the recent post about the flawed Apple ICLR paper reminded me how easily issues like this can propagate if no one speaks up.

I’m sharing this in case anyone else tries to build on that paper and runs into the same confusion I did. Hopefully it helps someone avoid the same time sink, and encourages more transparency around experimental practices going forward.
