Moin, Hendrik here!


Leveraging Underrepresented Data: A Dive into Synthetic Data Generation with Ollama and Mistral


tl;dr: Addressing the shortage of specific bank transaction data, I employed LLMs to produce synthetic bank transactions. The generated data looks authentic, but domain knowledge and rigorous validation are essential.

In the fast-paced domain of machine learning and data analytics, the statement “data is king” holds a significant truth. The richness and volume of data available can critically impact the effectiveness and accuracy of machine learning projects. However, most of the projects I’ve worked on have suffered from limited resources and domain-specific datasets, which often include underrepresented data. This often leads to a distorted learning process and a suboptimal outcome. This issue becomes particularly significant when navigating niche or multilingual domains.

Within my primary domain of financial analytics, accurate labeling of bank transactions is crucial for robust bank account analysis. Especially in Germany, where B2B bank account analytics and related credit risk assessments are still evolving, accurately labeling diverse transactions remains a challenge. My current model aims to classify these transactions to understand spending patterns and evaluate financial health. However, underrepresentation of certain labels compromises the model’s ability to distinguish critical transaction types, possibly leading to significant misclassifications.

Synthetic data generation emerges as a robust tool in the toolkit of data scientists, offering a solution to augment datasets with meaningful, artificially generated data that closely mirrors the attributes of real-world data. By harnessing this technique, we can not only bolster our datasets but also fine-tune model performance, paving the way for sharper insights, even in situations where genuine data is sparse or confidential.

There are numerous ways to generate synthetic data, such as the Synthetic Minority Over-sampling Technique (SMOTE) or lexical replacement, which are popular and effective in many cases. However, these methods sometimes fall short in preserving the nuanced characteristics and inherent structure of the original data, especially in complex or domain-specific scenarios. The latter, replacing words with synonyms, can also introduce bias into the model if the problem is complex and the person in charge is not fully versed in the language or domain. That’s why I wanted to put a slightly different spin on the augmentation of text data to leverage underrepresented data.

Disclaimer: It’s important to note that this exploration is a means to provoke thought and share insights rather than a definitive solution to the challenges at hand.

The Imbalance Dilemma

My dataset in question is a large collection of bank account transactions from German SMEs. The data remains confidential and this article will only include fictional bank transactions.

Every bank transaction within this dataset has been labeled using a custom-built rule-based labeling web app (a tool that I might delve deeper into in another blog post). With over 45 distinct labels attributed to these transactions, the dataset presents a spectrum of transaction types. However, a closer look uncovers a striking imbalance: while some labels appear tens of thousands of times (30k-70k instances), others, particularly the marginalized ones, make a sparse appearance, ranging between 50 and 500 occurrences.

You might be wondering: if these labels are so underrepresented in the dataset, are they worth considering for the model? The answer is yes. Let me illustrate with an example. Among the least represented labels, two are the primary reason I initiated this endeavor in the first place: “Pfändung” (account seizure) and “Lohnpfändung” (salary seizure). To the untrained eye, both might appear synonymous, but their implications for a company are quite different. A regular seizure, or “Pfändung,” often signals financial instability, a red flag that can deter financial institutions. “Lohnpfändung,” on the other hand, pertains to the seizure of an employee’s salary, triggered by a creditor aiming to reclaim owed money. Crucially, this does not reflect the financial health of the business itself.

Tool Overview

In this exploration, I employed Ollama and Mistral’s 7b model in the quest for higher-quality synthetic data generation. Ollama serves as a gateway, allowing seamless access to multiple LLMs directly on your machine. Mistral 7b is a recently unveiled Apache-licensed model with 7.3B parameters. While Mistral appears to surpass its counterparts in certain areas, this exploration could have been executed using any competent LLM.

Venturing into Synthetic Transaction Generation

With the advent of LLMs, I thought I could potentially ask the model to generate some transactions, so I started to play around with Ollama/Mistral 7b. One of my first learnings was that Mistral 7b is able to generate correctly formatted JSONs way better than I thought it would. Here’s an initial prompt I used:

[INST]Generate a json that includes a key "name" and bank "booking-text" and a completely random company name of your choosing as a string. [/INST]

This prompt results in a fairly consistent JSON string in the desired format. Here is an example:

{
    "name": "Acme Corp.",
    "booking-text": "Thank you for booking with us!",
    "company-name": "Random Company Name"
}

One quirk I quickly noticed is that the output text often included some sort of markdown formatting, such as code fences. To use the generated JSON, it is necessary to extract and validate it first. This step was crucial for seamlessly integrating the generated data into my code.
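Here is a minimal sketch of that extraction step. The `extract_json` helper name matches the one used in the snippet further below; the body is my own illustration: it strips markdown code fences, grabs the first JSON-looking span, and validates it with `json.loads`.

```python
import json
import re

def extract_json(text):
    """Pull the first JSON object out of an LLM response that may
    wrap it in markdown fences or surrounding prose."""
    # Drop markdown code fences like ```json ... ```
    cleaned = re.sub(r"```(?:json)?", "", text)
    # Grab the first {...} span (DOTALL so it may span multiple lines)
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return None
    candidate = match.group(0)
    try:
        json.loads(candidate)  # validate before handing it back
    except json.JSONDecodeError:
        return None
    return candidate
```

Validating before returning means a malformed response simply yields `None`, which the calling code can treat as a signal to re-prompt the model.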

Refining the Synthetic Data Generation Process

The ‘knowledge’ possessed by a Large Language Model (LLM) is fundamentally rooted in its training data. While this data covers a vast array of subjects and niches, it most likely does not include your specific subject and niche, at least not to the extent that you have access to it.* In my venture, this translated into a simple realisation: the LLM lacked familiarity with the format of my bank transactions in JSON. It needed my guidance to specify which fields I required.

* this holds especially true for general models as opposed to those fine-tuned and enriched for particular tasks

The Core Premise: Real to Synthetic

The core idea of my process rested on the premise of taking a real transaction from my dataset, exposing certain parts to the model, and instructing it to modify the transaction at will. This approach aimed at harnessing the model’s generative capabilities while maintaining a firm tether to real-world data.

The next step was to create a prompt to guide the model in generating a synthetic transaction based on the real one. Here’s one version of the prompt that gave the first satisfactory results:

[INST] You will generate a bank transaction of the class "{label}". Include "main_account_holder": "{main_account_holder}" in your considerations. For this generate a JSON that includes keys "counter_holder", "purpose", and "main_account_holder". The counter_holder should be similar to this string: "{counter_holder}". In case it is a name insert a similar sounding name and in case it is a company name, use a similar company name. The purpose should be an array of strings which keeps the language, German, and the format but changes the vocabulary, numbers and dates slightly."{purpose}".[/INST]

>>> [INST] You will generate a bank transaction of the class "Nebenkosten des Geldverkehrs". Include "main_account_holder": "SampleAccount GmbH" in your considerations. For this generate a json that includes keys "counter_holder", "purpose", and "main_account_holder". The counter_holder should be similar to this string: "SampleAccount GmbH". In case it is a name insert a similar sounding name and in case it is a company name, use a similar company name. The purpose should be a array of strings which keeps the language, German, and the format but changes the vocabulary, numbers and dates slightly."['19% Umsatzsteuer auf EUR 6,90- Abrechnung per 31.12.2023 von Konto 123456789', 'NMSC+835+00905']".[/INST]

{
    "counter_holder": "SampleAccount GmbH",
    "purpose": ["Zuweisung von 19% Umsatzsteuer auf EUR 6.90", "Abrechnung per 31.12.2023 von Konto 123456789"],
    "main_account_holder": "SampleAccount GmbH"
}

Once I saw the generated transactions, I noticed some issues:

  1. Language Discrepancy: The model often generated transactions in English, while my dataset was mostly in German. Even changing the prompt to German didn’t help much; it rather decreased the quality of the output, and the model struggled to produce proper JSONs. Including the language in the prompt improved things slightly but didn’t completely solve the problem.
  2. Missing Data/Fields: Some transactions in my dataset lacked certain data, like a missing purpose, which seemed to confuse the model and led to undesired results.
  3. Too Much Similarity: The model had trouble generating diverse names and purposes, especially for outliers in my dataset.

With these insights, I made some adjustments to my approach:

  1. Language Detection: I used langdetect to check the language and guide the model accordingly.
  2. Prompt Segmentation: I changed the prompt to handle transactions with an empty purpose to prevent the model from getting confused.
  3. Diversity Assessment: I introduced a similarity check to see if the variation in the data was too small; if it was, I would retry generating a new transaction from the same original.

import json

from langdetect import detect

def generate_synthetic_data_using_llm(transaction):
    x_labels = ', '.join(transaction['x_label'])
    counter_holder = transaction['counter_holder']
    main_account_holder = transaction['main_account_holder']
    purpose = str(transaction['purpose'])

    # Handling empty booking texts: only mention the purpose when there is one
    purpose_prompt = (
        f'The purpose should be an array of strings which keeps the language, '
        f'German, and the format but changes the vocabulary, numbers and dates '
        f'slightly."{purpose}".'
    ) if transaction['purpose'] else ''

    prompt_text = (
        f'[INST] You will generate a bank transaction of the class "{x_labels}". '
        f'Include "main_account_holder": "{main_account_holder}" in your considerations. '
        f'For this generate a json that includes keys "counter_holder", "purpose", '
        f'and "main_account_holder". The counter_holder should be similar to this '
        f'string: "{counter_holder}". In case it is a name insert a similar sounding '
        f'name and in case it is a company name, use a similar company name. '
        f'{purpose_prompt}[/INST]'
    )

    def attempt():
        # One generation round: prompt the model, then extract and parse the JSON
        synthetic_data = prompt_model(prompt_text)
        extracted_json = extract_json(str(synthetic_data))
        return json.loads(extracted_json) if extracted_json else None

    def is_german(candidate):
        text = " ".join(str(p) for p in candidate.get('purpose', []))
        try:
            return detect(text) == 'de'
        except Exception:  # langdetect raises on empty or undetectable text
            return False

    synthetic_json = attempt()

    # Retry up to three times if parsing failed or the purpose is not German
    retries = 3
    while retries > 0 and (not synthetic_json or not is_german(synthetic_json)):
        synthetic_json = attempt()
        retries -= 1

    # If the holder names were (near-)identical in the original transaction,
    # keep them aligned in the synthetic one as well
    # (assuming a similarity ratio above 0.9 as significant)
    if synthetic_json and similar(counter_holder, main_account_holder) > 0.9:
        synthetic_json['main_account_holder'] = synthetic_json['counter_holder']

    return synthetic_json

Note: The snippet includes a few helper functions not shown here, but their purposes should be clear from their names.
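For completeness, here is one way those helpers might look. The names mirror the snippet above, but the bodies are my own sketch: `similar` uses `difflib.SequenceMatcher`, and `prompt_model` assumes a local Ollama server listening on its default port and the `/api/generate` endpoint with streaming disabled.

```python
import difflib
import json
from urllib import request

def similar(a, b):
    """Similarity ratio between two strings, in [0, 1]."""
    return difflib.SequenceMatcher(None, a or "", b or "").ratio()

def prompt_model(prompt_text, model="mistral"):
    """Send a prompt to a locally running Ollama instance and return
    the raw response text (assumes the default Ollama REST endpoint)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt_text, "stream": False}
    ).encode("utf-8")
    req = request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` set to `False`, Ollama returns a single JSON object whose `response` field holds the full generated text, which keeps the parsing on the caller's side trivial.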

The Second Iteration

The second iteration of my process stemmed from the core idea of the initial approach but aimed to address its shortcomings in a more robust manner. I mainly used the same checks implemented previously but changed the structure of the prompt to provide more context.

The goal of this iteration was to provide the model with a broader understanding of the transaction type and its variations within the dataset. Unlike the previous approach where the model only had sight of a single transaction, the new strategy involved showcasing one main transaction along with three others of the same category/label. The task for the model was then to alter the main transaction, drawing inspiration from the additional examples provided.

In practice, the prompt looked something like this:

[INST] You will generate a bank transaction of the class {label}. Include "main_account_holder": "{main_account_holder}" in your considerations. For this generate a json that includes keys "counter_holder", "purpose", and "main_account_holder". The counter_holder should be similar to this string: "{counter_holder}". In case it is a name insert a similar sounding name and in case it is a company name, use a similar company name. The purpose should be an array of strings which keeps the language, German, and the format but changes the vocabulary, numbers, and dates slightly."{purpose}". Here are some example transactions of the same type: Transaction 1: {transaction_1}, Transaction 2: {transaction_2}, Transaction 3: {transaction_3}[/INST]

>>> [INST] You will generate a bank transaction of the class "Nebenkosten des Geldverkehrs". For this generate a json that includes keys "counter_holder", "purpose", and "main_account_holder". The counter_holder should be similar to this string: "SampleAccount GmbH". In case it is a name insert a similar sounding name and in case it is a company name, use a similar company name. The purpose should be an array of strings which keeps the language, German, and the format but changes the vocabulary, numbers, and dates slightly."['19% Umsatzsteuer auf EUR 6,90- Abrechnung per 31.12.2023 von Konto 123456789', 'NMSC+835+00905']". Please change it but keep a similar relation to the "main_account_holder": "SampleAccount GmbH" in your considerations. Here are some example transactions of the same type: Transaction 1: {main_account_holder: "RandomCompany GmbH", counter_holder: "", purpose: ["Entgeltabrechnung siehe Anlage"]}, Transaction 2: {main_account_holder: "DifferentCompany GmbH", counter_holder: "", purpose: ["Entgeltabrechnung siehe Anlage"]}, Transaction 3: {main_account_holder: "Deutsche UG (haftungsbeschränkt)", counter_holder: "", purpose: ["Entgelt fuer Echtzeit-Ueberweisung vom 04.10.2023 ueber 2.228,99 Euro 0,25 EUR Echtzeitueberweisung", "GVC+808+00964"]}[/INST]
{
    "counter_holder": "ABC Corporation",
    "purpose": ["19% Umsatzsteuer auf EUR 7.50- Abrechnung per 31.12.2023 von Konto 546789123", "NMSC+810+1274"],
    "main_account_holder": "ABC Corporation"
}

Insights and Adjustments

The results were encouraging. The model exhibited a better performance, demonstrating a noticeable increase in variation. It was refreshing to see the model exploring a wider range of alterations, showcasing its ability to adapt based on the expanded context provided.

However, a new challenge surfaced. The model began altering parts of the transaction that served as identifiers, which was undesirable as it could deteriorate the data quality. To curb this, I refined the prompt to instruct the model on what elements to retain unchanged, specifically within the purpose field. The modified part of the prompt read as follows:

[…]changes the vocabulary, numbers, and dates slightly."{purpose}". However please do not change numbers that identify a transaction, these usually appear in the end and look similar to this GVC+000+00000 or NMSC+000+00000.[…]
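To verify that these identifier codes actually survive the rewrite, a small post-generation check can compare the codes found in the original and synthetic purpose strings. The regular expression below is my own sketch, tuned to the GVC/NMSC-style codes in my data; other datasets would need a different pattern.

```python
import re

# Matches identifier codes such as GVC+808+00964 or NMSC+835+00905
IDENTIFIER_PATTERN = re.compile(r"\b[A-Z]{3,4}\+\d{3}\+\d{5}\b")

def identifiers_preserved(original_purpose, synthetic_purpose):
    """True if every identifier code found in the original purpose
    strings also appears unchanged in the synthetic ones."""
    original_ids = set(IDENTIFIER_PATTERN.findall(" ".join(original_purpose)))
    synthetic_ids = set(IDENTIFIER_PATTERN.findall(" ".join(synthetic_purpose)))
    return original_ids <= synthetic_ids
```

A failed check can feed the same retry logic used for the language test: discard the candidate and prompt the model again.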

The key takeaway here is that providing more context for the model leads to improved results. While this may seem like an apparent observation, it’s crucial to determine the most effective way to deliver this context to the model.

The Third Iteration

In this round, I decided to switch gears from showing the model a single main transaction. This time, I laid out five transactions from the same category and asked the model to come up with something similar. The idea was to give the model a broader view of what transactions of a particular type look like.

For this, I revised the prompt to include more details from each transaction. Now, the model got to see the ‘booking_date,’ ‘counter_holder,’ ‘main_account_holder,’ ‘purpose,’ ‘amount,’ and ‘label’ for every transaction in the bunch. The updated prompt looked like this: [INST] I give you five examples for a bank transaction with the label "{label}". Please generate a new transaction based on these examples. The output should be a json and it should include all the fields of the examples. Change potential names and come up with new ones. Original Transaction 1: {'booking_date': '2023-01-01', 'amount': Decimal('-9.99'), 'currency_id': 'EUR', 'purpose': ['19% Umsatzsteuer auf EUR 9,99- Abrechnung per 31.12.2022 von Konto 123456789', 'NMSC+835+00905'], 'counter_holder': 'John Doe GmbH', 'main_account_holder': 'John Doe GmbH', 'label': ['Nebenkosten des Geldverkehrs']}, Original Transaction 2: …}[/INST]

>>> [INST] I give you five examples for a bank transaction with the label "Nebenkosten des Geldverkehrs". Please generate a new transaction based on these examples. The output should be a json and it should include all the fields of the examples. Change potential names and come up with new ones. Original Transaction 1: {"booking_date": "2023-08-08", "amount": -0.6, "purpose": ["Entgelt fuer Echtzeit-Ueberweisung vom 08.08.2023 ueber 1.000,00 Euro 0,50 EUR Echtzeit-Ueberweisung", "NCHG+808+00964"], "counter_holder": "Hallo GmbH", "main_account_holder": "Hallo GmbH", "label": ["Nebenkosten des Geldverkehrs"]}, Original Transaction 2: {"booking_date": "2023-04-01", "amount": -0.25, "purpose": ["Entgelt fuer Echtzeit-Ueberweisung vom 01.04.2023 ueber 2.500,00 Euro 0,25 EUR Echtzeitueberweisung", "NCHG+808+00964"], "counter_holder": "Heute Nachrichten GmbH", "main_account_holder": "Heute Nachrichten GmbH", "label": ["Nebenkosten des Geldverkehrs"]}, Original Transaction 3: {"booking_date": "2023-12-01", "amount": -54.19, "currency_id": "EUR", "purpose": ["ENTGELTABSCHLUSS Entgeltabrechnung siehe Anlage"], "counter_holder": "", "main_account_holder": "Die Klempner KG", "label": ["Nebenkosten des Geldverkehrs"]}, Original Transaction 4: {"booking_date": "2023-05-31", "amount": -23.1, "purpose": ["ENTGELTABSCHLUSS Entgeltabrechnung siehe Anlage"], "counter_holder": "", "main_account_holder": "AMAZING CRAFTER UG HAFTUNGSBESCHRÄNKT", "label": ["Nebenkosten des Geldverkehrs"]}, Original Transaction 5: {"booking_date": "2023-03-01", "amount": -23.75, "purpose": ["Bargeldauszahlung mit Kundenentgelt COMMERZBANK/GA-2302-BERLIN4/DE 2023-03-01 08:02:11 KMN 4 VJ 2303 Entgelt 3,75EUR"], "counter_holder": "", "main_account_holder": "ZUGBEGLEITER GMBH", "label": ["Nebenkosten des Geldverkehrs"]}[/INST]
{
    "booking_date": "2023-10-01",
    "amount": -57.50,
    "currency_id": "EUR",
    "purpose": [
        "ENTGELTABSCHLUSS Entgeltabrechnung siehe Anlage"
    ],
    "counter_holder": "",
    "main_account_holder": "Epsilon Industries GmbH",
    "label": [
        "Nebenkosten des Geldverkehrs"
    ]
}

Notice that this time, I did not include any instructions about the language or structure. I simply relied on giving the model five similar-looking JSONs and prompting it to generate a new one.

The key was to always show the model five totally unique transactions, picked at random. The thought was that by giving the model fresh examples each time, it would whip up unique outputs, adding more variety to the dataset.
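Drawing those five examples can be as simple as sampling without replacement from the transactions of one label. A sketch, where `transactions_by_label` is a hypothetical mapping from label to its list of transactions:

```python
import random

def pick_examples(transactions_by_label, label, k=5):
    """Randomly pick up to k distinct transactions of one label
    to show the model as examples."""
    pool = transactions_by_label[label]
    # Sample without replacement so every example is unique;
    # cap k for labels with fewer than k transactions
    return random.sample(pool, k=min(k, len(pool)))
```

Sampling fresh examples on every call is what injects variety: no two prompts show the model exactly the same context.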

With more potential fields in the mix, I added a quick check function that verifies that the output includes all of the necessary fields. If it doesn’t, the model is prompted again.
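Such a field check can stay minimal. The sketch below uses a required-field set mirroring the example transactions shown earlier; treat the exact set as illustrative.

```python
# Fields every generated transaction should contain
# (illustrative; mirrors the example transactions above)
REQUIRED_FIELDS = {"booking_date", "amount", "purpose",
                   "counter_holder", "main_account_holder", "label"}

def has_all_fields(synthetic_json, required=REQUIRED_FIELDS):
    """True if the generated transaction is a dict containing every
    required field; otherwise the model should be prompted again."""
    return isinstance(synthetic_json, dict) and required.issubset(synthetic_json)
```

In the generation loop, a `False` result simply triggers another call to the model, the same way the language and parsing checks do.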

This added bit of complexity was well worth it. The model didn’t trip over the extra details. Instead, it churned out completely new but similar-looking outputs that could pass for real, underrepresented data.

The outputs this time around were pretty convincing. They were what I had in mind from the start: a good batch of synthetic data that looked real enough to fill the gaps in the dataset. However, small adjustments and extra content checks were necessary to validate the created transactions, because the quality varied from run to run.

Concluding Thoughts

This journey showed me that generating synthetic data could be a viable solution, especially since I was able to create new and realistic-looking data. The small success in this initial exploration suggests it’s worth diving deeper into this method. However, one clear lesson is the importance of having domain-specific knowledge when working with Large Language Models. LLMs are great at generating text, but they can sometimes go off track, so having the right guidelines is crucial.

While the results are promising, I haven’t yet fully validated the possible biases or errors that could come from using the LLM. So, there’s still a lot more to learn and check. As I continue exploring this, I’m looking forward to learning more about how to effectively use synthetic data generation, and I’m excited about the potential improvements it could bring to my projects.