Tabular data, consisting of samples (rows) and features (columns), is a prevalent data type in digital technology, increasingly used in open data published by public administrations. Despite its wide use, there’s a significant lack of tools that allow non-technical users to easily explore this data, which limits the public benefit of open data initiatives.

Conversational User Interfaces (CUIs) such as chatbots and voicebots could improve the accessibility of tabular data. Until now, chatbots for tabular data have been either manually created (an option that does not scale) or completely reliant on general-purpose LLMs (with limited capacity, especially for larger datasets, and a risk of generating wrong answers).

We are now introducing a scalable, no-code tool that automatically creates chatbots for tabular data based on a schema inferred from the data itself. This schema can optionally be enhanced by the user or, if needed, automatically with the help of a Large Language Model (LLM).

Our generated chatbots can handle a broad range of user intents and incorporate LLMs to generate responses via English-to-SQL translation when the user’s question does not match any recognized intent. The entire setup is managed on our DataBot platform, which supports data import, chatbot management, and interaction via text or voice, providing outputs as tables or charts. This approach not only simplifies the exploration and usage of tabular data for users without technical expertise but also offers organizations a convenient, effective way to enhance the value and utility of their data assets. The DataBot platform relies on our own BESSER Bot Framework, an open-source Python-based library to build all types of bots.

We’re going to show a demo of the tool at CUI 2024, but keep reading for a preview of the tool architecture and main features! You can also read the full paper. This work is led by Marcos Gómez with the collaboration of Robert Clarisó and yours truly.

Generation process

The image heading the post shows the workflow our tool follows to generate the bots from an initial tabular data source, depicted as a CSV file. The process is fully automatic, although the data owner can optionally participate in the data schema enrichment step to generate more powerful bots. This enrichment can also be automated by using LLMs. The next subsections describe each step in more detail.

Data Schema inference

To automatically create a bot, we only need one ingredient: a tabular dataset. The dataset must follow a 2D structure, composed of columns (attributes) and rows (records). Therefore, valid formats for this approach include CSV or XLSX, although other formats such as JSON or XML can be supported as long as they follow a tabular-like structure.

From the structure of the dataset, we gather the list of columns/fields (with their names). From the analysis of the dataset content, we infer the data type of each field (numeric, textual, date-time, …) and its diversity (the number of different values present in that specific column). Based on a predefined (but configurable) diversity threshold, we automatically classify fields under the threshold as categorical. Categorical fields are implemented as an additional bot entity so that users can directly refer to their values in any question. All this information constitutes the metadata the bot will be trained on.
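
To make the inference step more concrete, here is a minimal sketch of the logic using pandas. The FieldSchema structure and the default threshold of 20 are illustrative assumptions on our side, not the actual DataBot implementation:

from dataclasses import dataclass, field

import pandas as pd


@dataclass
class FieldSchema:
    name: str
    dtype: str                  # 'numeric', 'datetime' or 'textual'
    diversity: int              # number of distinct values in the column
    categorical: bool = False
    values: list = field(default_factory=list)  # kept only for categorical fields


def infer_schema(csv_path: str, diversity_threshold: int = 20) -> list:
    df = pd.read_csv(csv_path)
    schema = []
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            dtype = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(series):
            dtype = "datetime"  # dates read from CSV may need explicit parsing first
        else:
            dtype = "textual"
        diversity = series.nunique()
        categorical = diversity <= diversity_threshold
        schema.append(FieldSchema(
            name=col,
            dtype=dtype,
            diversity=diversity,
            categorical=categorical,
            values=series.dropna().unique().tolist() if categorical else [],
        ))
    return schema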

At this point, the chatbot can already be generated. Thanks to the schema inferred from the data source, the chatbot will be able to recognise certain kinds of questions relying on the knowledge extracted from the data (e.g., “Which is the maximum value in X?”, where X would correspond to one of the detected numeric columns). Think of fields and rows as input and output parameters of the user questions the bot must be able to answer; e.g., users can ask for the value in field X of rows satisfying a certain condition in field Y.

Data Schema enhancement

Even though chatbots can be automatically created, they could have limited success for some data sources. This could happen, for instance, if the user employs words very different from those present in the data schema. Another common problem is the semantics of fields. Several fields may compose an address (street, number, city, etc.) or a date (year, month, day, hour). This is something users could ask about, but they are probably not aware of the internal structure of the dataset and could ask about fields that do not explicitly exist as such.

Our bots’ philosophy is based on being certain about the answers they provide. Loosening the bot’s comprehension to let it guess at what it does not understand could therefore lead to a much higher failure rate. This can be considered a limitation, but it also acts as a safety mechanism. Nevertheless, we provide the user with the necessary tools to optionally enhance their bots’ capabilities by enriching the automatically inferred data schema. As an example, the schema could be enriched by adding synonyms or creating new virtual columns (the result of merging fields). These improvements can be done either manually by the bot creator or by using an LLM to automate the process.
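
As an illustration, the sketch below shows what these two enrichment operations could look like on top of the FieldSchema structure from the previous sketch; the function names are hypothetical, and df is assumed to be the pandas DataFrame holding the dataset:

def add_synonyms(field_schema, synonyms):
    """Attach alternative names users may employ for a field."""
    field_schema.synonyms = getattr(field_schema, "synonyms", []) + synonyms


def add_virtual_column(df, name, source_fields, sep=" "):
    """Merge several raw fields into a virtual column users can query directly."""
    df[name] = df[source_fields].astype(str).agg(sep.join, axis=1)
    return df


# Manual enrichment by the bot creator (an LLM could propose the same edits):
# add_synonyms(salary_field, ["wage", "income", "pay"])
# add_virtual_column(df, "address", ["street", "number", "city"])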

Bot Generation

The bot generation phase takes the (potentially enriched) data schema and instantiates a set of predefined conversation patterns, gathered, improved and extended through several experiments with users, to generate the actual set of questions the bot will be trained on (i.e., the intents’ training sentences). The training phase of the bot varies depending on the chosen intent classifier component. The main idea is that this component learns the kinds of questions it will receive by seeing annotated example (training) sentences. On top of this core component, the generator adds the fallback mechanism and other auxiliary conversations and components to create a fully functional bot.
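
The following sketch illustrates the idea of instantiating conversation patterns with the schema information. The concrete templates here are invented for the example (the real template bundle was refined through user experiments), and the schema is a list of the FieldSchema objects from the earlier sketch:

TEMPLATES = {
    "max_value": [
        "Which is the maximum value in {field}?",
        "What is the highest {field}?",
    ],
    "value_frequency": [
        "How many rows have {field} equal to {value}?",
        "How often does {value} appear in {field}?",
    ],
}


def generate_training_sentences(schema):
    """Expand each template with the fields (and values) it applies to."""
    training = {intent: [] for intent in TEMPLATES}
    for f in schema:
        if f.dtype == "numeric":
            training["max_value"] += [t.format(field=f.name) for t in TEMPLATES["max_value"]]
        if f.categorical:
            for v in f.values:
                training["value_frequency"] += [
                    t.format(field=f.name, value=v) for t in TEMPLATES["value_frequency"]
                ]
    return training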

Intents

The generated bots contain a set of predefined intents, whose training sentences are generated from a template bundle and completed with the data schema information. These intents have been designed to suit as many datasets as possible and to match any dataset query that involves the columns and their values as embedded parameters. Regarding the tabular answer intents, these queries mainly (but not only) include any SQL ‘select’ statement (in an equivalent natural language form), column comparisons, and column or value frequencies; regarding the chart answer intents, they cover histogram, boxplot, bar, pie or line chart generation, among others.
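
As an example of how a recognized ‘select’-style intent could be answered, here is a sketch in which the extracted entity parameters are plugged into a simple pandas query; the function and parameter names are illustrative, not the actual DataBot code:

import pandas as pd


def answer_select(df: pd.DataFrame, target_field: str,
                  filter_field: str, filter_value) -> pd.DataFrame:
    """'What is the <target_field> of rows where <filter_field> is <filter_value>?'"""
    return df.loc[df[filter_field] == filter_value, [target_field]]


# e.g., "What is the salary of the employees whose city is Luxembourg?"
# answer_select(df, target_field="salary", filter_field="city", filter_value="Luxembourg")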

The advantage of this conversational model is that it is easily extensible: new bot capabilities can be integrated by just defining a new intent and its corresponding response.

Entities

The bots contain a set of entities used to recognize relevant elements within intents through their parameters. These entities consist of a set of values, each of them with an optional set of synonyms. In our context, the parameters the bots must recognize are mainly elements relative to the data they serve, such as field names or values. In other words, their content depends on the data content. There are also data-independent entities such as operators (e.g., ‘maximum’ or ‘minimum’) or row names (e.g., ‘row’ or ‘entry’, though the user can add domain-specific row names, like ‘person’ or ‘employee’).
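
A minimal sketch of this entity model, using plain Python dictionaries rather than the exact BESSER Bot Framework API:

# Data-dependent entity: values come from the inferred schema (field names here).
field_entity = {
    "name": "field_name",
    "values": {
        "salary": ["wage", "income"],   # value -> synonyms
        "city": ["town", "location"],
    },
}

# Data-independent entity: operators users may mention in their questions.
operator_entity = {
    "name": "operator",
    "values": {"maximum": ["max", "highest"], "minimum": ["min", "lowest"]},
}

# Row-name entity, extensible with domain-specific names by the bot creator.
row_entity = {
    "name": "row_name",
    "values": {"row": ["entry", "record", "person", "employee"]},
}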

Using the generated bots

Once admin users are ready to generate a chatbot, they can simply press the Train & Run buttons to (locally) deploy the bot.

It will then be available in the playground tab of the platform. The playground is the UI for the chatbot end-user (e.g., the citizen). It offers an interactive dashboard with a chat box on the left side of the canvas, together with a text input box and a voice input button. On the right side is the dashboard itself, composed of a set of tabs that organize the chatbot-generated content and the configuration options. It is also possible to create filters that restrict the search space of the bot when generating an answer (e.g., filter by gender, before some date, or with some numeric field lower than a threshold).
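
As an illustration, filters could be applied by narrowing the underlying DataFrame before any answer is computed; this pandas-based sketch and its (field, operator, value) representation are our own assumptions:

import pandas as pd


def apply_filters(df: pd.DataFrame, filters: list) -> pd.DataFrame:
    """Each filter is a (field, operator, value) triple set from the dashboard."""
    ops = {
        "==": lambda s, v: s == v,
        "<": lambda s, v: s < v,
        ">": lambda s, v: s > v,
        "before": lambda s, v: pd.to_datetime(s) < pd.to_datetime(v),
    }
    for field_name, op, value in filters:
        df = df[ops[op](df[field_name], value)]
    return df


# e.g., apply_filters(df, [("gender", "==", "F"), ("salary", "<", 50000)])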

The Figures below show two different interactions with a chatbot that generated graphical and tabular answers, respectively.

Example of a DataBot interaction with a graphical response

DataBot interaction with a tabular response

Next steps

As further work, we plan to enrich the training of the chatbots with the use of ontologies. The idea would be to map the data schema to ontological concepts to be able to consider more semantic information in the training. We also plan to extend the set of conversation patterns including questions on the validity, origin and possible biases of the data.

And you, what would you like to see next?
