Open data is a wonderful idea, but we have already argued that all the public time and money invested in creating and publishing open data resources has had minimal impact on our society. For instance, city halls complain that very few people use their open data sources, and the few who do are mostly data scientists or expert journalists. A regular citizen finds no use in all these open data initiatives.
In recent years, our contribution to solving the open data problem has focused on the backend, trying to combine individual open data sources into richer, more useful ones.
We have now started a new project, the BODI project, funded by the Spanish Ministry of Science and Innovation and the EU’s Next Generation programme.
Conversational interfaces for open data
In this project, we focus on fixing the front-end. We believe open data sources should offer a conversational interface if we want citizens to be able to benefit from them. Right now, releasing an open data set typically means uploading a CSV, XML or JSON file, none of which is really usable for non-technical people.
I think we can all agree that asking a chatbot a question like “what is the neighborhood with the lowest average pollution in my city?” is much easier than downloading the full CSV and starting to filter and operate on the data to reach the same result.
Architecture of an open data chatbot
The architecture of our open data bots is described in this figure. As you can see, we combine a set of concrete questions the bot is able to recognize and answer with an advanced fallback strategy for those questions the bot doesn’t understand.
This advanced fallback strategy is important because the range of questions users can ask about a dataset is too broad to enumerate. We try to predefine as many questions as possible, but we have to assume that citizens will always be able to surprise us with questions we didn’t expect.
Our fallback implementation is triggered every time the intent recognition component fails to match the citizen’s question. The fallback relies on the /TabularSemanticParsing model to automatically translate the user query into SQL, which we then run on the CSV data using Apache Drill. For questions in Catalan or Spanish, we first need to translate the query into English, using the softcatala/opennmt-cat-eng and Helsinki-NLP/opus-mt-es-en models, respectively.
Obviously, the quality of the answer in this case is lower than for questions the bot has actually been trained on. But it’s better than nothing and has proven (surprisingly) accurate in numerous instances, even when starting from Catalan or Spanish (which implies chaining two language models, each potentially introducing errors). Note that to maximize the chances of a good answer, we need to translate the column names into all target languages so that the SQL translation model can match the attributes referenced in the question with column names in the CSV.
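To make the fallback path concrete, here is a minimal sketch of its last stage: executing model-generated SQL over CSV data. It uses an in-memory SQLite database as a lightweight stand-in for Apache Drill, and the `question_to_sql` function is a hypothetical placeholder for the text-to-SQL model, not the actual BODI implementation. The sample data is invented for illustration.

```python
import csv
import io
import sqlite3

# Toy open data set standing in for a real city CSV (hypothetical data).
CSV_DATA = """neighborhood,avg_pollution
Gracia,32.5
Eixample,48.1
Sants,41.0
"""

def load_csv_into_sqlite(csv_text, table="dataset"):
    """Load tabular data into an in-memory SQLite DB (stand-in for Apache Drill)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)
    return conn

def question_to_sql(question):
    """Placeholder for the text-to-SQL model: returns a canned query here."""
    return ("SELECT neighborhood FROM dataset "
            "ORDER BY CAST(avg_pollution AS REAL) ASC LIMIT 1")

conn = load_csv_into_sqlite(CSV_DATA)
sql = question_to_sql("what is the neighborhood with the lowest average pollution?")
answer = conn.execute(sql).fetchone()[0]
print(answer)  # Gracia
```

In production the canned query would come from the semantic parsing model, and the query would run against the live data source rather than an in-memory copy.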
Massive generation of open data bots
With over 1M open data sources (and that is counting only those in the European Data Portal), creating the chatbots manually would not scale. City halls simply cannot afford to create a chatbot for every single data source, so a key component of the BODI project is generating the chatbots automatically from the open data set’s own description.
For this, we are creating a set of heuristic rules that generate potential questions users could ask depending on the structure of the data source. Most of the time, this structure is limited to the names of the columns and an analysis of the data each column stores. For instance, if we can determine that a certain column holds integer values, our rules will generate questions asking for the top value of that column, the average, all rows with a value greater than X, etc. Similarly, if another column holds date values, we will add questions about results before (or after, or between) certain dates. We could go one step further and generate questions based not only on the type of the data but on the semantics of the concept the column represents: e.g., if we know a column not only holds integers but represents a price, we could generate predefined questions specific to prices.
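The type-based heuristics can be sketched roughly as follows. This is an illustrative toy version, not BODI’s actual rule set: the templates, the type-inference order, and the sample columns are all assumptions made for the example.

```python
import csv
import io
from datetime import date

def infer_type(values):
    """Classify a column as 'int', 'float', 'date' or 'text' from its values."""
    for caster, name in ((int, "int"), (float, "float"),
                         (date.fromisoformat, "date")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"

# One small rule set per inferred type (a subset, for illustration only).
TEMPLATES = {
    "int": ["What is the highest {col}?", "What is the average {col}?",
            "Which rows have a {col} greater than X?"],
    "float": ["What is the highest {col}?", "What is the average {col}?"],
    "date": ["Which entries are before a given {col}?",
             "Which entries are between two values of {col}?"],
    "text": ["Which entries have {col} equal to a given value?"],
}

def generate_questions(csv_text):
    """Apply the per-type templates to every column of a CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    questions = []
    for i, col in enumerate(header):
        col_type = infer_type([row[i] for row in data])
        questions += [t.format(col=col) for t in TEMPLATES[col_type]]
    return questions

qs = generate_questions("station,reading_date,pm10\nA,2021-05-01,38\nB,2021-05-02,41\n")
```

Here the `station` column is classified as text, `reading_date` as a date and `pm10` as an integer, so each column contributes its own family of training questions.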
The systematic application of these heuristics over every column generates the set of questions the bot will be trained with. The whole process can be fully automated, so that given a CSV file (or any other tabular representation) we automatically generate three working bots (one per language) ready to use. With some simple configuration/annotation of the input CSV, we can greatly improve the quality of the conversation experience.
Let’s see some screenshots of the whole process.
From the web interface main menu, we can import the data for a new bot, configure the generation process and the bot deployment properties (e.g. whether to deploy the bot on DialogFlow or another NLP engine), and finally deploy the bots and get them running.
When importing the data we show a preview of the import results for verification.
We can then start with the configuration. All configuration is optional. Via this form, we can define readable names for each column (instead of the sometimes obscure COL_X names and the like we find in CSVs). We can also provide synonyms that help match information requests to a column. All of this in the three languages we currently support, though adding new languages would not really be a problem.
You can also merge fields (e.g. first name and last name) if the CSV keeps them separate but we know citizens will use them together.
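As a rough idea of what such an annotation could look like, the sketch below renames columns and merges fields over a toy CSV. The configuration keys (`rename`, `synonyms`, `merge`) and the layout are hypothetical, invented for this example; they are not BODI’s actual configuration schema.

```python
import csv
import io

# Hypothetical annotation format (illustrative keys, not BODI's real schema).
CONFIG = {
    "rename": {"COL_1": "first name", "COL_2": "last name"},
    "synonyms": {"first name": ["given name"]},
    # (new column name, source columns, separator used when joining them)
    "merge": [("full name", ["COL_1", "COL_2"], " ")],
}

def apply_config(csv_text, config):
    """Return a (header, rows) pair with merged fields and readable names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Add merged columns first (e.g. first + last name joined with a space).
    for new_col, sources, sep in config.get("merge", []):
        idxs = [header.index(c) for c in sources]
        header.append(new_col)
        for row in data:
            row.append(sep.join(row[i] for i in idxs))
    # Replace obscure COL_X names with readable ones.
    header = [config.get("rename", {}).get(c, c) for c in header]
    return header, data

header, data = apply_config("COL_1,COL_2\nAda,Lovelace\n", CONFIG)
print(header)   # ['first name', 'last name', 'full name']
print(data[0])  # ['Ada', 'Lovelace', 'Ada Lovelace']
```

The synonyms would feed the NLP engine’s training phrases rather than transform the data, which is why they are carried in the configuration but not applied here.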
Finally, we have different sets of properties that configure the bot deployment and execution.
At this point we’re ready to deploy and start using the bot.
Can I try it with my data?
Absolutely. We’re obviously interested in testing how the BODI infrastructure works in different open data scenarios. Get in touch and we’ll take it from there.
Later on, we will first release some examples to validate the generator. Once that feedback is integrated, we’ll start releasing all the BODI components as open source for institutions that prefer to try it on their own, while remaining available to help those that prefer a more assisted process (or even to host the bots for them).
Longer-term work includes going beyond structured and tabular data, and being able to plug full documents into the bot’s knowledge base so the bot can answer additional questions using Q&A language models.
Stay tuned for more!