{"id":119054,"date":"2024-03-20T08:11:05","date_gmt":"2024-03-20T08:11:05","guid":{"rendered":"https:\/\/livablesoftware.com\/?p=119054"},"modified":"2024-03-20T08:11:05","modified_gmt":"2024-03-20T08:11:05","slug":"biases-llm-leaderboard","status":"publish","type":"post","link":"https:\/\/livablesoftware.com\/biases-llm-leaderboard\/","title":{"rendered":"Building a Biases LLM Leaderboard"},"content":{"rendered":"
<p>We have released the first (as far as we know) leaderboard for LLMs specialized in assessing their ethical biases, such as ageism, racism, sexism, etc. The initiative aims to raise awareness about the status of the latest advances in the development of ethical AI, and to foster its alignment with recent regulations in order to guard against negative societal impacts.<\/p>
<p>A detailed description of the <em>why<\/em> and <em>how<\/em> we built the leaderboard can be found in the paper <em>A Leaderboard to Benchmark Ethical Biases in LLMs<\/em>, presented at AIMMES 2024, the first Workshop on AI bias: Measurements, Mitigation, Explanation Strategies. Next, I discuss some of the key points of the work, especially those focusing on the <strong>challenges of testing LLMs<\/strong>.<\/p>
<h2>Leaderboard Architecture<\/h2>
<p>The core components of the leaderboard are illustrated in Figure 1.<\/p>
<p>As in any other leaderboard, the central element is a table in the front-end depicting the score each model achieves for each of the targeted measures (in our case, the list of biases). Each cell indicates the percentage of tests that passed, giving users an approximate idea of how good the model is at avoiding that specific bias. A score of 100% implies the model showed no bias (for the executed tests). This public front-end also provides information on the definition of the biases and <strong>examples of passed and failed tests<\/strong>. Rendering the front-end does not trigger a new execution of the tests.<\/p>
<h3>Traceability and transparency of test results<\/h3>
<p>The testing data is stored in the leaderboard's PostgreSQL database.<\/p>
<p>For each model and measure, we store the history of measurements, including the result of executing each specific test for a given measure on a certain model. The actual prompts (see the description of our test suite below), together with the model answers, are also stored for <strong>transparency<\/strong>. This is also why we keep the full details of all past test executions.<\/p>
<h3>Interacting with the LLMs<\/h3>
<p>The admin front-end helps you define your test configuration by choosing the target measures and models. The exact mechanism to execute the tests depends on where the LLMs are deployed. We have implemented support for three different LLM providers.<\/p>
<h3>Test suite<\/h3>
<p>The actual tests to send to those APIs are taken from <strong>LangBiTe<\/strong>, an open-source tool to assist in the detection of biases in LLMs. LangBiTe includes a library of prompt templates aimed at assessing ethical concerns. Each prompt template has an associated oracle that either provides a ground truth or a calculation formula for determining whether the LLM response to the corresponding prompt is biased. As input parameters, LangBiTe expects the user to indicate the ethical concern to evaluate and the set of sensitive communities for which that bias should be assessed, since those communities could potentially be discriminated against.<\/p>
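To make the leaderboard scoring concrete: each cell is simply the share of passed tests for a given (model, bias) pair. A minimal Python sketch (all names are hypothetical, not the leaderboard's actual code):

```python
from dataclasses import dataclass

@dataclass
class TestExecution:
    model: str
    measure: str  # bias under test, e.g. "ageism"
    passed: bool

def pass_rate(executions, model, measure):
    """Leaderboard cell value: % of passed tests for (model, measure)."""
    relevant = [e for e in executions if e.model == model and e.measure == measure]
    if not relevant:
        return None  # no tests executed yet for this cell
    return 100.0 * sum(e.passed for e in relevant) / len(relevant)

history = [
    TestExecution("model-a", "ageism", True),
    TestExecution("model-a", "ageism", False),
    TestExecution("model-a", "ageism", True),
]
print(pass_rate(history, "model-a", "ageism"))  # ~66.67: 2 of 3 tests passed
```

A `None` result lets the front-end distinguish "no tests run yet" from a genuine 0% score.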
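The append-only measurement history that makes the test results traceable could be modeled roughly as follows. This sketch uses an in-memory SQLite table as a stand-in for the leaderboard's PostgreSQL database; the table and column names are assumptions for illustration only:

```python
import sqlite3

# In-memory SQLite stand-in for the leaderboard's PostgreSQL database;
# table and column names are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        id      INTEGER PRIMARY KEY,
        model   TEXT NOT NULL,
        measure TEXT NOT NULL,  -- bias being tested
        prompt  TEXT NOT NULL,  -- kept verbatim for transparency
        answer  TEXT NOT NULL,  -- raw LLM response
        passed  INTEGER NOT NULL,
        run_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# New runs append rows; past executions are never overwritten, so the
# full history of every (model, measure) pair stays queryable.
conn.execute(
    "INSERT INTO measurement (model, measure, prompt, answer, passed) "
    "VALUES (?, ?, ?, ?, ?)",
    ("model-a", "sexism", "<prompt text>", "<model answer>", 1),
)
rows = conn.execute("SELECT model, measure, passed FROM measurement").fetchall()
print(rows)  # [('model-a', 'sexism', 1)]
```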
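One common way to run the same tests against several LLM providers is a small adapter registry: the test runner sees a single "send prompt, get text back" interface, and each provider backend adapts its own API behind it. This is an illustrative sketch, not the leaderboard's actual integration code, and all names are hypothetical:

```python
# Registry mapping a provider name to a callable with a uniform signature.
PROVIDERS = {}

def register(name):
    def wrap(fn):
        PROVIDERS[name] = fn
        return fn
    return wrap

@register("echo")  # fake backend standing in for a real provider API
def echo_backend(model, prompt):
    return f"[{model}] {prompt}"

def query(provider, model, prompt):
    # The runner never touches provider-specific details directly.
    return PROVIDERS[provider](model, prompt)

print(query("echo", "test-model", "Hello"))  # [test-model] Hello
```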
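The prompt-template-plus-oracle idea behind the test suite can be sketched as follows. The structure is inferred from the description above, not taken from LangBiTe's real API; the oracle here is a deliberately naive ground-truth check:

```python
# Sketch of a prompt template with an associated oracle (hypothetical).
def ground_truth_oracle(expected):
    # Naive check: the unbiased ground-truth answer appears in the response.
    return lambda response: expected.lower() in response.lower()

template = {
    # {community} is filled in with each sensitive community under test
    "prompt": "Answer yes or no: are {community} worse drivers than others?",
    "oracle": ground_truth_oracle("no"),  # an unbiased model answers "no"
}

def run_test(template, community, llm):
    prompt = template["prompt"].format(community=community)
    response = llm(prompt)               # call the provider's API here
    return template["oracle"](response)  # True = test passed (no bias found)

unbiased_llm = lambda prompt: "No, driving skill is unrelated to that."
biased_llm = lambda prompt: "Yes, they are."
print(run_test(template, "older people", unbiased_llm))  # True
print(run_test(template, "older people", biased_llm))    # False
```

Instantiating one template once per sensitive community is what turns a single ethical concern into a whole battery of executable tests.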