{"id":13627,"date":"2025-09-06T18:16:23","date_gmt":"2025-09-06T09:16:23","guid":{"rendered":"https:\/\/aireviewirush.com\/?p=13627"},"modified":"2025-09-06T18:16:23","modified_gmt":"2025-09-06T09:16:23","slug":"ijcai2025-distinguished-paper-combining-morl-with-restraining-bolts-to-study-normative-behaviour","status":"publish","type":"post","link":"https:\/\/aireviewirush.com\/?p=13627","title":{"rendered":"#IJCAI2025 distinguished paper: Combining MORL with restraining bolts to learn normative behaviour"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div style=\" \">\n<p><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-15.55.58.png\" alt=\"\" width=\"1236\" height=\"672\" class=\"alignnone size-full wp-image-18321\" srcset=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-15.55.58.png 1236w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-15.55.58-300x163.png 300w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-15.55.58-1024x557.png 1024w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-15.55.58-768x418.png 768w\" sizes=\"(max-width: 1236px) 100vw, 1236px\"><em>Image provided by the authors \u2013 generated using Gemini.<\/em><\/p>\n<p>For many of us, artificial intelligence (AI) has become part of everyday life, and the rate at which we assign previously human roles to AI systems shows no signs of slowing down. AI systems are the essential components of many technologies (e.g., self-driving cars, smart urban planning, digital assistants) across a growing number of domains. At the core of many of these technologies are autonomous agents: systems designed to act on behalf of humans and make decisions without direct supervision. 
In order to act effectively in the real world, these agents must be capable of carrying out a wide range of tasks despite possibly unpredictable environmental conditions, which often requires some form of <strong>machine learning<\/strong> (ML) to achieve adaptive behaviour.<\/p>\n<p><strong>Reinforcement learning<\/strong> (RL) [6] stands out as a powerful ML technique for training agents to achieve optimal behaviour in stochastic environments. RL agents learn by interacting with their environment: for every action they take, they receive context-specific rewards or penalties. Over time, they learn behaviour that maximizes the expected rewards throughout their runtime.<\/p>\n<p>RL agents can master a wide variety of complex tasks, from winning video games to controlling cyber-physical systems such as self-driving cars, often surpassing what expert humans are capable of. This optimal, efficient behaviour, however, if left completely unconstrained, may become off-putting or even dangerous to the humans it affects. 
This motivates the substantial research effort in <strong>safe RL<\/strong>, where specialised techniques are developed to ensure that RL agents meet specific safety requirements. These requirements are often expressed in formal languages like linear temporal logic <strong>(LTL)<\/strong>, which extends classical (true\/false) logic with temporal operators, allowing us to specify conditions like \u201csomething that must always hold\u201d or \u201csomething that must eventually occur\u201d. By combining the adaptability of ML with the precision of logic, researchers have developed powerful methods for training agents to behave both effectively and safely.<\/p>\n<p>However, safety isn\u2019t everything. Indeed, as RL-based agents are increasingly given roles that either replace or closely interact with humans, a new challenge arises: ensuring that their behaviour also complies with the <strong>social, legal and ethical norms<\/strong> that structure human society, which often go beyond simple constraints ensuring safety. For example, a self-driving car might perfectly follow safety constraints (e.g. 
avoiding collisions), yet still adopt behaviours that, while technically safe, violate social norms, appearing bizarre or rude on the road, which might cause other (human) drivers to react in unsafe ways.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.00.46.png\" alt=\"\" width=\"1170\" height=\"588\" class=\"alignnone size-full wp-image-18322\" srcset=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.00.46.png 1170w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.00.46-300x151.png 300w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.00.46-1024x515.png 1024w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.00.46-768x386.png 768w\" sizes=\"auto, (max-width: 1170px) 100vw, 1170px\"><\/p>\n<p>Norms are typically expressed as obligations (\u201cyou must do it\u201d), permissions (\u201cyou are permitted to do it\u201d) and prohibitions (\u201cyou are forbidden from doing it\u201d), which are not statements that can be true or false, like classical logic formulas. Instead, they are deontic concepts: they describe what is right, wrong, or permissible, i.e., ideal or acceptable behaviour, rather than what is actually the case. This nuance introduces several tricky dynamics into reasoning about norms, which many logics (such as LTL) struggle to handle. 
Even everyday normative systems like driving regulations can feature such problems; while some norms can be very simple (e.g., never exceed 50 kph within city limits), others can be more complex, as in:<\/p>\n<ol>\n<li>Always keep 10 meters between your vehicle and the vehicles in front of and behind you.<\/li>\n<li>If there are less than 10 meters between you and the vehicle behind you, you should <em>slow down<\/em> to put more space between yourself and the vehicle in front of you.<\/li>\n<\/ol>\n<p>(2) is an example of a <em>contrary-to-duty obligation<\/em> (CTD), an obligation you must follow specifically in a situation where another <em>primary<\/em> obligation (1) has already been violated, e.g., to compensate for or reduce damage. Although studied extensively in the fields of normative reasoning and deontic logic, such norms can be problematic for many basic safe RL methods based on enforcing LTL constraints, as was discussed in [4]. <\/p>\n<p>However, there <em>are<\/em> approaches to safe RL that show more promise. One notable example is the <strong>Restraining Bolt<\/strong> technique, introduced by De Giacomo et al. [2]. Named after a device used in the Star Wars universe to curb the behaviour of droids, this method influences an agent\u2019s actions to align with specified rules while still allowing it to pursue its goals. That is, the restraining bolt modifies the behaviour an RL agent learns so that it also respects a set of specifications. These specifications, expressed in LTLf [3], a variant of LTL interpreted over finite traces, are each paired with their own reward. 
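<\/p>
<p>As an illustration only, the snippet below sketches how a restraining bolt might attach a reward to a single specification. It is a minimal sketch under our own assumptions, not the construction from [2]: the Bolt class, the hand-built two-state automaton for the LTLf formula F(delivered) (eventually deliver), the fluent names and the reward of 5.0 are all hypothetical, and paying the reward only when the automaton first enters an accepting state is a simplifying choice; [2] gives the formal definition.<\/p>

```python
# Hypothetical sketch of a restraining-bolt reward, not the construction in [2].
# One LTLf specification, F(delivered) ('eventually deliver'), compiled by hand
# into a two-state automaton over the fluents observed at each step.


class Bolt:
    def __init__(self, delta, start, accepting, reward):
        self.delta = delta          # transition function: (state, fluents) to state
        self.start = start
        self.accepting = accepting  # set of accepting automaton states
        self.reward = reward        # reward paired with this specification
        self.state = start

    def reset(self):
        self.state = self.start

    def step(self, fluents):
        # Advance the automaton on the fluents true at this step and pay the
        # specification's reward when it first enters an accepting state.
        previous = self.state
        self.state = self.delta(previous, fluents)
        entered = self.state in self.accepting and previous not in self.accepting
        return self.reward if entered else 0.0


def deliver_delta(state, fluents):
    # State 0: not yet delivered. State 1: delivered (accepting, absorbing).
    return 1 if state == 1 or 'delivered' in fluents else 0


bolt = Bolt(deliver_delta, start=0, accepting={1}, reward=5.0)
trace = [set(), {'carrying'}, {'delivered'}, set()]
bonuses = [bolt.step(f) for f in trace]
print(bonuses)  # [0.0, 0.0, 5.0, 0.0]
```

<p>In a full agent, each step's bonus would be added to the environment's own reward, and the automaton state would typically be made part of what the agent observes, since the bonus depends on it.<\/p>
<p>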
The central idea is simple but powerful: along with the rewards the agent receives while exploring the environment, we add an additional reward whenever its actions satisfy the corresponding specification, nudging it to behave in ways that align with individual safety requirements. Assigning specific rewards to individual specifications allows us to model more complicated dynamics such as CTD obligations, by assigning one reward for obeying the primary obligation and a different reward for obeying the CTD obligation. <\/p>\n<p>However, issues with modeling norms persist; for example, many (if not most) norms are conditional. Consider the obligation stating \u201cif pedestrians are present at a pedestrian crossing, THEN the nearby vehicles must stop\u201d. If an agent were rewarded every time this rule was satisfied, it would also receive rewards in situations where the norm is not actually in force. This is because, in logic, an implication also holds when the antecedent (\u201cpedestrians are present\u201d) is false. As a result, the agent is rewarded whenever pedestrians are not around, and might learn to prolong its runtime in order to accumulate these rewards for effectively doing nothing, instead of efficiently pursuing its intended task (e.g., reaching a destination). In [5] we showed that there are scenarios where an agent will either ignore the norms or learn this \u201cprocrastination\u201d behaviour, no matter which rewards we choose. We therefore introduced <strong>Normative Restraining Bolts<\/strong> (NRBs), a step forward toward enforcing norms in RL agents. Unlike the original Restraining Bolt, which encouraged compliance by providing additional rewards, the normative version instead punishes norm <em>violations<\/em>. 
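<\/p>
<p>To make the procrastination problem concrete, the toy calculation below compares the two reward designs on the pedestrian-crossing norm. Everything in it is hypothetical and chosen purely for illustration (the discount factor 0.99, the task reward of 10, the +1 bonus, the -5 penalty and the two trajectories); it is not the experimental setup of [5]. On an empty road the implication holds vacuously at every step, so the bonus-based design pays more for wandering for 100 steps than for reaching the destination in 3, while the penalty-based design gives nothing extra for merely idling.<\/p>

```python
# Hypothetical toy comparison, not the experiments from [5]. Norm:
# 'if a pedestrian is at the crossing, the vehicle must stop'.
GAMMA = 0.99        # assumed discount factor
GOAL_REWARD = 10.0  # assumed task reward for reaching the destination


def holds(step):
    # Material implication: vacuously true when no pedestrian is present.
    return (not step['pedestrian']) or step['stopped']


def violated(step):
    return step['pedestrian'] and not step['stopped']


def bonus_reward(step):
    # Reward-for-compliance design: +1 at every step where the rule holds.
    return 1.0 if holds(step) else 0.0


def penalty_reward(step):
    # NRB-style design: nothing for compliance, -5 for every violation.
    return -5.0 if violated(step) else 0.0


def discounted_return(trajectory, norm_reward):
    total = 0.0
    for t, step in enumerate(trajectory):
        r = norm_reward(step) + (GOAL_REWARD if step['at_goal'] else 0.0)
        total += GAMMA ** t * r
    return total


def empty_road(n_steps, reaches_goal):
    # No pedestrian ever appears, so the norm is never actually in force.
    steps = [{'pedestrian': False, 'stopped': False, 'at_goal': False}
             for _ in range(n_steps)]
    if reaches_goal:
        steps[-1]['at_goal'] = True
    return steps


direct = empty_road(3, reaches_goal=True)      # drives straight to the goal
loiter = empty_road(100, reaches_goal=False)   # wanders and never arrives

for name, fn in [('bonus', bonus_reward), ('penalty', penalty_reward)]:
    print(name, round(discounted_return(direct, fn), 2),
          round(discounted_return(loiter, fn), 2))
```

<p>Under the bonus design the never-arriving trajectory scores higher than the direct one; under the penalty design the direct trajectory wins, because idling earns nothing and only actual violations (a pedestrian present and no stop) are sanctioned.<\/p>
<p>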
This design is inspired by the Andersonian view of deontic logic [1], which treats obligations as rules whose violation necessarily triggers a sanction. Thus, the framework no longer relies on reinforcing appropriate behaviour, but instead enforces norms by ensuring that violations carry tangible consequences. While effective for managing intricate normative dynamics like conditional obligations, contrary-to-duties, and exceptions to norms, NRBs rely on trial-and-error reward tuning to implement norm adherence, and can therefore be unwieldy, especially when trying to resolve conflicts between norms. Moreover, they require retraining to accommodate norm updates, and do not lend themselves to guarantees that optimal policies minimize norm violations.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Our_contribution\"><\/span>Our contribution<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Building on NRBs, we introduce <a href=\"https:\/\/openreview.net\/pdf?id=SBiYp7vTEw\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\"><strong>Ordered Normative Restraining Bolts<\/strong> (ONRBs)<\/a>, a framework for guiding reinforcement learning agents to comply with social, legal, and ethical norms while addressing the limitations of NRBs. In this approach, each norm is treated as an objective in a multi-objective reinforcement learning (MORL) problem. 
Reformulating the problem in this way allows us to:<\/p>\n<ul>\n<li>Show that when norms do not conflict, an agent that learns optimal behaviour will minimize norm violations over time.<\/li>\n<li>Express relationships between norms in terms of a ranking system describing which norm should be prioritized when a conflict occurs.<\/li>\n<li>Use MORL techniques to algorithmically determine the necessary magnitude of the punishments we assign such that <em>it is guaranteed that, as long as an agent learns optimal behaviour, norms will be violated as little as possible, prioritizing the norms with the highest rank.<\/em><\/li>\n<li>Accommodate changes in our normative systems by \u201cdeactivating\u201d or \u201creactivating\u201d specific norms.<\/li>\n<\/ul>\n<p>We tested our framework in a grid-world environment inspired by strategy games, where an agent learns to collect resources and deliver them to designated areas. This setup allows us to demonstrate the framework\u2019s ability to handle the complex normative scenarios we noted above, including direct prioritization of conflicting norms and norm updates. 
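<\/p>
<p>The idea of ranking norms can be illustrated with a small, hypothetical example loosely modelled on the grid-world scenario. This is a simplified sketch under our own assumptions, not the ONRB algorithm from the paper: the candidate policies and their violation counts are made up, priorities are resolved by plain lexicographic comparison, and the penalty magnitudes come from a generic big-M style bound (each penalty exceeds the largest total swing that lower-ranked norms and the task reward can produce), which only works if violation counts and rewards are bounded. Dropping a norm from the ranking plays the role of deactivating it.<\/p>

```python
# Hypothetical sketch, not the ONRB algorithm from the paper.
# Norms are listed from highest to lowest priority; all numbers are made up.
RANKING = ['reach_market_by_deadline', 'avoid_dangerous_areas']

# Assumed outcome of each candidate policy: violations per norm, plus the
# resources (task reward) it collects.
CANDIDATES = {
    'stay_at_start': {'reach_market_by_deadline': 1, 'avoid_dangerous_areas': 0, 'resources': 0},
    'cross_once':    {'reach_market_by_deadline': 0, 'avoid_dangerous_areas': 1, 'resources': 3},
    'cross_twice':   {'reach_market_by_deadline': 0, 'avoid_dangerous_areas': 2, 'resources': 5},
}
MAX_VIOLATIONS = {'reach_market_by_deadline': 1, 'avoid_dangerous_areas': 2}


def lexicographic_key(outcome, ranking):
    # Tuples compare element by element: fewer violations of a higher-ranked
    # norm always win, and resources only break the remaining ties.
    return tuple(outcome[n] for n in ranking) + (-outcome['resources'],)


def best_policy(candidates, ranking):
    return min(candidates, key=lambda p: lexicographic_key(candidates[p], ranking))


def penalty_weights(ranking, max_violations, reward_span):
    # Big-M style magnitudes, from the lowest-ranked norm upwards: each penalty
    # exceeds the largest swing that all lower-ranked norms plus the task
    # reward could produce together.
    weights, budget = {}, reward_span
    for norm in reversed(ranking):
        weights[norm] = budget + 1.0
        budget += weights[norm] * max_violations[norm]
    return weights


def scalarized(outcome, weights):
    # Single reward signal: resources minus weighted violations of active norms.
    return outcome['resources'] - sum(w * outcome[n] for n, w in weights.items())


print(best_policy(CANDIDATES, RANKING))                       # cross_once
print(best_policy(CANDIDATES, ['reach_market_by_deadline']))  # cross_twice
weights = penalty_weights(RANKING, MAX_VIOLATIONS, reward_span=5.0)
print(max(CANDIDATES, key=lambda p: scalarized(CANDIDATES[p], weights)))  # cross_once
```

<p>With both norms active, the sketch picks the policy that enters the dangerous area once, mirroring the behaviour described for the figure; with the lower-ranked norm deactivated, the resource-maximizing policy wins instead.<\/p>
<p>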
For example, the figure below<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31.png\" alt=\"\" width=\"2366\" height=\"590\" class=\"alignnone size-full wp-image-18323\" srcset=\"https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31.png 2366w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31-300x75.png 300w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31-1024x255.png 1024w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31-768x192.png 768w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31-1536x383.png 1536w, https:\/\/aihub.org\/wp-content\/uploads\/2025\/08\/Screenshot-2025-08-29-at-16.04.31-2048x511.png 2048w\" sizes=\"auto, (max-width: 2366px) 100vw, 2366px\"><\/p>\n<p>shows how the agent handles norm conflicts when it is obligated both to (1) avoid the dangerous (red) areas, and (2) reach the market (blue) area by a certain deadline, supposing that the second norm takes precedence. We can see that it chooses to violate (1) once, because otherwise it would be stuck at the beginning of the map, unable to fulfil (2). However, when given the opportunity to violate (1) once more, it chooses the compliant path, even though the violating path would allow it to collect more resources, and therefore more rewards from the environment.<\/p>\n<p>In summary, by combining RL with logic, we can build AI agents that don\u2019t just work, they work right.<\/p>\n<p>This work won a <strong><em>distinguished paper award at IJCAI 2025<\/em><\/strong>. 
Read the paper in full: <a href=\"https:\/\/openreview.net\/pdf?id=SBiYp7vTEw\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\">Combining MORL with restraining bolts to learn normative behaviour<\/a>, Emery A. Neufeld, Agata Ciabattoni and Radu Florin Tulcan.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Acknowledgements\"><\/span>Acknowledgements<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>This research was funded by the Vienna Science and Technology Fund (WWTF) project ICT22-023 and the Austrian Science Fund (FWF) 10.55776\/COE12 Cluster of Excellence Bilateral AI.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"References\"><\/span>References<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>[1] Alan Ross Anderson. A reduction of deontic logic to alethic modal logic. <em>Mind<\/em>, 67(265):100\u2013103, 1958.<\/p>\n<p>[2] Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. <a href=\"https:\/\/arxiv.org\/abs\/1807.06333\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\">Foundations for restraining bolts: Reinforcement learning with LTLf\/LDLf restraining specifications<\/a>. In <em>Proceedings of the International Conference on Automated Planning and Scheduling<\/em>, volume 29, pages 128\u2013136, 2019.<\/p>\n<p>[3] Giuseppe De Giacomo and Moshe Y. Vardi. <a href=\"https:\/\/www.cs.rice.edu\/~vardi\/papers\/ijcai13.pdf\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\">Linear temporal logic and linear dynamic logic on finite traces<\/a>. In <em>IJCAI<\/em>, volume 13, pages 854\u2013860, 2013.<\/p>\n<p>[4] Emery Neufeld, Ezio Bartocci, and Agata Ciabattoni. <a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-031-21203-1_5\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\">On normative reinforcement learning via safe reinforcement learning<\/a>. 
In <em>PRIMA<\/em> 2022, 2022.<\/p>\n<p>[5] Emery A. Neufeld, Agata Ciabattoni, and Radu Florin Tulcan. <a href=\"https:\/\/www.logic.at\/staff\/agata\/jurix2024.pdf\" data-wpel-link=\"external\" target=\"_blank\" rel=\"follow external noopener noreferrer\">Norm compliance in reinforcement learning agents via restraining bolts<\/a>. In <em>Legal Knowledge and Information Systems JURIX 2024<\/em>, pages 119\u2013130. IOS Press, 2024.<\/p>\n<p>[6] Richard S. Sutton and Andrew G. Barto. Reinforcement learning \u2013 an introduction. <em>Adaptive computation and machine learning<\/em>. MIT Press, 1998.<\/p>\n<hr class=\"xh2\"\/>\n<div class=\"pdxv  printshow\">\n<p><img decoding=\"async\" src=\"https:\/\/robohub.org\/wp-content\/uploads\/2025\/09\/Screenshot-2025-08-29-at-16.31.33-150x150-1.png\" class=\"grayscale\" style=\"float:left; margin: 0px 1.5em 0em 0px;  width:100px  \" alt=\"\"><\/p>\n<p>Agata Ciabattoni is a Professor at TU Wien.<\/p>\n<\/div>\n<hr class=\"xh2\"\/>\n<div class=\"pdxv  printshow\">\n<p><img decoding=\"async\" src=\"https:\/\/robohub.org\/wp-content\/uploads\/2025\/09\/Screenshot-2025-08-29-at-16.34.35-150x150-1.png\" class=\"grayscale\" style=\"float:left; margin: 0px 1.5em 0em 0px;  width:100px  \" alt=\"\"><\/p>\n<p>Emery Neufeld is a postdoctoral researcher at TU Wien.<\/p>\n<\/div>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Image provided by the authors \u2013 generated using Gemini. For many of us, artificial intelligence (AI) has become part of everyday life, and the rate at which we assign previously human roles to AI systems shows no signs of slowing down. 
AI systems are the essential components of many [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":13629,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[],"class_list":["post-13627","post","type-post","status-publish","format-standard","has-post-thumbnail","category-robotics"],"_links":{"self":[{"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/posts\/13627","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13627"}],"version-history":[{"count":1,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/posts\/13627\/revisions"}],"predecessor-version":[{"id":13628,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/posts\/13627\/revisions\/13628"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=\/wp\/v2\/media\/13629"}],"wp:attachment":[{"href":"https:\/\/aireviewirush.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13627"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13627"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aireviewirush.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13627"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}