Robert Goulder of Tax Notes and Benjamin Alarie and Susan Massey, both of Blue J Legal, discuss the benefits and drawbacks of using generative AI models to assist with complicated tax questions and research.
Robert Goulder: Hello and welcome to the latest edition of In the Pages. I’m Bob Goulder with Tax Notes. Today we’re going to explore an issue that has been all over the news lately: artificial intelligence.
Our featured article asks how AI’s going to affect the future of legal research. Let’s say you’re a tax attorney, an accountant, enrolled agent, etc., in the future: Are you going to do your legal research the old-fashioned way, or are you going to present your research inquiry to an AI device and rely on whatever response you get back?
Now, I do not know the answer to that question, but it’s a great question. It raises all sorts of issues. So the featured article is titled “The Rise of Generative AI in Tax Research,” and it has four coauthors; they are Benjamin Alarie, Kim Condon, Susan Massey, and Christopher Yan.
And we are delighted to have two of those coauthors with us today: Ben Alarie and Susan Massey.
Welcome to In the Pages.
Benjamin Alarie: Thanks, Bob.
Susan Massey: Thank you.
Robert Goulder: As a preliminary matter, I should mention your professional affiliations. Ben wears two hats. He is a professor of business law at the University of Toronto. He’s also an entrepreneur, CEO of Blue J Legal based in Toronto. Ms. Massey is vice president of legal research with Blue J Legal.
And with that, let’s get to our first question. Ben, for the benefit of people out there listening who lack a tech background, how would you describe generative AI? And what exactly is a chatbot? Off the top of my head, how different is that from just a computer that can play chess?
Benjamin Alarie: So let me say, this is a great question, Bob. It’s a pleasure to be on the show and talk about this with you. I’m going to start with talking about generative AI. It is in the title of the article. Generative AI is a new kind of technology. It’s really come to prominence in the last, I would say, two to three years with systems that are able to generate new artifacts like images, text, audio, video, often simply from a text-based prompt. How this works is you enter some text and it can create new examples that mimic training data that the system has been exposed to.
For example, generative AI systems exist that can generate synthetic new photos of realistic looking human faces. Or systems can generate new poems in the style of Shakespeare. Generative AI generally uses machine learning coupled with neural networks to achieve this creative generation. In essence, generative AI is making up new stuff that resembles the training data.
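The "mimicking the training data" idea can be sketched with a toy next-word model. This is only a hedged illustration of the statistical intuition, not how a real large language model works (those use neural networks, not word counts), and the tiny training corpus here is invented for the example:

```python
import random
from collections import defaultdict

def train_bigram_model(corpus: str) -> dict:
    """Record, for each word, which words follow it in the training text."""
    words = corpus.split()
    follows = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        follows[prev].append(nxt)
    return follows

def generate(model: dict, start: str, length: int = 8, seed: int = 0) -> str:
    """Sample a new word sequence that statistically resembles the training data."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = model.get(out[-1])
        if not options:
            break  # no observed continuation; stop generating
        out.append(rng.choice(options))
    return " ".join(out)

# Invented miniature corpus, purely for illustration.
corpus = ("the taxpayer filed a return the taxpayer claimed a credit "
          "the service denied the credit")
model = train_bigram_model(corpus)
print(generate(model, "the"))
```

The output is "new" text, yet every transition in it was seen in the training data — the same resemblance-without-copying property Alarie describes, at a vastly smaller scale.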
So that’s the idea of generative AI. We’re going to explore this throughout the conversation.
A chatbot is something that I think is kind of self-explanatory to some extent. It’s an automated process that you can have a conversation with, and typically it’s text-based. You enter some prompt to the chatbot, the chatbot will reply with text in kind, and then you can reply to the chatbot and carry on a conversation. Most of these chatbots now are using generative AI to generate their responses. So they’re drawing on a large pool of information about past conversations, about scraped language from the internet, for example, and it’s generating realistic conversations that you can carry on with the chatbot. So really focusing on that conversational aspect.
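The prompt-reply loop described above can be sketched in a few lines. The `reply` function here is a stand-in for the generative model; a real chatbot would generate text conditioned on the accumulated conversation, and everything in this sketch is illustrative:

```python
def reply(history: list[str]) -> str:
    # Stand-in for a generative model: a real chatbot would generate
    # text conditioned on the whole conversation so far.
    return f"[reply #{len(history) // 2 + 1} to: {history[-1]}]"

def chat(user_messages: list[str]) -> list[str]:
    """Run a turn-by-turn conversation, keeping the full history."""
    history: list[str] = []
    replies = []
    for msg in user_messages:
        history.append(msg)          # user turn joins the history
        answer = reply(history)      # model sees all prior turns
        history.append(answer)       # model turn joins the history too
        replies.append(answer)
    return replies
```

The key point the sketch captures is statefulness: each reply is produced from the entire history, which is what lets the conversation "carry on" rather than treating every prompt in isolation.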
And then you asked about differences from game-playing AI, so chess engines that have been around since Deep Blue beat [Garry] Kasparov in 1997. This is very familiar. We’ve got the cultural reference for this kind of game-playing AI. I think artificial intelligence has come a very long way from the 1990s with the very first really effective chess engines. The difference with generative AI, as opposed to these chess engines, is that generative AI is creating something new, and it’s not the traditional game-playing AI where there are hard-coded rules and then the system is responding to those rules. There’s something deeper and more nuanced that requires a lot more computational power to develop and inform that initial model.
There are, of course, game-playing AIs that are based on generative AI as well. But generative AI is not necessary to produce a chess-playing system. And we could geek out on these game-playing AIs and what they’re capable of doing. But suffice it to say, at this point, the very best chess-playing AIs can easily defeat the world’s strongest players. And I think we’re going to see the same sort of performance improvements in the future generations of generative AI that will be coming on stream.
So I think the next few decades are going to see vast improvements in generative AI, just like we’ve seen in those chess engines over the past 30 years.
Robert Goulder: I see. Susan, who can use generative AI? And how user-friendly is it? In my lead-in, I talked about attorneys, accountants, and enrolled agents using it. Let’s say those people have never been trained in IT; they can’t program at all. Are they still able to use it?
Susan Massey: Absolutely. And really to me, that’s the beauty of it. I don’t know if people have taken a look at ChatGPT, which is available for free. But you just type your questions to it, you can put in other prompts, and a lot of the generative AI that’s happening right now, it works the same way.
What I would say — and you’ve seen articles about prompt engineering — is that it’s really important to figure out what kind of information you actually want to get out of it and to ask for it explicitly. But this is the same sort of thing people adapted to when they had to learn how to Google. You can learn how to ask it for what you want and get the feedback that you want.
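Massey’s point about asking explicitly can be illustrated with a small helper that turns a vague question into a structured prompt. The field names below are invented for the sketch; nothing here is a required format for any particular chatbot:

```python
def build_prompt(question: str, jurisdiction: str, answer_format: str) -> str:
    """Spell out what you want, about what, and in what shape."""
    return (
        f"Question: {question}\n"
        f"Jurisdiction: {jurisdiction}\n"
        f"Desired format: {answer_format}\n"
        "Cite the specific authority supporting each claim."
    )

# A vague prompt versus an explicit one for the same underlying question.
vague = "754 election?"
explicit = build_prompt(
    "When does a section 754 election take effect for a partnership?",
    "U.S. federal",
    "a short answer followed by a list of cited authorities",
)
print(explicit)
```

The explicit version tells the model what is being asked, under which law, and in what shape the answer should come back — the same skill, Massey notes, that people developed for search engines.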
Robert Goulder: I see. Now, your article in Tax Notes, what you do there is really interesting. You experiment with some established chatbots, asking the same research question to each of them. And frankly, there were some real differences in the responses that you got back. Can you talk about these tools that you sampled?
Benjamin Alarie: Sure. So there were three different tools that we interacted with, Bob, in producing the research that went into the current article. The first two were different versions of ChatGPT. And one thing I’ll add on to the last question is ChatGPT is, as Susan said, really easy to use. It launched in November of last year, so November 30, 2022. By the end of January, it had reached a hundred million active users on the platform; it’s the fastest growing app ever. It really is intuitive to use, and people were having a lot of fun playing with it, mostly trying to break it if we’re honest, and typing in prompts to try to trick it or to make it say unusual or offensive things. I think a lot of people had that kind of experience and interaction with it, but it did grow extremely quickly, and so [it is] really easy to use.
We were leveraging ChatGPT in two of its versions. One was the original 3.5 — it’s called GPT-3.5 — which is the underlying foundational model that ChatGPT launched with and then also the newer GPT-4 version of ChatGPT was the second one from OpenAI. The third one is Ask Blue J, which is our own version of a conversational AI that is specially trained to do tax research.
So, what’s common about them is that they’re all using foundational models trained on vast amounts of information. What’s different is that the two ChatGPT tools are general-purpose tools capable of carrying on many different kinds of conversations, while Ask Blue J is really tailored to provide tax information and advanced tax research solutions.
What we found is that Ask Blue J generally provided more relevant, accurate, and useful responses compared to the two ChatGPT tools. And that’s not surprising, because it’s a specialized model trained specifically on tax law to do that, whereas ChatGPT has to be ready for any kind of question it might be asked.
ChatGPT had some limitations: it provided some incorrect information, it hallucinated some answers, and it doesn’t provide transparency about its sources. In our view, that really limits its reliability for tax practitioners. Ask Blue J aims to mitigate these issues by focusing on tax training and careful curation of the information that it’s trained on.
Robert Goulder: Yeah, I think curation is probably the key concept there because when I was reading through your article — and I’m looking at the question and the different responses that you got — it did seem like the ones from your chatbot were completely different, something I’d probably be more likely to find useful for professional purposes.
But you mentioned these pitfalls, hallucinations, and so forth. Susan, what are these pitfalls? AI hallucinating, that just seems like a scary concept. What’s going on there? What are the cautions that we all need to keep in mind about this?
Susan Massey: Right. I think there are a few cautions, and Ben touched on them a little bit. So obviously, it can give you inaccurate information. It might just get something wrong. Maybe it pulls information that is of a similar phrasing to what you put in. It’s not quite the right answer, but it’s not completely fabricated.
The fabrication is what I find most concerning, what we call “hallucination.” You’ll see it in one part of our paper where we asked for sources, and one of the chatbots actually gave us back a list in which it had completely fabricated both the citations and the fact patterns for these PLRs (private letter rulings). That’s something that would be very concerning if you’re doing research.
I even heard a story of somebody who had submitted an article to a journal for review, and the reviewer who returned it said that certain sources were missing. Upon investigation, it turned out that the reviewer had been using ChatGPT and the sources did not exist. That’s something you really need to be careful with in your professional life. Obviously, you want to make sure that you’re using accurate sources when you’re doing research. That would be horrifying. So that’s incredibly important to keep in mind.
When you are a practitioner and you’re using one of these tools, this is a research tool; it’s surfacing information for you. You are still responsible for checking it out and confirming its accuracy. The lack of transparency in the origination of your information is a huge problem with using GPT for any sort of research. It’s like if you’re in Google and you just pull anything from the web without checking your sources. It’s actually kind of worse because when you’re in Google, you can eventually get to where the original source is and see what it is. Whereas with ChatGPT, it tells you something, [but] you don’t know where it came from; maybe it came from Reddit, who knows? That’s something that we’ve certainly tried to address with our tools.
I think your second question was whether we can correct it. The way we’ve done that is we’ve actually trained our tool so that it’s not allowed to give information like that. With other generative AI models, if the model can’t answer the question, it will search further and further to come up with anything to create an answer. If ours runs into a situation where it’s unable to find an answer in the data provided to it, it has to return an answer saying that it doesn’t know the answer.
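The “answer only from the provided data, otherwise say so” behavior Massey describes can be sketched as a retrieval step with a relevance threshold. The word-overlap scoring below is naive and purely illustrative — a production system like Ask Blue J would work very differently — and the corpus entries are made up:

```python
def relevance(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def grounded_answer(query: str, corpus: dict[str, str],
                    threshold: float = 0.5) -> str:
    """Answer only from the curated corpus; refuse rather than invent."""
    best_id, best_score = None, 0.0
    for doc_id, text in corpus.items():
        s = relevance(query, text)
        if s > best_score:
            best_id, best_score = doc_id, s
    if best_id is None or best_score < threshold:
        return "I don't know the answer based on the available sources."
    return f"{corpus[best_id]} [source: {best_id}]"

corpus = {  # invented entries for illustration only
    "doc-1": "a section 754 election lets a partnership adjust asset basis",
    "doc-2": "the economic substance doctrine can invalidate tax benefits",
}
print(grounded_answer("section 754 election", corpus))
print(grounded_answer("crypto staking income", corpus))
```

The design point is the refusal branch: when nothing in the curated material clears the threshold, the system returns “I don’t know” instead of generating a plausible-sounding fabrication, and every answer it does give carries a pointer back to its source.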
Robert Goulder: Now I want to talk about sourcing, Ben. If there’s one thing lawyers are weird about, it’s cites: they’re very fastidious about getting everything in the proper form. A citation can’t just say, “Oh, the Tax Court decided this.” It has to say when and where and how, and whether it was affirmed on appeal, and so forth. Can AI deal with the kind of sourcing that tax professionals are going to expect?
Benjamin Alarie: I think this is an extremely important issue. You’re absolutely right. Especially even among lawyers, tax lawyers are famous for putting an extra special premium on precision. So this is exactly the right question to ask.
A number of current chatbots have serious limitations in this area, but it is totally possible to design a system that’s grounded only on trusted sources of information. The key is to make sure that you identify what are the relevant sources and ground the model so that it’s only making reference to those authoritative sources.
A ton of the work that we’ve done with Ask Blue J is identifying the relevant set of source materials that are authoritative and are going to be persuasive. So we’re talking about the regs; we’re talking about the Internal Revenue Code itself; we’re talking about various rulings; we’re talking about trusted commentary. These are the kinds of things that should qualify as relevant sources for the system to make reference to. And when it makes reference to them, it needs to provide accurate citations.
And so the idea is, let’s get it so that it can produce a reliable version of the right citation so that every time, someone can go and pull up that reference document. Within Ask Blue J, we have all of the relevant materials in the platform. So if you’re having a conversation with Ask Blue J and it gives you an answer, it will identify the source, and you can click on that source and pull up the source document and read it in line. So that’s very important. It’s very comforting to a researcher to be able to see that source material and see the highlighted passage from that source material that’s the basis for that answer. The links are really important I think — being able to see that right in line and not having to go on a wild goose chase to try to find some potentially hallucinated source.
I had a student in one of my classes in January approach me, and she said, “Professor Alarie, I was doing this research; I was using ChatGPT. I found the perfect case to illustrate a point that I was making — and this is for another class, not for your class — and I asked a bunch of questions about the case. I was so excited because the more I asked about this case, the more ChatGPT was describing the perfect facts for the point that I wanted to argue. And then I asked for the citation. It provided a citation. And I went on to Westlaw, and I couldn’t find it on Westlaw. Then I went on LexisNexis; I couldn’t find it on LexisNexis. I went to Google; I couldn’t find it on Google. Then I went to the law librarian at the law school and asked her, ‘Can you help me find this case?’ And the law librarian tried all three of those things, and then she said, ‘This case doesn’t exist. This is completely hallucinated.'”
So having that link to the source is also extremely comforting and reassuring. And my student didn’t understand yet at that point that these systems can just completely hallucinate sources. And so, just to double down on Susan’s response, it really can happen and it looks totally plausible, and it looks like this could totally be a private letter ruling of this title that has this information in it and it doesn’t exist. So it really is important to curate the materials.
And then again, I would also reiterate limiting the system to looking at those verified sources is really important to make sure that it doesn’t confabulate — it’s not just inventing answers. Because going back to the original question you asked, what is generative AI? Generative AI is a system that’s going to produce language, and if you prompt it in a particular way, it’s going to produce statistically plausible language in response to your prompt. The big work that we’re doing is getting those systems [trained] so that they’re only going to make reference to authoritative materials. And that’s where their power can really be harnessed to do useful work.
Robert Goulder: Let’s try to wrap this up. Ben, a question for you: Where does this end? What’s the next wave? I mean, all this discussion has been premised on this idea of question and answer, right? There’s the chatbot, and I’m giving it an input, and it’s coming up, like you say, with language. What’s next? What goes beyond that? Is there a whole other level of functionality that we can attain through AI, maybe not now, but in the future, looking ahead?
Benjamin Alarie: I think the sky’s the limit with what we can ask these systems to do in terms of analysis and preparing language and those sorts of things. I think there are some limitations, and I can talk about those too. One scenario that I use to describe what the future looks like for this kind of technology goes as follows:
Let’s suppose there’s a large corporation; it wants to acquire a particular other corporation. So we’ve got an acquirer and a target, and the management of the acquirer goes to the M&A (mergers and acquisitions) firm that’s going to help them with this — suppose it’s a big accounting firm — and they say, “We want to acquire this target Y.” In the future, it could be that that M&A practitioner goes to a system like Ask Blue J and says, “Okay, acquirer X wants to acquire target Y. And I’m going to make available all the tax and financial information about the acquirer. I’m going to make available all of the tax and financial information about the target. And I’m going to ask the system to present different scenarios for achieving this tie-up, achieving this acquisition.”
The system could work away for a bit and say, “Okay, here’s the plain-vanilla way to accomplish Y’s acquisition by X. And here are the tax results. Here are all the attributes; here’s all the metadata. Here are the steps. Here’s what you’d have to do to accomplish that. On the flip side, here is the most aggressive way that X could acquire Y from a tax perspective. It involves this very circuitous way of doing this with many more steps. And by the way, this involves a lot of tax risk because this may not pass muster with the economic substance doctrine or the step transaction doctrine. This is pretty risky, but if you pursued this and were successful” — and the system may say this is a 10 percent likelihood of being actually successful if it’s challenged — “here are the tax results.” And they’re phenomenal, massive tax savings, whatever. It’s found all sorts of efficiencies there from a tax perspective.
And then finally, “Here’s the Goldilocks scenario. Here’s where X acquires Y, and here’s the best trade-off between the tax advantages and how defensible the transaction is. Here are the steps, and here are a series of diagrams. We’ll do the step diagram, show you exactly how this is going to go. Here’s what has to happen, and we recommend, the system recommends, this Goldilocks transaction.”
Then it goes to the practitioner, who says, “Okay, what about this? What about that?” [and] introduces a few additional facts that may change the analysis, and basically has a conversation with the system about how to optimize this from a tax perspective. Then the practitioner carries on and implements all of the stuff that needs to happen, maybe assisted by a different AI tooling system that can help with document review, document preparation, and all of the different steps, completing forms potentially automatically, again with human review. Ideally, at the end of the day, this accelerates things very significantly, and all of this analysis can be provided to the relevant tax authorities in a way that’s totally auditable and can be reviewed by the tax authorities’ AI, just to make sure that everything passes muster.
You might say then, “Well Ben, if it’s going to be so automated, what is the future role of the M&A professional or the tax professional? Why even have them? Why not just have this self-executing? Why can’t the business people just get together and do this deal without the intermediaries?” I think the world is quite messy. We’re always encountering things that we haven’t seen before. Also, if you’re going to be engaging in a very significant transaction, business people want to make sure that experts are handling it.
So even if I can look up online, “How to perform some kind of surgery,” and I could get all that information about how to perform a surgery and maybe there are some great tools I could rent to perform surgery on my child, there’s no way I’m going to perform surgery on my child. I am going to go and take them to a surgeon. Even if the surgeon says, “Medical technology has progressed to the point where the surgery is completely automated, and I’m going to be here on standby just in case I’m needed, but I haven’t had to intervene for the last 422 operations of this kind.” I’d say, “That’s fantastic. I’m so glad you’re here. Go do this thing, and if something does go wrong, you’re going to be there to make sure things go right.” I’m going to trust that surgeon. I’m going to go [to] that surgeon every time.
And I think it’s the same way with important legal transactions that folks are doing. They’re going to want to know that somebody responsible is taking that on, somebody who understands the information necessary to do this but who is also taking responsibility: if something unexpected is encountered, they’re going to be able to solve it. You’re giving it to the person who is extremely well situated, an expert on that topic, to take it through and make it happen.
I think it’s too easy to jump to the conclusion that these systems are going to overtake humans. I really think of them as a very powerful ally in performing the work that we do as tax professionals.
Robert Goulder: I’m glad you mentioned that, because that’s what’s in the back of people’s minds: Am I going to become professionally obsolete? I worked hard to get these credentials to be an attorney or an accountant, and what if the client just uses AI and cuts me out of the loop? I know people are thinking about that.
Susan, I’ll give you the last word. Anything to add to that? I mean, you used to work for the IRS, right? Isn’t the Service going to have their own AI?
Susan Massey: I wouldn’t count on it, not anytime soon. But I mean, I hope so for my friends. But I don’t think that practitioners will become obsolete. I agree with Ben.
This is going to be a really powerful tool, and people who adopt it sooner are going to find that they make great leaps in how fast they’re able to do things, and in how quickly and thoroughly they’re able to research and resolve the issues before them.
I think that it also gives you the opportunity to learn broader areas as well, the areas that touch on your particular topic. So if you’re a specialist in a particular area and you’re wondering, “Oh, how does that [section] 754 election work?” then that’s something that you’re able to look up, and you get some knowledge on it.
I think practitioners will find this to be a very powerful tool. I don’t think they will find that it is something that replaces them.
Robert Goulder: Let’s hope so. And there you have it. Once again, the featured article is titled “The Rise of Generative AI in Tax Research.” You can find it in Tax Notes. The coauthors are Benjamin Alarie, Kim Condon, Susan Massey, and Christopher Yan. Ben and Susan, thank you for being on the program.
Benjamin Alarie: Thanks for having us, Bob.
Susan Massey: Thank you.
Robert Goulder: Thanks. Until next time. Bye-bye.