Clean up legal text dataset for machine learning

Legal Text Corpus:

I want to clean up the text in the caselaw dataset so that it contains only continuous text, without citations, headers, and other metadata. The goal is to publish the cleaned dataset for training a large language model.

One dataset will be continuous text. The other will be a dataset of citations, in which each cited passage is paired with a special token such as <|#FFFFFFFF|> that points to a citation to a particular reporter and page. That way, a machine learning model can take text as input and output the most likely relevant citations, or autocomplete the most likely continuation of the sentence or paragraph.


would be transformed into

{"well-defined and narrowly limited classes of speech, the prevention and punishment of which have never been thought to raise any constitutional problem": "<|#FFFFFFFF|>"}

{"<|#FFFFFFFF|>": "Chaplinsky v. New Hampshire, 315 U.S. 568, 571-572"}

Where "<|#FFFFFFFF|>" is the vector embedding of the citation "Chaplinsky v. New Hampshire, 315 U.S. 568, 571-572"
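A minimal sketch of that split, under some loud assumptions: it only recognizes citations shaped like "Party v. Party, 315 U.S. 568, 571-572", it numbers tokens sequentially instead of deriving them from an embedding, and the names `CITATION_RE` and `split_citations` are made up for illustration.

```python
import re

# Assumption: only "<Party> v. <Party>, <vol> U.S. <page>[, <pin cite>]" is handled.
# Real opinions use many more reporters and citation shapes.
CITATION_RE = re.compile(
    r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)* v\. [A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*"
    r", \d+ U\.S\. \d+"
    r"(?:,? \d+(?:-\d+)?)?"
)


def split_citations(text):
    """Return (passages, citations).

    passages  -- list of (passage_text, token) pairs
    citations -- dict mapping token -> citation string
    """
    passages = []
    citations = {}
    last_end = 0
    for i, match in enumerate(CITATION_RE.finditer(text)):
        # Sequential 8-digit hex id as a stand-in for the real token scheme.
        token = f"<|#{i:08X}|>"
        # The passage is whatever precedes the citation since the last one.
        passage = text[last_end:match.start()].lstrip(" .;,").rstrip()
        passages.append((passage, token))
        citations[token] = match.group(0)
        last_end = match.end()
    return passages, citations
```

Calling `split_citations` on a paragraph returns the passage/token pairs for the first dataset and the token/citation table for the second. A capitalized word directly before the case name (for example "See") would be swallowed into the citation, which is one of the reasons a couple of regexes are not enough on their own.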

Grant Deliverables:

  • 2 cleaned datasets, ready to be used for training.

Endomorphosis aka Benjamin Jay Barber

I am not going to make a video, because I have nothing more to say on the matter, and I can only put two links in this post, and nobody wants to look at me talk considering I’ve been shot in the face.


Personally, I admire the approach of improving a valuable open-source caselaw dataset.

This could certainly provide a foundation for further applications. According to Daniel Faggella, current applications of AI in law appear to fall into six major categories: due diligence, prediction technology, legal analytics, document automation, intellectual property, and electronic billing.

In my opinion, prediction technology and legal analytics are the most appropriate in the context of the proposal.

Tangentially, chatbot applications are a popular use of NLP among businesses.

This sounds like an interesting idea. IMHO it would help if you provided more details / boiled it down for folks unfamiliar with the dataset and general area of analysing publicly available dockets, including myself.

From what I understand, the link you’ve provided directs me to a REST API for a bunch of different data provided by the various endpoints offered by Court Listener. Having looked up the Free Law project (free[dot]law) and the Case Law site (case[dot]law) I found GitHub - harvard-lil/cap-examples: Examples for getting started using which I am assuming you’re already aware of. They seem to provide the text from some dockets already. I think I may be misunderstanding your goals so I wanted to ask if it could be made clearer.

For instance, are you saying that the Search | Caselaw Access Project and bulk data download + a few regex filters would not provide what you are aiming to provide (since – iiuc – it is a specific jurisdiction only)? I’d like to understand your goal(s) so that the focus could be on adding to what’s already out there instead of reinventing the wheel in case there is already a way to access the processed text.

If you’d like to provide more links, maybe share a doc or a page with all the references provided so that it is clear what the goals for the project are, I’d love to read more about it. Thanks for the work you’re doing!

The Free Law Project has access to many more documents, which are all taken from the PACER court document database, and additionally it has court opinions for every single common law jurisdiction, including state courts, patent courts, and the English courts. For example, here is a copy of the documents from the docket for unsealing the search warrant on Donald Trump's house: United States v. Sealed Search Warrant, 9:22-mj-08332

However, it's not just a matter of using a couple of regexes to process all of the data; otherwise the text generation will be noisy and will insert random people's names and references to documents that don't exist. With regard to the court opinions, individuals' names need to be redacted and replaced with generic role words like defendant, friend, etc., and citations to documents and procedural history need to be removed, because they will obviously not exist for the purposes of autocompletion. In addition, the headers and other metadata need to be removed, the citations need to be extracted, and the footnotes need to be extracted and placed inline with the main body text, depending on whether each one is an explanation of the main body or simply a citation.
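To make the footnote step concrete, here is a rough sketch of one reading of it: explanatory footnotes are placed inline in parentheses, and citation-only footnotes are pulled out for the citation dataset. The `[fn1]` marker format, the tiny reporter list, the 15-word cutoff, and the function names are all assumptions for illustration, not how the real data is laid out.

```python
import re

# Assumption: footnote references appear in the body as [fn1], [fn2], ...
MARKER_RE = re.compile(r"\[fn(\d+)\]")
# Assumption: a very small sample of reporter abbreviations.
REPORTER_RE = re.compile(r"\b\d+ (?:U\.S\.|S\. ?Ct\.|F\.(?:2d|3d)?) \d+")


def is_citation_only(note, max_words=15):
    """Crude heuristic: a short note containing a reporter cite, or an Id./Ibid. note."""
    note = note.strip()
    if note.startswith(("Id.", "Ibid.")):
        return True
    return bool(REPORTER_RE.search(note)) and len(note.split()) <= max_words


def inline_footnotes(body, footnotes):
    """Return (new_body, extracted_citation_notes).

    footnotes maps footnote number -> footnote text.
    """
    extracted = []

    def replace_marker(match):
        note = footnotes.get(int(match.group(1)), "").strip()
        if not note:
            return ""            # unknown marker: just drop it
        if is_citation_only(note):
            extracted.append(note)
            return ""            # citation goes to the citation dataset
        return f" ({note})"      # explanation goes inline

    return MARKER_RE.sub(replace_marker, body), extracted
```

A long footnote that mixes explanation and citations would be inlined whole by this heuristic, so anything like it would need a second pass with the citation extractor.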

Hopefully all that will remain will be the facts, the procedural history, and the application of the law to the facts and procedures. I am very familiar with what is out there already, and I would not train a model on it, because, for example, asking a system trained that way "is abortion a constitutional right" is going to give a wrong answer, given the subsequent change in case law. Issues where the law has changed, or where a ruling has been overruled in whole or in part, will have to be dealt with by traversing the citations in the documents and looking for words that imply a case has been overruled, or by looking up the history of the legislation. However, I don't have a good strategy for how to modify or redact source material for opinions that have only been partially overruled or laws that have only been partially modified.
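As a starting point for the "look for words that imply they've been overruled" idea, something like the following could flag citations that appear near negative-treatment language. It is only a sketch: the keyword list, the character window, the U.S. Reports-only pattern, and the function name are assumptions, and it only flags partial overrulings without solving what to do with them.

```python
import re

# Assumption: only U.S. Reports citations ("410 U.S. 113") are recognized.
CITE_RE = re.compile(r"\b\d+ U\.S\. \d+")
# Assumption: a short, incomplete list of negative-treatment words.
NEGATIVE_RE = re.compile(r"\b(?:overrul\w*|abrogat\w*|supersed\w*)\b", re.IGNORECASE)
PARTIAL_RE = re.compile(r"\bin part\b", re.IGNORECASE)


def flag_negative_treatment(opinion_text, window=120):
    """Map each citation found near negative-treatment words to 'full' or 'partial'.

    'Near' means within `window` characters on either side of the citation,
    which will misfire when unrelated citations sit close together.
    """
    flags = {}
    for match in CITE_RE.finditer(opinion_text):
        start = max(0, match.start() - window)
        context = opinion_text[start:match.end() + window]
        if NEGATIVE_RE.search(context):
            flags[match.group(0)] = "partial" if PARTIAL_RE.search(context) else "full"
    return flags
```

Running this over every later opinion and collecting the flagged citations would give a list of candidate cases to exclude or review by hand. Phrases like "we decline to overrule" would be false positives, so the output is a review queue rather than a final answer.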


Thanks for the clarification! In that case, is there a scope for how many documents you plan to apply this to, mainly from a ‘computational limitations and budget’ standpoint? Also, this honestly sounds like a lot of work so I would be in favor of carving out a smaller chunk of this for the current round–but you’re the expert on how much time and effort it will take so I would defer to your opinion.

To your point about redacting source material–would you want a language model that could help label the entities to redact and then define a workflow using some NLP pipeline for keyword-based replacement?


Yeah, I am familiar with named entity recognition and the spaCy library, but for some of the more complicated stuff I might try some GPT-3 prompt engineering, or perhaps OPT or T5 if either is capable of some of the tasks.
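Whichever model ends up doing the entity labeling, the replacement step itself can stay simple. Here is a sketch that assumes the NER stage has already produced character spans for person names and that a name-to-role mapping exists for the case; `redact_names`, the span format, and the `[PERSON]` fallback are hypothetical choices, not part of any library.

```python
def redact_names(text, person_spans, roles, default="[PERSON]"):
    """Replace each (start, end) span with a role word.

    person_spans -- iterable of (start, end) character offsets, e.g. from an NER model
    roles        -- dict mapping the exact name string to a role like "the defendant"
    default      -- placeholder used when a name has no known role
    """
    redacted = text
    # Work from the end of the text backwards so earlier offsets stay valid.
    for start, end in sorted(person_spans, reverse=True):
        name = text[start:end]
        redacted = redacted[:start] + roles.get(name, default) + redacted[end:]
    return redacted
```

The hard part is building the `roles` mapping per case (who is the defendant, who is a witness), which is where prompt engineering with a larger model might help.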

I was initially going to start with the court opinions, but I don't know how much the computation budget would be for this undertaking.