How data science might influence and change the practice of law

One of the themes from this year’s ILTACON was the rising use of data in the legal profession. There were a few talks on topics related to data science, from the use of data to improve the internal efficiency of law firms to the use of neural networks to perform tasks that were impossible in the past.

The general conversations around the use of data fell into two camps:

  1. the use of data to improve internal efficiencies

  2. the use of data to improve external work products

The legal profession has become better at understanding what data science can do and where it can apply to the practice of law, but I felt there was still an air of mystique around how lawyers talk about data science. Some still spoke of data science as black-box magic that can answer difficult questions with a wave of a white-gloved hand.

We were fortunate to have had a conversation with the very humble Aquib Javed Khan a few months ago. Aquib is a senior data scientist who had recently joined a legal tech company in Germany, and started to apply his wealth of experience to the legal profession. In our conversation, Aquib offered his view on the following topics:

  • Scope for increasing automation in the legal profession

  • Lack of data is a problem for the legal profession

  • Data capture versus data extraction

  • Start your data science journey by identifying the problem

  • Creating a dataset versus creating a model

  • Difference between data scientist and data engineer

  • How he approaches solving data science problems

  • Trying to solve machine learning problems without using machine learning techniques

Aquib’s big tip for how to get started on solving problems with data science:

... I always start with the very simple algorithms, or the very simple models, because if that works, at least I have a baseline to move forward. And then, just going into the other big things nowadays, big data, transfer, big language models, especially in the NLP.
— Timestamp 14:05
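
To make that tip concrete, here is a minimal sketch, in Python, of what a baseline without any machine learning could look like for pulling dates and amounts out of a contract clause. It is only an illustration and not something Aquib or his team described: the two patterns, the function names `extract_entities` and `recall`, and the sample clause are all assumptions invented for this example, and real contracts would need far broader patterns.

```python
import re

# Hypothetical patterns, for illustration only: dates written like
# "3 March 2021" and amounts like "EUR 12,500.00".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
AMOUNT_PATTERN = re.compile(r"(?:EUR|USD|GBP) ?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def extract_entities(text):
    """Return the dates and amounts found by plain pattern matching."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "amounts": AMOUNT_PATTERN.findall(text),
    }


def recall(found, expected):
    """Share of the expected items that the extractor actually found."""
    if not expected:
        return 1.0
    return sum(1 for item in expected if item in found) / len(expected)


# Made-up sample clause; not taken from any real contract.
sample = (
    "This Agreement is made on 3 March 2021. The Buyer shall pay "
    "EUR 12,500.00 no later than 30 April 2021."
)
entities = extract_entities(sample)
print(entities)
```

Scoring a rule-based extractor like this against a small set of hand-checked documents, for example with `recall`, gives a number that any later model has to beat before its extra complexity is justified.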

Take a look at the video of our conversation below.


Aquib, welcome

0:01

Host: We have the privilege of talking to Aquib Javed Khan, a data scientist living in Germany who has just recently started working for a legal tech company. Aquib, welcome.

0:14

Aquib: It's great talking to you again.

0:16

Host: It's always a pleasure to chat. We met in some really interesting circumstances. We were looking for someone at the time to help us optimize our vector search engine, and we met and I was like, well, we get on really, really well, there's something going on here, there's a little bit of chemistry. So I'm very glad we've kept in touch and we've kept talking.

0:36

Aquib: Yeah. When we talked and looked into the problem you're trying to solve, that made me more curious, because I had worked on search, or a similar kind of search project, in my previous companies, but this was completely different, a whole legal space. I had no idea how these contracts are made, what the context is, and how there is so much information in a legal document. Even now, understanding a legal document is very difficult for me.

1:14

Host: You're not the only one, right? Even lawyers struggle to understand contracts sometimes, and that is the whole basis for why there's so much litigation where people disagree about the interpretation of contracts. Now that you've started at a legal tech company, what are you finding? What's your impression of where the legal industry is and how far it can progress if we can apply data science to it, or apply the right sorts of data science to it?

1:46

Aquib: This is my personal opinion: in any industry, I see data science as able to do two kinds of work. In any business you want revenue; revenue is what matters most. There are very few industries, and this is my personal opinion, I might be wrong, where data science is actually bringing in that value or creating revenue. Most of the others that I see are mostly creating features in their existing infrastructure or product. They want to engage more and more people with that feature, and they are basically trying to earn some kind of revenue from there. Of course, leave recommendation and search apart from any product, because those have a very direct impact. If I talk about e-commerce, CTR or CVR, those have a direct impact. So either you're creating a feature, or, through analytics or whatever you are doing with data science, you are actually getting direct revenue.

3:05

Aquib: So far I've seen that financial organizations are quite good at the second one, how to use analytics and basically generate revenue. From my experience in e-commerce, we actually tried to build a lot of good features that look good, sound good, are interesting, and can engage people, but I think I have not seen much direct impact on revenue compared to the other.

3:36

Aquib: Now, coming to the legal space: I'm not very sure how much it can directly impact revenue or the other stuff, but I see the things we are trying to automate. Right now it's more about automation, automating a lot of stuff, because lawyers read these documents of 30 or 40 pages, and sometimes the page count can get quite huge. And there are people in the office whose job is to manually extract a few things from them. This is a really time-consuming job, and it could be frustrating as well. I don't want to use that word, but it's just like that. If you have to go through 1,000 documents in 10 days and figure out some kind of entities from them, that could be a bit, let's say, frustrating.

4:32

Aquib: So here we can actually help those people. I always believe that machine learning, or AI, or data science, whatever you call it, will always go in parallel with the people doing the work. We try to help so that we can take some of that work away and they can do something else manually, or we can make the process faster, so that the algorithms do the work and you just have to verify whether the results are correct. Once you have a solution, verifying it is easy, unless it is an NP-hard problem.

5:06

Aquib: So yeah, there we can actually help. I see there's a lot of scope in automation. The legal space, and spaces similar to it, have lots of processes. I also observe a difference in the controllable parameters. In e-commerce you have almost everything in your hands: the customer goes there, you have the session, you have where he has clicked, everything. You can control everything. But in this kind of space you don't have all the control, because the data comes from somewhere else; it is not generated by you. In e-commerce the data is generated from your own website, in a much more structured way. Here there is kind of a structure, but no structure. For most documents you can find some kind of structure, but you can't generalize it, and that is very problematic. It is huge. So I see this scope where we can really help, and I believe this field is definitely going to create a very good impact on this tiny automation part, where a machine can take the decision instead of a human. I think this part of the space will go huge. That's what I have understood so far.

6:34

Host: It's such an interesting thing, because you're talking about different stages of adoption, right? Depending on where someone is in their technology journey and their technology readiness, they're going to want a different type of data science tool to help them get to the next stage. What you're talking about is the very beginning, where you have all of these unstructured documents that sit somewhere, and until you've turned that into useful, consumable data, there's not much you can do.

7:07

Host: And I think legal suffers from a big problem where we, as lawyers, have never really tried to capture data in a structured manner before. Because of that, any change that we, the legal tech industry, try to impose is going to be resisted. So you can either design a really slick solution that allows you to capture the right data, or you can design a really clever solution that lets you extract data from what already exists and what the current behavior is. Which camp do you fall into, or do you think both of these things should be done?

7:50

Aquib: I'd say one common thing is information extraction. In any kind of document processing, information extraction is very important. Apart from that, I'll just leave my answer at this: it really depends on what kind of solution we are trying to provide.

8:15

Host: I love that you've basically turned my question around and said, well, what's the problem? That's the right approach, right? I was interviewing a potential engineer joining our team, and I was saying to her, "Okay, here's the problem, can you solve it for me?" She gave me this answer, and then I was like, "Wait, but you don't know what I want to do with that data. How can you give me an answer?"

8:44

Aquib: I have also faced one challenge in the industry. Early on, whenever something starts, we have no idea how we are going to integrate data science, or a machine learning component, or anything else into a pipeline that is already running. There we face a lot of problems: how to have our own structured data structures to store things, what data we need and what we don't, filtering. In my work in data science right now, I would say 60 to 80 percent is about creating the dataset: connecting all the data points from the databases and doing a lot of joins and other things to get one dataset. Creating a model after that is, I would say, pretty easy nowadays, with so many libraries you can use.

9:35

Host: You've just said something really interesting which we didn't record, and I'm going to repeat it and then I want to dive into this point. I described you as a data engineer, and then you came back and said, "No, no, no, I'm a data scientist."

9:46

Aquib: Yeah.

9:50

Host: What's the difference in your mind between a data engineer and a data scientist?

9:54

Aquib: Yeah, I think I can explain this. Basically, data engineering people take care of all the data pipelines, the flow of data within the organization. They build all the building blocks of data engineering and take care of everything: how we get the data, how the flow will be, what the output will be. Somewhere in between, at some stage, we use some of that data and try to build a model. We try to automate a lot of things: we look in that process for where we can automate things that are currently done manually or that take a lot of time, and we try to automate them using ML. So I would say data engineering comes first and then data science, because you need the flow of data in the organization, and you need to collect the data that is really required. These things you have to identify, because data engineering people mostly collect all the data; I have seen that whatever they can collect, they just collect. So as a data scientist you have to understand the business; you have to look into the business.

11:10

Host: Right. So in your mind the difference between a data engineer and a data scientist is the difference between someone who is essentially building the plumbing and someone who's analyzing the flow in the plumbing to go, "How can we do this better?"

11:27

Aquib: I agree with you for the most part. We do try to analyze the data, but someone like me doesn't have much idea about data engineering, so I won't be able to contribute much on how to make those processes faster. The people who work on that are pretty fluent in it, and they understand where the problem is. My work is mostly to talk to the business people or my lead. From there we get the project, and we try to figure out how we can get the data in order to achieve some kind of goal. It could be reducing manual work, or generating some kind of feature, or anything that could be useful for our clients. That is mainly our work. I would also like to say that once we build the model and have to productionize it, I have very little experience in that as well. I can create the APIs and all, I know that, but I wouldn't compare myself with a data engineer or someone else who takes care of the scalability and productionizing of a model. I wouldn't say I'm that good compared to them.

12:43

Aquib: I'll open the window, it's so hot.

Host: Yeah, there's a heat wave going on in Europe at the moment, so hopefully you're staying okay. Someone was telling me, and this goes back to the engineering point, we were talking about the UK, how in the UK this heat wave has been so devastating because it's not a country where the architecture and the buildings have been designed to tolerate that level of heat.

13:06

Aquib: Yes.

Host: So there's no air conditioning.

13:10

Aquib: Yeah, you just have fans. There's not even a ceiling fan here, because of the height, you know.

13:19

Host: And that goes back to the design question. If you're not designing a framework to handle the situation and the circumstances you're in, you're going to struggle down the track.

13:32

Aquib: I would also say that I'm not that great a design person, but I'm good at finding different solutions for one problem. I don't stick with one. Of course, I start with a very simple thing. The first rule of solving any machine learning problem is to ask whether you can solve it without machine learning. It really depends on the case, but I always start with a very simple algorithm or very simple models, because if that works, at least I have a baseline to move forward from. And then I go into the other big things nowadays, big data, transfer, big language models, especially in NLP. You just put it in and it will give you something.

14:22

Host: You know, it's so interesting. When we started working on our stuff we were talking to these PhDs in data science, and one of the PhDs, Ben Wilson, kept asking me, "Yeah, but do you need a model? Do the lawyers actually need a language model?" And yes, they do, but for specific reasons, right? It depends on your problem. But it's really interesting that you say, hey, if the problem can be solved in a simple way, you use a simple approach. To build on that, I was thinking of two specific other ways. If you were solving a data science problem in any discipline, legal, whatever, I think there are three good starting points to overcome the cold start problem. One is you start with something simple, like you say. Second is you talk to someone who's got experience and you borrow from their experience that's directly relevant. And third is you look at analogous industries and situations and you go, "How did they solve it?" I don't know if there are others, but I think those are three ways you can solve the cold start problem.

15:42

Aquib: Definitely. This is the way whenever you start on a cold start problem. I also get a kind of confidence if I see that someone has already done it for a similar problem. It gives me some kind of internal confidence that, okay, this thing is proven, and what you have to do is just replicate it on your own datasets, and it should work. If it doesn't work, then I get very anxious and I have to look into it. So, on these three steps: that's why I always start with the simple thing, because you have a kind of peace of mind that at least you have some kind of baseline.

16:23

Aquib: And then, obviously, talk to your teammates. We all do it, and we really encourage talking in the team, discussing: "I'm going to do this one, this is a new problem, these are the things I have figured out to do. Can you add some points? What are your ideas? How would you solve it?" Talking to others really opens you up to more ideas. It widens your horizon. You see, okay, this is also something we can try out. Then the next thing is to pick the second simplest model after this, and you go to that. For the third point, what happens with me is that I actually first try to search on Google for whether there is any similar problem. That is my first instinct whenever I start. When I find something, I keep it aside for some time. First I create a baseline, because for the baseline I know what things are going to be required; that's a comfortable zone for me. Then I jump to the third point, taking the ideas from there, and then discussing, and then we figure it out. This is, I would say, my normal way to work.

17:40

Host: Aquib, thank you so much for making time to chat today. I'm going to cut this video together and then I'm going to send it to you for approval. It's been an absolute pleasure.

17:49

Aquib: Thanks a lot for this. It was nice meeting and talking to you.

Transcript of conversation
