How data science might influence and change the practice of law
One of the themes from this year’s ILTACON was the rising use of data in the legal profession. There were a few talks on topics related to data science, from the use of data to improve the internal efficiency of law firms to the use of neural networks to perform tasks that were impossible in the past.
The general conversations around the use of data fell into two camps:
the use of data to improve internal efficiencies
the use of data to improve external work products
The legal profession has become better at understanding what data science can do and where it can apply to the practice of law, but I felt there was still an air of mystique around how lawyers talk about data science. Some still spoke of data science as black-box magic that can answer difficult questions with a wave of a white-gloved hand.
We were fortunate to have a conversation with the very humble Aquib Javed Khan a few months ago. Aquib is a senior data scientist who recently joined a legal tech company in Germany and has started applying his wealth of experience to the legal profession. In our conversation, Aquib offered his view on the following topics:
Scope for increasing automation in the legal profession
Lack of data is a problem for the legal profession
Data capture versus data extraction
Start your data science journey by identifying the problem
Creating a dataset versus creating a model
Difference between data scientist and data engineer
How he approaches solving data science problems
Trying to solve machine learning problems without using machine learning techniques
Aquib’s big tip for how to get started on solving problems with data science
Take a look at the video of our conversation below.
Aquib, welcome
0:01
we have the privilege of talking to
0:04
Aquib Javed Khan who is a data scientist
0:07
living in Germany who has just recently
0:09
started working for a legal tech company
0:12
um Aquib welcome
0:14
it's great talking to you again that's
0:16
it it's always a pleasure to chat um we
0:20
met in some really interesting
0:21
circumstances we were looking for
0:22
someone at the time to help us optimize
0:24
our vector search
0:26
um engine and and we met and I was like
0:28
well we get on really really well um
0:31
this there's something going on here
0:32
there's a little bit of chemistry
0:34
um so I'm very glad we've kept in touch
0:36
and we've you know kept talking yeah but
0:39
when we talk and when we looked into the
0:41
problems you're trying to solve and
0:43
that makes me like more curious because
0:46
uh I have been like working in search uh
0:50
or like kind of a similar kind of search
0:52
Project in my previous uh companies but
0:55
this was a completely different like a
0:58
whole legal space like it's uh because I
1:01
have no idea of like how these contracts
1:03
are made what is the context and how
1:04
there's so many information in that
1:06
document in the legal things it's like
1:09
still now also it for me to understand a
1:12
legal document is very difficult so
1:14
you're not the only one right like I
1:16
mean even lawyers struggle to understand
1:18
contracts sometimes and and that is the
1:20
whole basis for why there's so much
1:22
litigation where people disagree about
1:24
the interpretation of contracts
1:27
now that you've started at a legal tech
1:29
company
1:30
what are you finding what what what's
1:33
your kind of impression of
1:36
where the legal industry is and how far
1:39
it can progress if we can apply data
1:42
science to it or apply the right sorts
1:44
of data science to it
1:46
uh data science
1:47
usually what we do so this is my
1:50
personal opinion actually I see data
1:52
science as like uh in any industry I see
1:54
uh the data science has like it can do
1:57
two two kinds of work the first is like
1:59
uh like in any business like you have to
2:02
want Revenue like you have to more
2:05
important Revenue so there are very few
2:08
Industries which I said and that's my
2:09
personal opinion uh I might be wrong a
2:12
very few industry where data science is
2:13
actually bringing up like bringing up
2:16
those values you know or like creating
2:17
revenues but most of the others which I
2:20
see is mostly like creating features in
2:23
in their own you know existing infrastructure
2:27
or the product and and they want to like
2:30
engage more and more uh people uh so
2:33
that they can engage from that feature
2:34
and they're trying to basically earn
2:37
some kind of revenue from there I mean
2:38
of course if we just uh like leave
2:41
recommendation and the search apart from
2:43
like any of the any product because of
2:45
course those are
2:47
um a very direct uh impact on the like
2:49
see if I talk about the e-commerce like
2:51
CTR or like CVR those are those have
2:53
like a direct uh let's say impact on
2:55
that but so either you're creating the
2:58
feature or like you are earning like
2:59
really uh like this analytics uh
3:02
whatever you are doing from data science
3:03
you are actually getting some revenue or
3:05
like direct Revenue uh so so far I'd say
3:08
like I've seen the financial
3:09
organization are quite quite good
3:11
actually in this I mean the second one
3:13
like how to use analytics and basically
3:15
generate Revenue uh and and so from my
3:19
experience in e-commerce what we did
3:21
like we actually try to build a lot of
3:24
good features that really looks good
3:26
sounds good it's interesting it can
3:29
engage people uh but the direct impact
3:32
on Revenue I I think I have not seen
3:34
much like as compared to the you know
3:36
other and now coming to Legal space what
3:39
I see uh so I'm not very sure about like
3:41
how much it can impact the like revenue
3:43
or the other stuff but uh directly but I
3:46
see the how the things we are trying to
3:49
automate I mean so right now it's more
3:50
about automation automating a lot of
3:52
stuff because uh in legal tech lawyers like
3:55
they read these documents like having
3:57
40 30 pages like that you know sometimes
3:59
this count can go
4:01
quite huge and there are people like in
4:04
the let's say people in the office like
4:06
who their job is to manually extract few
4:08
things uh from there these are really
4:10
very uh time taking uh job and it could
4:13
be frustrating as well I don't want to
4:15
use that word but it's just like that it
4:17
could be you know if you have to go 1000
4:19
documents in a 10 days you have to
4:21
figure out something some kind of uh
4:23
entities you have to figure out from
4:24
there I mean that could be a bit let's
4:27
say frustrating but so here
4:32
so so here like we can actually help
4:36
those people you know so I I always
4:38
believe like this uh machine learning or
4:40
like AI or data science whatever you call
4:42
it it's always we'll go parallely
4:44
together you know so we try to help so
4:47
that they can uh reduce so that we can
4:49
reduce some kind of like work from there
4:51
so they can do something else you know
4:52
manually or like uh we can make the
4:54
process faster so that uh these
4:56
algorithm will do the things and you'll
4:57
just have to verify whether they are
4:58
correct or not I mean um once you get
5:00
the solution and then looking to verify
5:02
it uh it's easy unless this is an NP-hard
5:05
problem
5:06
so so yeah so there we can actually help
5:11
so here I see there's a lot of scope in
5:13
automation you know and since these are
5:16
uh so legal uh space or like the similar
5:20
to this space they have lots of
5:22
processes actually and these processes
5:24
are I also observe uh that the the con
5:27
controllable parameters which I use like
5:30
if you have an e-commerce you have kind
5:33
of almost everything in your hand right
5:35
customer goes there you have the session
5:37
you have where he has click and
5:38
everything everything you can control
5:39
everything
5:40
but in this kind of space you don't have
5:43
all the control because the data will
5:45
come from something else it is not your
5:46
generator I mean e-commerce you get the
5:48
data it is generated from your own
5:50
website in a much more structured way I
5:53
would say but here it's like kind of
5:55
like no the structure but no structure I
5:58
mean you get for the most of the
5:59
document you can say I uh we find some
6:01
kind of a structure but not
6:04
you can't generalize those you know and
6:06
that is actually very uh very problem
6:08
very problematic you know it it it is
6:11
huge so I see this scope like here we
6:14
can really help and I believe this field
6:16
will definitely
6:18
um going to create a very good uh impact
6:20
on this like on this tiny uh automation
6:23
part you know where a machine can take
6:26
the decision you know instead of a human
6:28
so I think this part in this space will
6:31
go huge yeah that's still now what I
6:34
have understand it's it's such an
6:36
interesting thing because you're you're
6:37
talking about different stages of
6:39
adoption right like like depending on
6:42
where someone is in their technology
6:45
journey and their technology readiness
6:47
they're gonna want a different type of
6:50
data science tool to help them get to
6:52
the next stage and and what you're
6:55
talking about is in the very beginning
6:56
where you have all of this unstructured
6:58
documents that sit somewhere and until
7:00
you've turned that into useful
7:02
consumable data
7:04
there's not much you can do
7:07
um
7:07
and I think I think legal suffers from a
7:10
big problem where we have never as
7:13
lawyers have never really
7:15
tried to capture data in a structured
7:18
manner before
7:20
and because of that any change that we
7:24
try to we the legal tech industry try to
7:26
impose is going to be resisted
7:29
um so you can either design a really
7:33
slick solution that allows you to
7:35
capture the right data or you can design
7:37
a really clever solution that lets you
7:40
extract data from what already exists
7:42
and what the current behavior is
7:45
um which camp do you fall into or do you
7:47
think both of these things should be
7:49
done
7:50
uh I'd say one common thing is the
7:52
information extraction which is uh I
7:55
would say in any kind of document or
7:56
processing document processing process
8:00
you know where you you really need
8:02
something that is information extraction
8:04
is very important because uh apart from
8:07
that I think yeah I just like leave my
8:11
answer to that part like it really
8:13
depends on the what kind of solution we
8:15
are trying to provide you know I I love
8:17
that basically you've turned my question
8:19
around and say well what's the problem
8:20
Zone
8:22
um
8:24
that's the right approach right and and
8:26
I was interviewing um I was interviewing
8:28
a potential engineer joining our team
8:30
and I was saying to her okay here's the
8:33
problem can you solve it for me and she
8:35
gave me this answer and then I was like
8:37
wait but you don't know what I want but
8:41
you don't know what what I want to do
8:42
with that data how can you give me an
8:43
answer
8:44
I have also faced one challenge in the
8:46
industry is like um earlier like
8:48
whenever something starts we have no
8:50
idea like how we are going to integrate
8:52
data science or like a machine learning
8:54
thing or like anything putting in the
8:56
pipeline which is already running but
8:58
there we Face a really lot of problem
9:00
like problem like uh how how to have our
9:03
own like structured data structures to
9:05
store things what data we need what not
9:07
we you know so filtering these are
9:09
things so for my right now in this like
9:12
in my work in data science I would say 60
9:15
to 80 percent is more about creating the data
9:17
set connecting all the
9:20
points data points of the data databases
9:22
and you do like a lot of joining other
9:24
things to basically you know to get one
9:26
data set and then like creating a model
9:28
or something that's I would say pretty
9:30
easy uh nowadays with the so many
9:33
libraries you can use
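Editor's note: the dataset-building step Aquib describes, connecting records from separate databases into one dataset before any modelling, can be sketched in a few lines. This is a minimal illustration only. The record layout, the field names (matter_id, client, pages) and the build_dataset helper are all hypothetical and do not come from the conversation.

```python
# Hypothetical records standing in for two separate databases.
matters = [
    {"matter_id": 1, "client": "Acme GmbH"},
    {"matter_id": 2, "client": "Beta AG"},
]
documents = [
    {"doc_id": "d1", "matter_id": 1, "pages": 32},
    {"doc_id": "d2", "matter_id": 1, "pages": 41},
    {"doc_id": "d3", "matter_id": 2, "pages": 12},
    {"doc_id": "d4", "matter_id": 3, "pages": 7},  # no matching matter
]


def build_dataset(matters, documents):
    """Inner-join documents to matters on matter_id, one row per document."""
    by_id = {m["matter_id"]: m for m in matters}
    rows = []
    for doc in documents:
        matter = by_id.get(doc["matter_id"])
        if matter is None:
            continue  # skip documents that cannot be linked to a matter
        rows.append({**doc, "client": matter["client"]})
    return rows


dataset = build_dataset(matters, documents)
```

Even in this toy case, one document has to be dropped because it cannot be linked, which hints at why the joining and cleaning work takes most of the time.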
9:35
um you've just said something really
9:36
interesting which we didn't record and
9:38
I'm going to repeat it and then I want
9:40
to dive into this point which is I
9:42
described you as a data engineer and
9:45
then you came back and said no no no no
9:46
no I'm a data scientist yeah
9:50
what's the difference in your mind
9:52
between data engineer and data
9:53
scientists
9:54
um yeah I think I can explain this um so
9:57
basically data engineering people they take
9:59
care of the all the data pipelines uh
10:00
the flow of the data in the local
10:02
organization uh so all the pipelines and
10:05
everything they uh basically build
10:07
um all the building blocks of uh either
10:09
data engineering they take care of
10:10
everything like how we get the data and
10:13
how the flow will be what will be the
10:15
output and everything and in between
10:17
somewhere there we utilize at some stage
10:20
at some point of time we utilize some
10:22
data and we try to uh we try to build
10:25
model we try to basically automate a lot
10:26
of things so we see in that process like
10:29
where we can actually automate uh those
10:32
things you know which is like currently
10:34
manually doing
10:36
um or like which takes a lot of time so
10:39
we try to automate that using ml so I
10:42
would say um first data engineering and
10:44
then the data science because until
10:46
unless you have the flow of data in the
10:48
organization uh you really collect the
10:50
the data which is really like required I
10:53
mean these things you have to identify
10:54
because they like I think in data
10:56
engineering people mostly they collect
11:00
all the data I mean I have seen like
11:01
whatever they can they just collect it
11:04
and in that way you have to understand
11:06
the business like as a data scientist
11:08
you have to look into like business and
11:10
right right so so in your mind the
11:13
difference between a data engineer and a
11:15
data scientist is the difference between
11:18
someone who is essentially building a
11:20
plumbing versus someone who's analyzing
11:23
the flow in the plumbing to go how can
11:25
we do this better
11:27
I I agree with you uh for the most of
11:30
the part uh but I think the uh where we
11:33
try to analyze the data but we uh
11:35
actually as like me I don't have any
11:38
like much idea on the data engineering
11:39
so how we can make those process faster
11:41
so there I won't be able to contribute
11:43
much but those people who are working in
11:46
that they are pretty fluent in that work
11:48
so they you know they understand where
11:51
the problem is but like my work is
11:52
mostly uh talk to the business people or
11:55
like my lead uh so from there we'll get
11:57
the project and we try to figure out
11:59
okay how we can basically get that data
12:02
in order to achieve some kind of a goal
12:05
uh could be like reducing manual work or
12:07
like could be generating some kind of a
12:09
feature or anything that could be useful
12:11
for our clients so there it's it's
12:14
mainly our work and uh I would also like
12:16
to say that once we build the model and
12:19
we have to like now productionize it uh
12:21
also I have a very less experience in
12:23
that I mean I can do I can do I can
12:25
create the apis and all I I know that
12:27
but I won't compare myself uh it will be
12:29
it will not go to that to compare like
12:32
you know if for some data engineering
12:34
guy or someone else is doing like taking
12:35
care of the scalability and
12:36
productionizing of model I wouldn't say
12:38
I'm that good with respect to them but
12:41
yeah so
12:43
um I'll open the window it's so hot yeah
12:46
there's a heat wave going on in in
12:48
Europe at the moment so hopefully you're
12:50
staying okay
12:51
um someone was telling me and this goes
12:53
back to the engineering point and we're
12:55
talking about the UK how in the UK this
12:58
heat wave has been so devastating
13:00
because it's not a country where the
13:02
architecture and the buildings have been
13:04
designed to tolerate that level of heat
13:06
yes so there's no air conditioning
13:09
there's
13:10
um yeah you just have fans not not even
13:13
a ceiling fan here because the height is
13:15
so you know
13:19
and that goes back to the design
13:22
question yeah if you're not designing a
13:25
framework to handle the situation and
13:28
the circumstances you're in you're going
13:30
to struggle down the track
13:32
and I would also say that in uh for me
13:35
I'm not that a great design person I
13:37
would say that but uh I'm a go I'm good
13:41
at like uh finding different solutions I
13:44
can say that like uh for one problem so
13:47
I don't like stick with like so of
13:50
course I started with very simple thing
13:52
you know um the first rule is like
13:54
solving any machine learning problem if
13:56
you can solve without that you know how
13:58
we can solve that one uh like using or
14:02
anything if it really depends on the
14:03
case to case but I always start with a
14:05
very simple algorithm so very simple
14:07
models because if that works at least I
14:10
have a baseline to move forward and then
14:12
just going into the other other big
14:14
things you know nowadays big transformer
14:16
big language models especially in the
14:18
NLP you just put it it will give you
14:20
something
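Editor's note: as a rough illustration of the "simple baseline first, without machine learning" idea, here is a rule-based extractor that pulls dates out of contract text with a regular expression and scores itself against hand-labelled answers. It is a sketch under stated assumptions, not Aquib's method. The date format, the sample sentence and the function names (extract_dates, recall) are all hypothetical.

```python
import re

# Assumption: dates are written as "D Month YYYY", e.g. "3 March 2021".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (\d{4})\b"
)


def extract_dates(text):
    """Return every 'D Month YYYY' date found in the text, in order."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


def recall(predicted, expected):
    """Share of expected items that the baseline found."""
    if not expected:
        return 1.0
    found = sum(1 for item in expected if item in predicted)
    return found / len(expected)


# Made-up sample sentence with hand-labelled expected dates.
sample = (
    "This Agreement is made on 3 March 2021 and terminates on "
    "31 December 2023 unless renewed in writing."
)
baseline = extract_dates(sample)
score = recall(baseline, ["3 March 2021", "31 December 2023"])
```

A score like this gives a number to beat. A larger model is only worth its cost if it clearly improves on the baseline on real documents.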
14:22
you know it's so interesting when we
14:24
started working on on our stuff we were
14:27
talking to these like PhDs in data
14:29
science and and one of the PhDs Ben
14:32
Wilson kept asking me
14:33
yeah but do you need a model I mean do
14:37
lawyers actually need a language model
14:40
um and and yes they do but for specific
14:43
reasons right like depends on your
14:45
problem
14:46
um but it's really interesting that you
14:47
say hey
14:49
if the problem can be solved in a simple
14:51
way you use a simple simple approach
14:53
yeah and to kind of build on that
14:56
um
14:57
I was thinking two specific other ways
15:00
if if you were solving a data science
15:03
problem in any discipline legal whatever
15:07
I think there are three
15:10
um good starting points to overcome the
15:13
cold start problem one is
15:16
um you start with something something
15:18
simple like you say yeah second is you
15:21
talk to someone who's got experience and
15:24
you borrow from their experience that's
15:25
directly relevant
15:27
and third is you look at analogous
15:30
Industries and situations you go how did
15:32
they solve it yes um and I don't know if
15:35
there are others um but I think those
15:37
are three ways like that you can do to
15:40
uh solve the cold start problem
15:42
definitely this is actually this is the
15:45
way like whenever you start uh on
15:47
any cold start problem like when you have
15:49
to start uh definitely I also have uh in
15:53
me like kind of for confidence if I see
15:54
like someone has done it already for
15:56
this kind of a similar problem uh it
15:58
gives me some kind of like internal uh I
16:01
would say confidence or like something
16:03
that okay
16:05
this thing is proven you know what you
16:07
have to do is like just replicate on
16:08
your own data sets it should work and if
16:11
it doesn't work then it really then I
16:13
get very anxious like I have to look uh
16:15
into that yeah so every these all these
16:17
three steps I mean that's why I always
16:19
start with the simple thing because you
16:20
have a like kind of peace of mind that
16:21
at least you have some kind of Baseline
16:23
and then yeah obviously talk to you talk
16:26
to your teammates like because we we all
16:27
we do it and we really encourage like uh
16:30
talking in the team you know uh about
16:33
that like discussing okay I'm going to
16:35
do this one this is a new problem
16:37
um these are like uh things which I have
16:39
figured out to do can you just add some
16:41
points or like what are your ideas how
16:43
you're going to solve so basically
16:44
talking to other uh this really gives
16:47
like open
16:48
open to like more ideas like this
16:51
actually widen your horizon you know you
16:53
see okay yeah this can also be uh we can
16:56
try it out and then we select the again
16:58
the second thing is like which is the
17:00
second simplest model after this and
17:03
then you are going to that and but for
17:06
the third Point uh what happens with me
17:07
is like I actually first try to look at
17:10
try to find in Google okay is there any
17:12
similar problem I first look into that
17:14
that is my first instinct whenever I do
17:16
it then I find then I just keep it for
17:19
some time you know uh I can just leave
17:21
it first then I create a baseline
17:23
because for the Baseline I know what
17:24
things are going to be required because
17:25
that's a comfortable zone for me you
17:28
know you know I can create that and then
17:30
I'll jump to the third point you know
17:32
then taking the ideas from there and
17:34
then discussing uh then we just figured
17:36
so this is I would say my normal way to
17:39
work
17:40
um Aquib thank you so much for making
17:42
time to chat today
17:44
um I'm going to cut this video together
17:45
and then I'm going to send it to you for
17:47
approval it's been an absolute pleasure
17:49
thanks a lot for this it was it was nice
17:52
nice meeting and talking to you