There are approximately 3 million arrays - or Python lists / tuples (it does not really make a difference). Each array looks like this:
['string1', 'string2', 'string3', ...]  # 10,000 elements in total
These arrays should be stored in some type of key-value storage.
So there are 3 million keys, and each key maps to a 10,000-element array.
Lists, tuples, or any other custom structure - it does not really matter. What matters is that the arrays contain strings (UTF-8 or Unicode), each 5 to 50 characters long, and that there are about 3 million distinct strings across all of them. They could be converted to integer IDs if that is really necessary for efficiency, but I would prefer to work with strings.
Although it is difficult to give a full description of the data (it is complex and weird), it is similar to synonyms: say we have 3 million words as dict keys, and each word has 10k synonyms, which are the elements of its list.
Something like this (not actual synonyms, but it gives the idea):
{'computer': ['pc', 'mac', 'laptop', ...],           # (10k in total)
 'home':     ['building', 'cottage', 'tavern', ...],  # (another 10k)
 ...}
The list elements ('synonyms') can be kept sorted if that helps.
Later, after the arrays are populated, there is a loop: we go through the keys and check whether a given word is in a key's list. For example, the user inputs the words 'computer' and 'laptop', and we should answer immediately that 'laptop' is a synonym of 'computer'. The point is that we have to run this check millions of times, maybe 20 million or more. Just think of it as pairs of more or less random words coming from users - 'computer' and 'car', 'phone' and 'building', and so on. There are lots of them, and each pair may 'match' (one word is in the other's list) or not.
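To be concrete, the naive version of this check - a plain dict of sets - would look something like the sketch below (names like 'synonyms' and 'is_synonym' are just illustrative). It is fast per lookup, but 3 million keys times 10,000 strings clearly will not fit in 30 GB of RAM this way:

# Naive baseline: everything in RAM as a dict of sets.
synonyms = {
    'computer': {'pc', 'mac', 'laptop'},
    'home': {'building', 'cottage', 'tavern'},
}

def is_synonym(word, candidate):
    # Average O(1) membership test against the key's set.
    return candidate in synonyms.get(word, set())

print(is_synonym('computer', 'laptop'))  # True
print(is_synonym('computer', 'car'))     # False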
So, in short, I need to:
- store these data structures efficiently, and
- be able to check quickly whether something is in a given array or not.
Memory usage should stay under 30 GB, and all the iterations should run on a Xeon CPU in less than 10 hours.
An error rate of about 0.1% - both false positives and false negatives - is acceptable, although it would be better to reduce it or avoid errors entirely.
What is the best approach here? Algorithms, links to code, anything is really appreciated. Also, a friend of mine suggested using Bloom filters or marisa tries - is he right? I have not worked with either of them.
I would map each unique string to a numeric ID and associate a Bloom filter with each key, at approximately 20 bits per element for your 0.1% error rate. 20 bits * 10,000 elements * 3 million keys is 75 GB, so if you are space-limited, store a smaller, less accurate filter in memory and the more accurate filter on disk, consulting the disk filter only when the in-memory filter says there might be a match.
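A minimal sketch of that per-key Bloom filter idea, assuming a hand-rolled filter rather than any particular library (the class and variable names are illustrative, and it hashes the strings directly instead of going through numeric IDs):

import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, error_rate):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions.
        self.m = int(math.ceil(-n_items * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, int(round(self.m / n_items * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# One filter per key, sized for 10,000 elements at a 0.1% false-positive rate
# (about 14.4 bits per element with these formulas; 20 bits/element is a rounder,
# more conservative figure).
data = {'computer': ['pc', 'mac', 'laptop'], 'home': ['building', 'cottage']}
filters = {}
for word, values in data.items():
    bf = BloomFilter(n_items=10000, error_rate=0.001)
    for v in values:
        bf.add(v)
    filters[word] = bf

print('laptop' in filters['computer'])  # True
print('car' in filters['computer'])     # False, up to the false-positive rate

Note that a Bloom filter only produces false positives, never false negatives, which is actually stricter than the 0.1%-both-ways budget in the question.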
A Bloom filter needs only 1.44 · n · log2(1/ε) bits (equivalently, 1.44 · n · ln(1/ε) / ln 2), while the theoretical lower bound for any approximate-membership structure is n · log2(1/ε) bits. In your case ε = 0.001 per key, so the theoretical limit is a data structure of about 99,658 bits per key, or roughly 10 bits per element, which is 298,974,000,000 bits in total, or about 38 GB.
So 30 GB is below the theoretical limit for a data structure with the number of entries and the accuracy you want, but it is within the ballpark.
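For a quick sanity check of those figures, here is the arithmetic in Python (the variable names are mine):

import math

n = 10000         # elements per key
eps = 0.001       # allowed error rate
keys = 3_000_000  # number of keys

limit_per_key = n * math.log2(1 / eps)         # information-theoretic lower bound
bloom_per_key = 1.44 * n * math.log2(1 / eps)  # classic Bloom filter size

print(round(limit_per_key))            # 99658 bits per key (~10 bits per element)
print(limit_per_key * keys)            # ~2.99e11 bits in total
print(limit_per_key * keys / 8 / 1e9)  # ~37.4 GB, i.e. the ~38 GB figure above
print(bloom_per_key / n)               # ~14.4 bits per element for an actual Bloom filter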