1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

SQLite database: many-to-many optimization (simple but large example) without IDs available?

Discussion in 'Programming/Internet' started by Alopex, Oct 8, 2018.

  1. Alopex

    Alopex Guest

    I'm a beginner looking for people who can think through these things more easily to see where I've messed up. I have a many-to-many table that I am trying to optimize for speed of creation, speed of query, and minimize size - but sometimes this feels like I'm asking for too much! Using the typical Students:Classes table as an analogy, my table has the following constraints:

    • Student Table: Student Names, Text, (6 billion unique entries)
    • Class Names Table: Class Names, Text (100,000 unique entries)
    • Lookup needs to be fast on Student Names so an index or primary key constraint needs to be used
    • Size needs to be manageable. Both examples below are fine space-wise but much larger would be a problem
    • Must use SQLite for this problem, but if the answer can be solved more easily with another DB system, I'd like to hear about it

    I've tried to implement a few optimizations, including the executemany function for insertions, beginning transactions and committing in large batches, and some PRAGMAS like synchronous = off and journal = DELETE, but they seem to have minimal impact.

    In most examples, a bridge table is used to relate a unique integer ID in Student Table (student_ID) to a unique (class_ID) which I implemented with the following pseudo code. Since I do not know the student IDs/class IDs ahead of time, I derive them from the rowIds:

    1: Unique constraints and bridge/junction table, selecting rowID for studentID/classID after they've been inserted

    CREATE TABLE StudentTable (StudentName TEXT UNIQUE, ID INTEGER PRIMARY KEY) # ID = alias to rowid
    CREATE TABLE StudentClasses (studentID INTEGER, classID INTEGER, PRIMARY KEY (studentID, classID) WITHOUT ROWID)

    (For each row):
    INSERT OR IGNORE INTO StudentTable (StudentName) VALUES (?) , Student_Name #All student names
    INSERT OR IGNORE INTO ClassTable (ClassName) VALUES (?) , Class_Name #all class names
    SELECT ID FROM StudentTable WHERE StudentName = "Student_Name" #multiple names in row sometimes, so I select all here
    SELECT ID FROM ClassTable WHERE ClassName = "Class_Name" #same as above
    INSERT INTO StudentClasses VALUES (?,?) , student_id, class_id #insert all relationships from rowid derived IDs

    6 billion student names
    100,000 class names
    8 billion student:class relationships in StudentClasses

    Run time: Very slow due to unique constraints on Student Names and Class Names (estimated to be ~1 week, aborted before completion)
    Size: Minimal (constrained to unique entries on text, and integers for relationships)

    2: Single table with index on Student Names

    CREATE TABLE StudentClasses (StudentName TEXT, ClassName TEXT)
    #Insert all relationships
    INSERT INTO StudentClasses (StudentName, ClassName) VALUES (?,?) Student,Class
    #After all insertions:
    CREATE INDEX StudentIndex ON StudentClasses(StudentName)

    8 billion relationships between student and classes
    Additional index on Student Names

    Run time: 8x faster than example 1 (~15 hours)
    Size: 1.5x larger than example 1 (additional text (?) created with indexing the text field)

    Aside from these two approaches, is there something I'm missing? In particular, I understand that a normalized relationship with the junction table being a series of integer IDs is the most space efficient, but it is so incredibly slow to construct (at least the way I'm doing it) that I have to avoid it.

    I'd like to hear about any better approaches you can think of. For my particular project, I'm stuck with SQLite, but if you know of other DB managers that are better at handling this situation, I'd like to learn about the alternatives.

    Thanks for any and all insight into this!

    Login To add answer/comment

Share This Page