sql - Indexing & alternatives for low-selectivity columns -


what range of tactics available selecting records on low selectivity columns?

an example might orders table where, on many years, build large number of completed orders need select active orders. order might go through lifecycle such placed, stock-allocated, picked warehouse, despatched customer, invoiced , paid. order might additionally cancelled, held, etc. majority of records in final state (e.g. paid) might need select, say, allocated orders. in case sequential read slow.

similar questions on indexing
mysql: low cardinality/selectivity columns = how index?
do indexes suck in sql?
what indexes , how can use them optimize queries in database?
defining indexes: columns, , performance impact?
, numerous others decreasingly related.

the approaches have read (in stackoverflow , elsewhere) include

  • use bitmap index
  • use partial index (create index x on t(c2) c1='a')
  • use clustered index?
  • don't index low selectivity columns, use sequential read
  • partition data (e.g. several tables identical schema)
  • use supplementary table (e.g. active_customers(customer_id)

my current dbms doesn't support first 3 options listed above , remainder seem problematic - there other commonly used approaches?

update: i've seen - index low-selectivity column, ever select high-selectivity values.

i agree unreason's however branch. there things know case.

this called skew , skew kills. perfect use partial index you'd exclude 95% of paid invoices , index more interesting , selective stats. don't have that. can horizontally partition rows separate table/partitions need account row migration (moving 1 status another) , that's expensive. dbms has perform update, delete , insert change status. if you're high volume system hurt.

forget said whether or not index based on selectivity because putting index on rapidly changing column bad idea. index have hot blocks step 1's being removed , step 2's being inserted , oh btw, step 2's being removed @ same time step 3's. won't scale well.

i recommend vertically partitioning status separate table(s).

your invoice table have pk , columns except status.

your status can handle 2 ways. table have pk value fk invoice table, status , timestamp when entered status. best horizontally partitioned table on status. you'll have partition each status possible. finding or 1 "placed" status partition prune , read partition needs - small number of blocks. because row narrow, might 400 invoice statuses on single block. looking status of 1 invoice easy since there's global index on pk.

if rdbms doesn't support partitioning row migration, you'll need manage these partitions tables , delete 1 , insert another. you'll encapsulate these movements in transaction in procedure, keep data clean. every invoice in 1 , 1 status table. harder part querying invoice id, you'll have check every table see is.

you have choice can either write paid statuses or not. if it's partitioned table, can delete invoice invoice status table when moves paid. (of course you'll write paid record history table mentioned in bonus material). you'll outer join status table , nulls mean paid. if never query paid status, there's no reason make fast query.

bonus material

in either case you'll want keep track of these movements in reporting table. everytime update status, you'll want write history table. you'll want analyze call transit times. what's average time filled paid, month? increasing result of bad economy? what's transit time placed filled, month. summer months take longer because of missing bodies on vacation? point. updating column you're losing answers, you'll need embed history log procedures.


Comments

Popular posts from this blog

android - Spacing between the stars of a rating bar? -

html - Instapaper-like algorithm -

c# - How to execute a particular part of code asynchronously in a class -